PyTorch Lightning: A minimal example

This is minimal PyTorch Lightning (PL) code to train a sequential model on MNIST, using only lightning.LightningModule and lightning.Trainer.

  • PL prints two warnings during trainer.fit. First it suggests a different checkpoint callback:

    You are using the plain ModelCheckpoint callback. Consider using LitModelCheckpoint which with seamless uploading to Model registry.

    and later, because there is no validation step, it skips the val loop:

    You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.

  • The checkpoint is stored from the last epoch, since there is no validation metric to determine the best model. A minimal validation_step addressing this is sketched below.
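As a sketch of the fix (my addition, not in the original notebook, assuming the LitModel class defined below): a validation_step that logs a metric makes the val loop run, and a plain ModelCheckpoint monitoring that metric keeps the best checkpoint instead of the last one.

from lightning.pytorch.callbacks import ModelCheckpoint

class LitModelWithVal(LitModel):  # LitModel is defined below
   def validation_step(self, batch, batch_idx):
      x, y = batch
      loss = F.cross_entropy(self(x), y)
      self.log("val_loss", loss)  # averaged over the epoch during validation

# Keep the checkpoint with the lowest val_loss rather than the one from the last epoch.
ckpt_cb = ModelCheckpoint(monitor="val_loss", mode="min")
# trainer = L.Trainer(max_epochs=3, accelerator="auto", callbacks=[ckpt_cb])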

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split
from torchvision.datasets import MNIST
from torchvision import transforms
import lightning as L
print("Lightning version:", L.__version__)
print("GPU name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU available")
print("Torch version:", torch.__version__)
print("CUDA is available:", torch.cuda.is_available())
Lightning version: 2.5.1
GPU name: NVIDIA RTX A5000
Torch version: 2.6.0+cu124
CUDA is available: True
class LitModel(L.LightningModule): # a replacement for nn.Module
   def __init__(self):
      super().__init__() # initialize the LightningModule internals before defining our own layers
      self.model = nn.Sequential(
         nn.Flatten(),
         nn.Linear(28*28, 128),
         nn.ReLU(),
         nn.Linear(128, 10)
      )
   
   def forward(self, x):
      return self.model(x)
   
   def training_step(self, batch, batch_idx):
      x, y = batch
      logits = self(x)
      loss = F.cross_entropy(logits, y)
      self.log("train_loss", loss)
      return loss

   def configure_optimizers(self):
      return torch.optim.Adam(self.parameters(), lr=1e-3) # self.parameters() already covers self.model, so self.model.parameters() is unnecessary
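A quick sanity check of the forward pass (my addition, not part of the original run): a fake batch of four MNIST-shaped images should come out as four 10-way logit vectors.

m = LitModel()
out = m(torch.randn(4, 1, 28, 28))  # 4 grayscale 28x28 "images"
print(out.shape)  # torch.Size([4, 10]) -- one logit per digit class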
# Data
transform = transforms.ToTensor()
dataset = MNIST(root="./MNIST", download=True, train=True, transform=transform)
train_ds, val_ds = random_split(dataset, [55000, 5000])  # 55k train / 5k val
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)  # shuffle the training set each epoch
val_loader = DataLoader(val_ds, batch_size=64)
100%|██████████| 9.91M/9.91M [00:00<00:00, 10.4MB/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 290kB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 2.62MB/s]
100%|██████████| 4.54k/4.54k [00:00<00:00, 8.96MB/s]
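The training run below will warn that the dataloaders spawn no worker processes. An optional variant (the right num_workers is machine-dependent; the warning on this machine suggests 19):

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=4)  # workers overlap data loading with training
val_loader = DataLoader(val_ds, batch_size=64, num_workers=4)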
# trainer
model = LitModel()
trainer = L.Trainer(max_epochs=3, accelerator="auto") # "auto" selects the GPU when one is available
trainer.fit(model, train_loader, val_loader)
You are using the plain ModelCheckpoint callback. Consider using LitModelCheckpoint which with seamless uploading to Model registry.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/home/hell/Desktop/lightning/.venv/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:76: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
/home/hell/Desktop/lightning/.venv/lib/python3.12/site-packages/lightning/pytorch/trainer/configuration_validator.py:68: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
You are using a CUDA device ('NVIDIA RTX A5000') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type       | Params | Mode 
---------------------------------------------
0 | model | Sequential | 101 K  | train
---------------------------------------------
101 K     Trainable params
0         Non-trainable params
101 K     Total params
0.407     Total estimated model params size (MB)
5         Modules in train mode
0         Modules in eval mode
/home/hell/Desktop/lightning/.venv/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=19` in the `DataLoader` to improve performance.
Epoch 2: 100%|██████████| 860/860 [00:13<00:00, 65.77it/s, v_num=0]
`Trainer.fit` stopped: `max_epochs=3` reached.
Epoch 2: 100%|██████████| 860/860 [00:13<00:00, 65.73it/s, v_num=0]
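To reuse the trained weights, reload the checkpoint. A sketch based on the run above: trainer.checkpoint_callback is the ModelCheckpoint that Lightning added by default, and since nothing is monitored, its best_model_path is simply the last checkpoint written.

ckpt_path = trainer.checkpoint_callback.best_model_path  # e.g. lightning_logs/version_0/checkpoints/epoch=2-step=2580.ckpt
model = LitModel.load_from_checkpoint(ckpt_path)
model.eval()  # switch to inference mode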