PyTorch Lightning: a minimal example
This is a minimal PyTorch Lightning (PL) example that trains a small sequential model on MNIST, using only lightning.LightningModule and lightning.Trainer.
PL prints two warnings while fitting. The first is about the default checkpoint callback:
You are using the plain ModelCheckpoint callback. Consider using LitModelCheckpoint which with seamless uploading to Model registry.
and the second, because a val_dataloader is passed but no validation_step is defined:
You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
With no validation loop to decide which model is best, the checkpoint that gets saved is simply the one from the last epoch.
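If we wanted the val loop to run, the LitModel class below would only need a validation_step. A minimal sketch, not part of this example; logging "val_loss" would also give ModelCheckpoint a metric to track:

def validation_step(self, batch, batch_idx):
    x, y = batch
    logits = self(x)
    loss = F.cross_entropy(logits, y)
    self.log("val_loss", loss, prog_bar=True)  # a monitored metric for callbacks such as ModelCheckpoint
    return loss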
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split
from torchvision.datasets import MNIST
from torchvision import transforms
import lightning as L
print("Lightning version:", L.__version__)
print("GPU name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU available")
print("Torch version:", torch.__version__)
print("CUDA is available:", torch.cuda.is_available())
Lightning version: 2.5.1
GPU name: NVIDIA RTX A5000
Torch version: 2.6.0+cu124
CUDA is available: True
class LitModel(L.LightningModule):  # a replacement for nn.Module
    def __init__(self):
        super().__init__()  # initialise the LightningModule machinery before adding our own attributes
        self.model = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28*28, 128),
            nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)  # self.parameters() already includes self.model's parameters
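As an aside, configure_optimizers is not limited to returning a bare optimizer. A sketch of an optimizer/scheduler pair; the StepLR settings are arbitrary and only illustrate the shape of the return value, they are not used in the example above:

def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)  # arbitrary schedule
    return {"optimizer": optimizer, "lr_scheduler": scheduler}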
# Data
transform = transforms.ToTensor()
dataset = MNIST(root="./MNIST", download=True, train=True, transform=transform)
train_ds, val_ds = random_split(dataset, [55000, 5000])
train_loader = DataLoader(train_ds, batch_size=64)
val_loader = DataLoader(val_ds, batch_size=64)
100%|██████████| 9.91M/9.91M [00:00<00:00, 10.4MB/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 290kB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 2.62MB/s]
100%|██████████| 4.54k/4.54k [00:00<00:00, 8.96MB/s]
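Further down, the trainer warns that the DataLoaders have no worker processes. A possible variant of the data code, with a seeded split for reproducibility; num_workers=4 and seed 42 are assumptions, not requirements:

# Seed the split and give the loaders workers; shuffle only the training set.
train_ds, val_ds = random_split(dataset, [55000, 5000], generator=torch.Generator().manual_seed(42))
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=4)
val_loader = DataLoader(val_ds, batch_size=64, num_workers=4)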
# trainer
model = LitModel()
trainer = L.Trainer(max_epochs=3, accelerator="auto")  # "auto" selects the GPU when one is available
trainer.fit(model, train_loader, val_loader)
You are using the plain ModelCheckpoint callback. Consider using LitModelCheckpoint which with seamless uploading to Model registry.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/home/hell/Desktop/lightning/.venv/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:76: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
/home/hell/Desktop/lightning/.venv/lib/python3.12/site-packages/lightning/pytorch/trainer/configuration_validator.py:68: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
You are using a CUDA device ('NVIDIA RTX A5000') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params | Mode
---------------------------------------------
0 | model | Sequential | 101 K | train
---------------------------------------------
101 K Trainable params
0 Non-trainable params
101 K Total params
0.407 Total estimated model params size (MB)
5 Modules in train mode
0 Modules in eval mode
/home/hell/Desktop/lightning/.venv/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=19` in the `DataLoader` to improve performance.
Epoch 2: 100%|██████████| 860/860 [00:13<00:00, 65.77it/s, v_num=0]
`Trainer.fit` stopped: `max_epochs=3` reached.
Epoch 2: 100%|██████████| 860/860 [00:13<00:00, 65.73it/s, v_num=0]
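Two follow-ups suggested by the logs above, sketched rather than prescribed: the Tensor Core hint can be addressed with torch.set_float32_matmul_precision ("high" is one possible choice), and the last-epoch checkpoint can be reloaded through the path recorded by the checkpoint callback.

torch.set_float32_matmul_precision("high")  # trade a little float32 precision for Tensor Core throughput

ckpt_path = trainer.checkpoint_callback.best_model_path  # here this is just the last epoch, since nothing is monitored
model = LitModel.load_from_checkpoint(ckpt_path)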