I wrote my code in TensorFlow and it trains and validates fine on a 16 GB T4 GPU, but after converting it to PyTorch, even a 22 GB L4 GPU is not enough for training. Does anyone have any idea what's going on?
Without any information it’s impossible to know what’s going on.
Hi,
Here is my training and validation setup:
“”"
device = torch.device(“cuda” if torch.cuda.is_available() else “cpu”)
Initialize model, optimizer, and scheduler
model = YOLOv3Small().to(device) # Move model to GPU
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = CyclicLR(optimizer, base_lr=1e-5, max_lr=1e-2, step_size_up=108, mode=‘triangular2’)
Initialize mixed-precision scaler
scaler = GradScaler()
num_epochs = 100
train_epoch_losses =
val_epoch_losses =
Train_loaders = [DataLoader(TensorDataset(Train_Images, Train_Labels), batch_size=batch_size, shuffle=True)]
Train_loaders = [DataLoader(TensorDataset(Val_Images, Val_Labels_1_s), batch_size=batch_size, shuffle=True)]
for epoch in range(2):
model.train()
total_train_loss = 0.0
num_train_batches = 0
for Train_loader in Train_loaders:
for images, labels in Train_loader:
images, labels = images.to(device), labels.to(device) # Ensure data is on the GPU
optimizer.zero_grad()
with autocast(): # Use mixed-precision
output = model(images)
conf_loss, ciou_loss = YoloLoss_s_pt(Y_true=labels, Y_Pred=output)
total_loss = conf_loss + ciou_loss
# Gradient accumulation (if applicable)
total_loss.backward()
if (num_train_batches + 1) % accumulation_steps == 0:
optimizer.step()
# Accumulate loss and batch count
total_train_loss += total_loss.item()
num_train_batches += 1
# Clear GPU cache to avoid fragmentation
torch.cuda.empty_cache()
avg_train_loss = total_train_loss / num_train_batches
train_epoch_losses.append(avg_train_loss)
model.eval()
total_val_losses = 0.0
num_val_batches = 0
with torch.no_grad():
for Val_loader in Valid_loaders:
for images, labels in Val_loader:
images, labels = images.to(device), labels.to(device) # Ensure data is on the GPU
with autocast(): # Use mixed-precision during validation as well
output = model(images)
conf_loss, ciou_loss = YoloLoss_s_pt(Y_true=labels, Y_Pred=output)
total_loss = conf_loss + ciou_loss
total_val_losses += total_loss.item()
num_val_batches += 1
avg_val_loss = total_val_losses / num_val_batches
val_epoch_losses.append(avg_val_loss)
# Save best model based on validation loss
if avg_val_loss == min(val_epoch_losses):
torch.save(model.state_dict(), "best_model_1st.pth")
print(f"Best Loss at epoch {epoch + 1} with valid loss: {avg_val_loss:4f}")
# Update learning rate
current_lr = scheduler.get_last_lr()[0]
print(f"Epoch {epoch + 1}/{num_epochs} - Train Loss: {avg_train_loss:4f}, Val Loss: {avg_val_loss}, LR: {current_lr:8f}")
# Update scheduler
scheduler.step()
# Clear the GPU cache at the end of each epoch
torch.cuda.empty_cache()
“”"
and this is the output:
```
:13: FutureWarning: torch.cuda.amp.GradScaler(args...) is deprecated. Please use torch.amp.GradScaler('cuda', args...) instead.
  scaler = GradScaler()
:30: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
  with autocast():  # Use mixed-precision

OutOfMemoryError                          Traceback (most recent call last)
in <cell line: 0>()
     29
     30 with autocast():  # Use mixed-precision
---> 31     output = model(images)
     32     conf_loss, ciou_loss = YoloLoss_s_pt(Y_true=labels, Y_Pred=output)
     33     total_loss = conf_loss + ciou_loss

12 frames
/usr/local/lib/python3.11/dist-packages/torch/nn/functional.py in leaky_relu(input, negative_slope, inplace)
   1900         result = torch._C._nn.leaky_relu_(input, negative_slope)
   1901     else:
-> 1902         result = torch._C._nn.leaky_relu(input, negative_slope)
   1903     return result
   1904

OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 49.06 MiB is free. Process 34100 has 14.70 GiB memory in use. Of the allocated memory 14.54 GiB is allocated by PyTorch, and 36.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (CUDA semantics — PyTorch 2.5 documentation)
```
Hello,
It sounds like the conversion from TensorFlow to PyTorch might have introduced additional memory requirements. PyTorch and TensorFlow handle memory management differently, which can lead to increased memory usage.
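If you want to see where the memory is going on the PyTorch side, here is a quick sketch using the standard torch.cuda inspection helpers; run it right before the step that fails:

```python
import torch

# How much memory the caching allocator has actually handed out vs. reserved
print(f"{torch.cuda.memory_allocated() / 1024**3:.2f} GiB allocated")
print(f"{torch.cuda.memory_reserved() / 1024**3:.2f} GiB reserved")

# Full per-pool breakdown from the allocator
print(torch.cuda.memory_summary())
```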
To address this, you can try the following (see the sketch after the list):
- Optimizing your PyTorch code: Ensure you’re using efficient data structures and operations.
- Reducing batch size: Smaller batches require less memory.
- Using mixed-precision training: This can reduce memory usage while maintaining performance.
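As a concrete illustration of the last two points, here is a minimal sketch of a training step that combines a smaller batch size, gradient accumulation, and the non-deprecated torch.amp API that the FutureWarnings in your output recommend. It reuses the names from your snippet (model, Train_Images, Train_Labels, YoloLoss_s_pt); the batch_size and accumulation_steps values are placeholders to experiment with, not a definitive fix:

```python
import torch
from torch.amp import GradScaler, autocast  # non-deprecated AMP API
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch_size = 8           # placeholder: halve it until the OOM disappears
accumulation_steps = 4   # placeholder: keeps the effective batch size constant

train_loader = DataLoader(TensorDataset(Train_Images, Train_Labels),
                          batch_size=batch_size, shuffle=True)

scaler = GradScaler("cuda")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model.train()
optimizer.zero_grad()
for step, (images, labels) in enumerate(train_loader):
    images, labels = images.to(device), labels.to(device)
    with autocast("cuda"):  # mixed precision reduces activation memory
        output = model(images)
        conf_loss, ciou_loss = YoloLoss_s_pt(Y_true=labels, Y_Pred=output)
        loss = (conf_loss + ciou_loss) / accumulation_steps
    scaler.scale(loss).backward()  # gradients accumulate across small batches
    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```

Two further things worth checking: the error message itself suggests launching with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, which can help when memory is fragmented rather than genuinely exhausted, and if Train_Images / Val_Images are already CUDA tensors, the TensorDataset keeps the entire dataset resident on the GPU before training even starts, so keep them on the CPU and move only each batch to the GPU inside the loop.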
Best Regards,
Thomas Brown