Hello all,
I'm training several models one after the other, and the GPU cached memory keeps increasing while the second model trains.
Note that when I use the same code to train a single model for many epochs, this doesn't happen.
So it seems to be related to replacing the trained model on the GPU.
For simplicity, I used the same architecture, optimizer, etc. for all of the trained models.
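(By "cached memory" I mean what the caching allocator holds. This is roughly how I'm checking it per epoch; the helper name is just mine, and it assumes torch.cuda.memory_reserved() is available, which was called memory_cached() before PyTorch 1.4.)

import torch

def log_gpu_memory(tag):
    # memory_allocated: memory currently occupied by live tensors
    # memory_reserved: memory held by the caching allocator (the "cached memory" above)
    print(f'{tag}: allocated={torch.cuda.memory_allocated() / 1e6:.1f} MB, '
          f'reserved={torch.cuda.memory_reserved() / 1e6:.1f} MB')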
My code is:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
for model_name, training_obj in all_models.items():
    num_epochs = len(training_obj['lr_array'])
    log_message('--------------Training model: ' + model_name + ' -------------------------------')
    optimizer = training_obj['optimizer']
    train_loader = training_obj['train_loader']
    criterion = training_obj['criterion']
    model = training_obj.get('model')
    model.to(device)
    model.train()
    epochs_to_run = range(num_epochs)

    # Loop over epochs to run and train the model
    for epoch in epochs_to_run:
        torch.cuda.empty_cache()

        # Loop over mini batches
        for i_batch, (images, labels) in enumerate(train_loader):
            images = images.to(device)
            labels = labels.to(device)

            # Forward pass
            outputs = model(images).squeeze()
            optimizer.zero_grad()
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
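For reference, this is the kind of cleanup I'm considering adding at the end of each iteration of the outer loop (after the epoch loop finishes for one model), though I'm not sure it's the right approach; the names below are the same ones used in my loop:

import gc
import torch

# Sketch: would run at the end of each model_name iteration, after the epoch loop.
model.to('cpu')                     # move the trained parameters back to host memory
training_obj['optimizer'] = None    # all_models still references the optimizer (and its state)
del model, optimizer                # drop the local names from the loop body
gc.collect()                        # collect the now-unreferenced objects
torch.cuda.empty_cache()            # return cached blocks to the driver

Is something like this needed between models, or should the memory be reused automatically when the next model is moved to the GPU?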