I have a list of tensors with different shapes, examples_train, and a corresponding list of labels, labels_train. There are about 1200 examples in examples_train. I want to fit my model on this list, but I cannot use a DataLoader because the tensors in the list have different shapes.
My current method is to compute the loss of the examples one by one and sum the losses before doing backpropagation. The code is:
import numpy as np
import torch.nn.functional as F

training_indices = np.arange(len(examples_train))
np.random.shuffle(training_indices)  # shuffle the examples
optimizer.zero_grad()
loss = 0.
for idx in training_indices:
    example = examples_train[idx]
    label = labels_train[idx]
    # Add batch dimension
    example = example.unsqueeze(0)
    label = label.unsqueeze(0)
    example = example.cuda()
    label = label.cuda()
    logits = model(example)
    current_loss = F.cross_entropy(logits, label)
    # Summing tensors keeps every example's graph alive until backward()
    loss += current_loss
loss = loss / len(examples_train)
loss.backward()
optimizer.step()
The code above works, but it sometimes raises a “CUDA out of memory” error, and the optimisation is also very slow. How can I fix this problem? Should I partition the list examples_train into smaller sub-lists (mini-batches)?
Can you use a DataLoader for variable-length data?
Definitely yes. You should implement a collate_fn for the DataLoader; commonly the data are padded to a common size so that they can be arranged as a mini-batch.
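As a rough illustration of the padding approach, here is a minimal sketch. The toy data, the feature size of 4, and the name pad_collate are made up for the example; the lengths are returned so that a model could mask out the padded positions.

```python
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

# Hypothetical toy data: 6 examples with different first dimensions, 4 features each.
examples = [torch.randn(n, 4) for n in (3, 5, 2, 7, 4, 6)]
labels = [torch.tensor(i % 2) for i in range(6)]
dataset = list(zip(examples, labels))  # a plain list works as a map-style dataset

def pad_collate(batch):
    """Zero-pad each example to the longest one in the batch."""
    xs, ys = zip(*batch)
    lengths = torch.tensor([x.shape[0] for x in xs])
    padded = pad_sequence(list(xs), batch_first=True)  # (batch, max_len, 4)
    return padded, lengths, torch.stack(ys)

loader = DataLoader(dataset, batch_size=3, shuffle=True, collate_fn=pad_collate)

for padded, lengths, ys in loader:
    print(padded.shape, lengths.tolist(), ys.tolist())
```

Whether padding is appropriate depends on the model: it has to ignore the padded entries, for example by using the returned lengths as a mask.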
Why does the GPU memory increase over the iterations?
The line loss += ... makes torch keep the computation graph and all intermediate tensors of every example, from the first step on, until you call backward().
Thanks for your reply.
Actually, in my case each example needs to be multiplied with another corresponding tensor in the forward pass, which I didn’t mention above, so it still seems impossible to use a DataLoader.
For the increasing GPU memory problem, can I partition the examples_train list into small lists (mini-batches) and do backpropagation for each list to solve it?
My code implements a graph neural network, and the examples are graphs with different numbers of nodes. There are node2edge and edge2node operations in the forward pass, which need a matrix that is specific to each graph.
In your case, if it’s impossible to batch the data, using gradient accumulation is OK. But as you mentioned, you should partition the entire dataset into mini-batches and accumulate gradients for a limited number of steps (equal to the “batch size”) before each optimizer step; otherwise the GPU memory usage is going to explode.
You don’t need to manually delete the loss variable. After loss.backward(), the computation graph is released by PyTorch itself.
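A minimal sketch of this kind of gradient accumulation is below. TinyModel, the toy data, and accum_steps = 4 are placeholders for illustration, not the actual graph network; the point is that backward() is called per example, so only one example's graph is alive at a time while the gradients add up in .grad.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins: a tiny model that mean-pools over a variable
# number of nodes, and 10 examples with different node counts.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 3)

    def forward(self, x):                # x: (1, num_nodes, 4)
        return self.fc(x.mean(dim=1))    # logits: (1, 3)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyModel().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
examples_train = [torch.randn(n, 4) for n in (3, 5, 2, 7, 4, 6, 3, 8, 5, 2)]
labels_train = [torch.tensor(i % 3) for i in range(10)]

accum_steps = 4  # assumed "batch size": examples per optimizer step
indices = torch.randperm(len(examples_train)).tolist()
num_updates = 0

for start in range(0, len(indices), accum_steps):
    chunk = indices[start:start + accum_steps]
    optimizer.zero_grad()
    for idx in chunk:
        example = examples_train[idx].unsqueeze(0).to(device)
        label = labels_train[idx].unsqueeze(0).to(device)
        # Divide by the chunk size so the summed gradients match a mean loss.
        loss = F.cross_entropy(model(example), label) / len(chunk)
        loss.backward()  # frees this example's graph; gradients accumulate
    optimizer.step()
    num_updates += 1
```

To log the running loss without holding on to graphs, accumulate loss.item() rather than the loss tensor itself.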
And just a hint: accumulating gradients step by step is not identical to a mini-batch update when there are layers like batch norm in your network, because such layers compute their statistics over the examples in each forward pass.
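A small check of this hint, under made-up assumptions (random data, two toy networks, a full batch of 8 split into chunks of 4): without batch norm the accumulated gradients should match the full-batch gradients up to rounding, while with batch norm they generally should not.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(8, 4)              # hypothetical full batch of 8 examples
y = torch.randint(0, 3, (8,))

def grads_full(model):
    """Gradients from one backward pass over the whole batch."""
    m = copy.deepcopy(model)
    F.cross_entropy(m(x), y).backward()
    return [p.grad.clone() for p in m.parameters()]

def grads_accumulated(model, chunk_size=4):
    """Gradients accumulated over chunks of the same batch."""
    m = copy.deepcopy(model)
    for start in range(0, len(x), chunk_size):
        xb, yb = x[start:start + chunk_size], y[start:start + chunk_size]
        # Weight each chunk by its share of the full batch.
        loss = F.cross_entropy(m(xb), yb) * len(xb) / len(x)
        loss.backward()
    return [p.grad.clone() for p in m.parameters()]

def max_grad_diff(model):
    return max((a - b).abs().max().item()
               for a, b in zip(grads_full(model), grads_accumulated(model)))

plain = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
with_bn = nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8), nn.ReLU(), nn.Linear(8, 3))

plain_diff = max_grad_diff(plain)   # expected to be at float rounding level
bn_diff = max_grad_diff(with_bn)    # expected to be clearly larger
print(plain_diff, bn_diff)
```

With one example per forward pass, as in the loop discussed above, batch norm in training mode has even less to work with, so a normalisation layer that does not depend on the batch may be a better fit there.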