Thank you a lot for your answers. I am now able to fit my maximum-length sequences (around 800) with a large batch size (1028). For now I'll stick to a small batch size so that I avoid the effect of caching very differently sized intermediate activations across batches of varying size.
Can I ask you one last question, please: is there any way to understand the inner workings of the memory management? I couldn't understand it well from PyTorch's documentation alone.
I ran the following loop on my smaller dataset, with a variable max sequence length, just to iterate quickly and see how it affects the allocated and cached memory:
import torch

# Memory stats before any batch is moved to the GPU (values in MiB)
print(torch.cuda.memory_allocated() / 1024**2)
print(torch.cuda.memory_cached() / 1024**2)  # deprecated alias of torch.cuda.memory_reserved
print()

for batch in data_loader:  # data_loader and device are set up earlier
    examples, labels = batch
    examples = torch.squeeze(examples)
    print(examples.size())
    examples = examples.to(device)
    # memory held by live tensors vs. memory reserved by the caching allocator, in MiB
    print(torch.cuda.memory_allocated() / 1024**2)
    print(torch.cuda.memory_cached() / 1024**2)
    print("-----")
And I got the following output:
0.0
0.0
/usr/local/lib/python3.10/dist-packages/torch/cuda/memory.py:416: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
warnings.warn(
torch.Size([4092, 21])
0.65576171875
2.0
-----
torch.Size([4092, 18])
1.2177734375
2.0
-----
torch.Size([4092, 19])
1.1552734375
2.0
-----
torch.Size([4092, 21])
1.2490234375
2.0
-----
torch.Size([4092, 23])
1.3740234375
2.0
-----
torch.Size([4092, 30])
1.6552734375
2.0
-----
torch.Size([4092, 35])
2.02978515625
22.0
-----
torch.Size([357, 48])
1.06787109375
22.0
-----
I would love to understand why the allocated memory decreased when going from torch.Size([4092, 35]) to torch.Size([357, 48]), and I'd love to be able to compute for myself when the cached memory will increase (for example, why it jumps at torch.Size([4092, 35]) and not at torch.Size([4092, 30])). All batches have the same data type, torch.int64.
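For reference, this is the back-of-the-envelope sketch I use to compute the raw tensor sizes (tensor_mib is just my own throwaway helper, and I'm assuming 8 bytes per torch.int64 element; how the allocator rounds these and picks its block sizes is exactly the part I can't work out from the documentation):

def tensor_mib(shape, bytes_per_elem=8):
    # raw size of a tensor with the given shape, in MiB, before any allocator rounding
    n = 1
    for d in shape:
        n *= d
    return n * bytes_per_elem / 1024**2

print(tensor_mib((4092, 30)))  # ~0.94 MiB: reserved memory stayed at 2.0 MiB here
print(tensor_mib((4092, 35)))  # ~1.09 MiB: reserved memory jumped from 2.0 to 22.0 MiB here
print(tensor_mib((357, 48)))   # ~0.13 MiB: the much smaller last batch

From these numbers it looks like the reserved memory only jumped once a single batch tensor crossed roughly 1 MiB, but I can't tell whether that is the actual rule or just a coincidence in my data.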
Thank you a lot again! It feels satisfying to be able to pinpoint the issue ^^