Any tips for performance with huge sequence length variation?

I have a working forward model that takes sequence data and labels it. While looking for possible performance improvements in my code, I ran across the suggestion that sequence data could be preallocated to avoid memory fragmentation (an issue for me that leads to OOM errors if I don’t clear the cache on every batch of the training loop).
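For reference, the per-batch cache clearing I’m doing looks roughly like this (the model, data, and sizes below are just stand-ins for illustration, not my actual code):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real model and data, just to show the pattern.
model = nn.LSTM(input_size=16, hidden_size=32, batch_first=True).cuda()
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())

for step in range(100):
    # Sequence length varies from batch to batch (up to 3x the minimum).
    seq_len = torch.randint(100, 300, (1,)).item()
    seqs = torch.randn(8, seq_len, 16, device="cuda")
    target = torch.randn(8, seq_len, 32, device="cuda")

    optimizer.zero_grad()
    out, _ = model(seqs)
    loss = loss_fn(out, target)
    loss.backward()
    optimizer.step()

    # Clearing the caching allocator every iteration avoids the fragmentation
    # OOM, but forces fresh allocations on the next batch, which is slow.
    torch.cuda.empty_cache()
```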

The problem is that the maximum sequence is 3x the minimum sequence. Even if I preallocate the tensors, the model now has to process the longest possible sequence on every batch, which greatly slows down overall training (more than wiping out the benefit of preallocation).
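Roughly, the preallocation I tried looks like this (the names and sizes are made up for illustration):

```python
import torch

# Hypothetical sizes just for illustration; my real batches differ.
BATCH, MAX_LEN, FEATURES = 8, 300, 16

# One buffer preallocated at the maximum possible sequence length, reused
# for every batch so the allocator never has to carve out new blocks.
buffer = torch.zeros(BATCH, MAX_LEN, FEATURES, device="cuda")

def stage_batch(seqs):
    """Copy a shorter batch into the fixed-size buffer, zero-padding the rest."""
    buffer.zero_()
    buffer[:, :seqs.size(1)].copy_(seqs)
    # The model always receives (BATCH, MAX_LEN, FEATURES), so every batch
    # pays the compute cost of the longest possible sequence.
    return buffer
```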

I’ve tried sending only the relevant portion of each batch through the model, but that doesn’t work: every view call complains that the shape of the “necessary” tensor isn’t compatible with the preallocated tensor.
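My guess is that the failure comes from the slice of the preallocated buffer not being contiguous. A tiny repro of what I think is happening (hypothetical shapes again):

```python
import torch

buffer = torch.zeros(8, 300, 16, device="cuda")   # preallocated at max length
chunk = buffer[:, :120]                           # the "necessary" portion

# The slice shares storage with the full buffer and is not contiguous, so a
# view() on it fails with an incompatibility error like the one I'm seeing.
try:
    chunk.view(8 * 120, 16)
except RuntimeError as err:
    print(err)

# reshape() (or calling .contiguous() first) copies when needed and does work
# in this toy case, though I don't know yet if that's acceptable in my model.
flat = chunk.reshape(8 * 120, 16)
```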

Any other suggestions I could try? Again, the model works. It’s just that I have to clear the cache every batch or I’ll get OOM errors from all the fragmentation, and I know that’s hurting performance.

Just for clarification: by OOM, do you mean GPU memory or regular RAM?

OOM on the GPU memory

Can you show the structure of your training loop (over epochs and over iterations)? There might be room to restructure things so that memory is released earlier.