Loading Sequential Data of Varying Lengths onto the GPU

Hi, I have time series data of varying lengths, collected on different individuals, and I'm trying to train an LSTM on it using PyTorch. Following some threads on here, I got a first attempt working. I extended the Dataset class so that it takes my data as a Pandas Series of NumPy arrays and returns a (features, labels) tuple of arrays for each individual. Then I use a DataLoader with a custom collate function that batches the sequences with pack_sequence on a per-batch basis, so each mini-batch is only padded/packed up to the length of its longest sequence rather than padding the whole dataset to the length of some outlier series. From my understanding, all of this still happens on the CPU, and in my training loop I send each mini-batch to the GPU. Using CUDA this way has already cut my training time by a factor of 5, but I'm wondering if the repeated CPU-to-GPU transfers are still slowing things down.
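A simplified sketch of the kind of setup I mean (class and function names here are placeholders, not my exact code):

```python
import torch
from torch.nn.utils.rnn import pack_sequence
from torch.utils.data import Dataset, DataLoader

class SeriesDataset(Dataset):
    """One variable-length (features, labels) pair per individual."""
    def __init__(self, features_series, labels_series):
        # features_series / labels_series: pandas Series of numpy arrays
        self.features = features_series
        self.labels = labels_series

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features.iloc[idx], self.labels.iloc[idx]

def collate_packed(batch):
    # Pack per mini-batch, so sequences are only handled up to the
    # longest length in this batch rather than across the whole dataset
    feats = [torch.as_tensor(f, dtype=torch.float32) for f, _ in batch]
    labels = [torch.as_tensor(l, dtype=torch.float32) for _, l in batch]
    return (pack_sequence(feats, enforce_sorted=False),
            pack_sequence(labels, enforce_sorted=False))

dataset = SeriesDataset(features_series, labels_series)  # placeholder data
loader = DataLoader(dataset, batch_size=32, shuffle=True,
                    collate_fn=collate_packed)
```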

Since my dataset is relatively small, I wanted to try putting all of it on the GPU to see if that improves the running time. From what I've seen online, a common way to do this is to send everything to the GPU in the __init__ method of the Dataset class. However, as far as I understand, the data has to be in PyTorch tensors before it can be moved to the GPU, and since the data isn't padded at that point and I can't stack tensors of different lengths, I'm running into a problem.

Is there a solution to this or perhaps a better way to achieve what I’m trying to do? I’m relatively new to PyTorch, so I may very well be unaware of better options. Thank you in advance for any and all help!

I would recommend profiling the code first to check where the bottleneck is, before trying to optimize one part of the code that might not be slowing down the training at all.
To do so, you could use the native PyTorch profiler or e.g. Nsight Systems, either of which would give you an overview of where the time is currently being spent.
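As a rough sketch with the native profiler (`loader` and `train_step` stand in for your existing DataLoader and forward/backward/optimizer-step code):

```python
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for step, batch in enumerate(loader):
        if step >= 10:          # a handful of iterations is usually enough
            break
        train_step(batch)

# Sorting by CUDA time shows whether compute or data transfer dominates
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```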
Preloading the data and moving it directly to the GPU could work, with the drawbacks of a potentially slower startup time and higher GPU memory usage.
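One way that could look, as a rough sketch: keep the per-individual sequences as a list of GPU tensors in __init__, so nothing needs to be stacked or padded up front (names here are placeholders):

```python
import torch
from torch.utils.data import Dataset

class GPUPreloadedDataset(Dataset):
    """Rough sketch: move every sequence to the GPU once, up front."""
    def __init__(self, features_series, labels_series, device="cuda"):
        # Keep a list of per-individual tensors instead of one stacked tensor,
        # so sequences of different lengths can sit on the GPU without padding
        self.features = [torch.as_tensor(f, dtype=torch.float32).to(device)
                         for f in features_series]
        self.labels = [torch.as_tensor(l, dtype=torch.float32).to(device)
                       for l in labels_series]

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        # Already on the GPU, so the per-batch collate_fn can pack directly
        return self.features[idx], self.labels[idx]
```

Note that this assumes num_workers=0 in the DataLoader, since CUDA tensors don't work well with DataLoader worker processes, and pin_memory should stay disabled.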

Alright, thanks for the suggestion! I’ll try using a profiler and will follow up.