My model predicts sequentially on input samples X, where the output of the previous timestep, out, is always fed back into the model to produce the next output:
out_1 = model(X_1,out_0)
out_2 = model(X_2,out_1)
...
out_t = model(X_t, out_{t-1})
Currently, I feed the model a sequence [X_1, X_2, ..., X_8] of length 8 (plus the initial value out_0) to compute [out_1, ..., out_8], and then do the backprop. A single sequence is what I call a batch here, i.e., batch_size = sequence length.
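To make the setup concrete, here is a minimal sketch of this sequential feedback loop. The model architecture and dimensions (StepModel, in_dim, out_dim) are hypothetical stand-ins, not the actual model:

```python
import torch
import torch.nn as nn

# Hypothetical toy model: combines the current input with the previous output.
class StepModel(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim + out_dim, out_dim)

    def forward(self, x_t, out_prev):
        return torch.tanh(self.fc(torch.cat([x_t, out_prev], dim=-1)))

seq_len, in_dim, out_dim = 8, 4, 3
model = StepModel(in_dim, out_dim)

X = torch.randn(seq_len, in_dim)           # [X_1, ..., X_8]
out = torch.zeros(out_dim)                 # out_0
outputs = []
for t in range(seq_len):
    out = model(X[t], out)                 # out_t = model(X_t, out_{t-1})
    outputs.append(out)

loss = torch.stack(outputs).pow(2).mean()  # placeholder loss
loss.backward()                            # backprop over the whole sequence
```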
For a single batch, only a small fraction of my GPU's memory is utilized. Therefore, I'd like to process N batches in parallel (given the same state of the model) and then do the backprop every N batches. This would yield much better GPU memory usage, while the time to process the N batches stays (roughly) the same as for a single batch. A note: since the sequence length is a hyperparameter in my setting, it is not sufficient to simply increase the sequence length of a single batch until the GPU memory is fully utilized.
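What I have in mind could be sketched like this: stack N independent sequences along a leading batch dimension and run one batched forward step per timestep, so all N sequences advance in a single kernel launch. Again, StepModel and the dimensions are hypothetical placeholders:

```python
import torch
import torch.nn as nn

# Same hypothetical step model; nn.Linear broadcasts over a leading
# batch dimension, so the model itself needs no changes.
class StepModel(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim + out_dim, out_dim)

    def forward(self, x_t, out_prev):
        return torch.tanh(self.fc(torch.cat([x_t, out_prev], dim=-1)))

N, seq_len, in_dim, out_dim = 32, 8, 4, 3
model = StepModel(in_dim, out_dim)

X = torch.randn(N, seq_len, in_dim)   # N independent sequences of length 8
out = torch.zeros(N, out_dim)         # out_0 for every sequence
outputs = []
for t in range(seq_len):
    out = model(X[:, t], out)         # one batched step for all N sequences
    outputs.append(out)

loss = torch.stack(outputs, dim=1).pow(2).mean()  # placeholder loss
loss.backward()                       # one backward pass every N sequences
```

The sequence length stays at 8 (so the hyperparameter is untouched), and only N is increased until the GPU memory is filled.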
I have not dealt with parallelization in PyTorch yet. I often find descriptions of how to parallelize models and computations across multiple CUDA devices (seriously, how many graphics cards do people have? XD), but not of how to efficiently fill the memory of a single CUDA device and compute stuff in parallel.