Processing sequential autoregressive model outputs in parallel on a single GPU?


My model predicts sequentially on input samples X, where the output of the previous timestep "out" is always fed back into the model to get the next output:

out_1 = model(X_1, out_0)
out_2 = model(X_2, out_1)
...
out_t = model(X_t, out_t-1)

Currently, I feed the model sequences [X_1, X_2, …, X_8] of length 8 (plus the initial value out_0) to compute [out_1, …, out_8] and then do the backprop. A single sequence is what I call a batch here, i.e. the sequence length equals the batch_size.
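To make the setup concrete, here is a minimal sketch of such a training loop. `StepModel`, the dimensions, and the MSE loss are all placeholders I made up, not the poster's actual code:

```python
import torch
import torch.nn as nn

# Toy stand-in for the autoregressive step model (an assumption).
class StepModel(nn.Module):
    def __init__(self, in_dim=4, out_dim=4):
        super().__init__()
        self.fc = nn.Linear(in_dim + out_dim, out_dim)

    def forward(self, x_t, out_prev):
        # Next output depends on the current input and the previous output.
        return self.fc(torch.cat([x_t, out_prev], dim=-1))

model = StepModel()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

seq_len = 8                       # sequence length == "batch_size" above
X = torch.randn(seq_len, 4)       # [X_1, ..., X_8]
target = torch.randn(seq_len, 4)  # made-up supervision targets
out = torch.zeros(4)              # out_0

# Unroll the sequence, feeding each output back in, then backprop once.
loss = 0.0
for t in range(seq_len):
    out = model(X[t], out)
    loss = loss + ((out - target[t]) ** 2).mean()
opt.zero_grad()
loss.backward()
opt.step()
```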

For a single batch, only a small fraction of my GPU's memory is utilized. Therefore, I'd like to process N batches in parallel (given the same state of the model) and then do the backprop every N batches. This would yield much better GPU memory usage, while the time to process the N batches stays (roughly) the same as for a single batch. A note: since the sequence length is a hyperparameter in my setting, it is not sufficient to simply increase the batch_size / sequence length of a single batch until the GPU memory is fully utilized.
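Since the N sequences are independent given the same model state, the usual way to get this parallelism is to stack them along a leading batch dimension, so each model call advances all N sequences one timestep at once. A hedged sketch, reusing the same made-up toy model and loss:

```python
import torch
import torch.nn as nn

# Same toy step model as before (an assumption, not the real model).
class StepModel(nn.Module):
    def __init__(self, in_dim=4, out_dim=4):
        super().__init__()
        self.fc = nn.Linear(in_dim + out_dim, out_dim)

    def forward(self, x_t, out_prev):
        return self.fc(torch.cat([x_t, out_prev], dim=-1))

model = StepModel()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

N, seq_len = 16, 8                # N independent sequences of length 8
X = torch.randn(N, seq_len, 4)    # the N input sequences, stacked
target = torch.randn(N, seq_len, 4)
out = torch.zeros(N, 4)           # N copies of out_0

# One model call per timestep now processes all N sequences at once;
# only the recurrence across timesteps remains sequential, as it must.
loss = 0.0
for t in range(seq_len):
    out = model(X[:, t], out)     # shape (N, 4): one step for all N sequences
    loss = loss + ((out - target[:, t]) ** 2).mean()
opt.zero_grad()
loss.backward()                   # one backprop for all N batches
opt.step()
```

This keeps the sequence length (the hyperparameter) fixed at 8 and uses N to fill GPU memory; the batched matrix multiplies are what actually exploit the idle compute units.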

I have not dealt with parallelization in PyTorch yet. I often find descriptions of how to parallelize models and computations across multiple CUDA devices (seriously, how many graphics cards do people have? XD), but not of how to efficiently fill the memory of a single CUDA device and compute things in parallel.


Best, JZ

The ability to run workloads in parallel depends on the available compute resources and on how CUDA streams are used.
Usually kernels are written to occupy the whole GPU, and potential parallelism between different kernels is not taken into consideration. I.e., even if you schedule kernels on different streams, no SMs might be free to actually launch the overlapping workload, but you could certainly try it out for your use case.
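Trying it out could look like the following sketch, which launches two independent forward passes on separate streams via `torch.cuda.Stream`; the model and sizes are placeholders, and whether the kernels actually overlap depends on free SMs:

```python
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    model = torch.nn.Linear(512, 512).to(device)  # placeholder model
    x1 = torch.randn(256, 512, device=device)
    x2 = torch.randn(256, 512, device=device)

    s1 = torch.cuda.Stream()
    s2 = torch.cuda.Stream()

    torch.cuda.synchronize()
    # Work queued on different streams *may* overlap on the device.
    with torch.cuda.stream(s1):
        y1 = model(x1)
    with torch.cuda.stream(s2):
        y2 = model(x2)
    torch.cuda.synchronize()  # wait for both streams before using y1, y2
else:
    print("CUDA not available; nothing to overlap")
```

Profiling with `torch.profiler` (or Nsight Systems) would show whether the two kernels actually ran concurrently or were serialized.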