Is it true that you can increase your batch size up to roughly your maximum GPU memory before loss.step() slows down? I thought a GPU would do the computation for all samples in the batch in parallel, but it seems like PyTorch GPU-accelerated backprop takes much longer for bigger batches. It could be swapping to CPU, but I looked at the GPU memory usage in nvidia-smi and it is under 70%.
The short answer to your question is no.
In general, a GPU can only do calculations in parallel up to the
number of pipelines it has. A GPU might have, say, 12 pipelines.
So putting bigger batches (“input” tensors with more “rows”) into
your GPU won’t give you any more speedup after your GPUs are
saturated, even if they fit in GPU memory. Bigger batches may
(or may not) have other advantages, though.
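To make the saturation argument concrete, here is a minimal toy cost model, not a measurement of any real GPU. It assumes samples are processed in "waves" of at most `lanes` at a time and that every wave costs the same; `lanes`, `time_per_wave`, and both function names are hypothetical and chosen only for illustration.

```python
import math


def batch_time(batch_size, lanes, time_per_wave=1.0):
    """Toy model: a batch is processed in waves of at most `lanes` samples."""
    waves = math.ceil(batch_size / lanes)
    return waves * time_per_wave


def epoch_time(n_samples, batch_size, lanes, time_per_wave=1.0):
    """Time to push every sample through once, batch by batch."""
    n_batches = math.ceil(n_samples / batch_size)
    return n_batches * batch_time(batch_size, lanes, time_per_wave)


LANES = 12  # the hypothetical "12 pipelines" from the text above
for bs in (1, 6, 12, 24, 48):
    print(bs, batch_time(bs, LANES), epoch_time(4800, bs, LANES))
```

In this model the epoch time drops until the batch size reaches `lanes` and then stays flat, while the time per batch keeps growing. Real hardware is messier (memory transfers, kernel launch overhead), so treat this only as a picture of the argument.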
As an aside, you probably didn’t mean to say
loss.step(). PyTorch loss functions don’t have
step methods. (Optimizers
do.) The part of the typical training iteration that processes a
batch (potentially partially in parallel) is when you call something
like prediction = model(input).
Also it’s not clear to me which part of the calculation you mean
when you say “backprop”. If you mean updating your model
weights, this occurs when you call
optim.step(), and this
piece is independent of the size of the batches. (However, the
gradients used by
optim.step() are being accumulated when
As noted by @K_Frank, you can take a look at CUDA streams for further information about how CUDA launches GPU kernels.
@K_Frank Awesome, great to know.
I actually meant .backward(). Typically, the three steps for backprop:
I believe backward() usually takes the longest by far, way more than the inference and optimizer step etc.
So how do we determine the maximum batch size for 100% parallel computation? Is there code or a diagnostic tool we can use to determine when doing multi-GPU backprop will increase speed?
Yes, Nvidia has a tool to measure performance. But being fast on a GPU does not only depend on how many streams you launch at a time. It also depends on memory transfers, memory fetching and so on.
However, note that GPU memory can become the bottleneck of your algorithm: autograd has to store the whole computation history for the backward pass, so a larger model or a bigger batch makes PyTorch use more memory. In general, unless you call
cudaDeviceSynchronize, CUDA queues all the kernels you want to launch and the main thread continues its execution. CUDA then manages how kernels are queued and when they are launched; I do not know if you can control that. What you can control is how you fetch memory, branch divergence, how many threads you launch per block or per grid, and so on…
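Because kernels are queued asynchronously, wall-clock timings taken from Python can be misleading unless you wait for the GPU to finish. A minimal sketch of timing each phase of one training iteration follows; it uses torch.cuda.synchronize(), the PyTorch-level way to wait for queued GPU work. The model, the batch of 1024 vectors of length 50, and the `timed` helper are stand-ins invented for this example, and the code falls back to CPU when no GPU is present.

```python
import time

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def timed(fn):
    """Run fn() and return (result, seconds), waiting for queued GPU work."""
    if device.type == "cuda":
        torch.cuda.synchronize()  # finish anything queued before we start
    start = time.perf_counter()
    result = fn()
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for the kernels fn() launched
    return result, time.perf_counter() - start


# Hypothetical stand-in model and data (vectors of length 50)
model = nn.Sequential(nn.Linear(50, 64), nn.ReLU(), nn.Linear(64, 1)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
inputs = torch.randn(1024, 50, device=device)
targets = torch.randn(1024, 1, device=device)

optimizer.zero_grad()
prediction, t_forward = timed(lambda: model(inputs))
loss = loss_fn(prediction, targets)
_, t_backward = timed(lambda: loss.backward())
_, t_step = timed(lambda: optimizer.step())

print(f"forward {t_forward:.6f}s  backward {t_backward:.6f}s  step {t_step:.6f}s")
```

The first iteration usually includes one-off start-up costs, so it is probably worth running a few warm-up iterations and averaging over several more before comparing batch sizes.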
I see, that is interesting about the batch and computation history in memory. But like I said nvidia-smi shows the GPU memory is not used up.
I need to do whatever I can to make the backprop faster, as I’ve determined that to be by far the bottleneck (30 seconds to compute backward() on 128,000 samples, each sample a vector of length 50). The only thing I can think of right now is to scale with multiple GPUs.
You hinted at some more advanced techniques @jmaronas . I feel like if there was any way to speed things up it would be included in PyTorch already, because all deep learning is doing the same matrix multiplications. So I feel like there aren’t many practical gains I can make by going deeper below the PyTorch abstraction layer.
I was looking into nn.DataParallel https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html
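For reference, a minimal sketch of how nn.DataParallel is typically wired in. The small model and the batch of 256 vectors of length 50 are made up for illustration. The wrapper splits each input batch along dimension 0 across the visible GPUs, and the sketch simply runs unwrapped when fewer than two GPUs are available. Whether this actually speeds up a model this small is an open question, since scattering inputs and gathering outputs adds overhead.

```python
import torch
import torch.nn as nn

# Hypothetical small model for vectors of length 50
model = nn.Sequential(nn.Linear(50, 64), nn.ReLU(), nn.Linear(64, 1))

# Wrap only when more than one GPU is visible; otherwise run as usual
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

inputs = torch.randn(256, 50, device=device)
out = model(inputs)        # with DataParallel, the batch is split across GPUs
loss = out.pow(2).mean()   # placeholder loss, just to have something to differentiate
loss.backward()            # gradients end up on the original module's parameters
print(out.shape)
```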
Let me correct something I said above that wasn’t entirely right.
model(input) does (typically) work on an entire batch, but
that is not the only part of training that does. As both you and
@jmaronas correctly point out,
loss.backward() does as
well. Derivatives can involve more calculation, so, as you noted,
loss.backward() is likely to be your most time-consuming step.
But my main points still stand:
A GPU typically only has a handful of pipelines, so you
don’t need much of a batch to saturate them. (And there
is plenty of opportunity for vectorization / parallelism even
with a batch size of one.)
The primary purpose of using batches is to make the
training algorithm work better, not to make the algorithm
use GPU pipelines more efficiently. (People use batches
on single-core CPUs.)
So increasing your batch size likely won’t make things
run faster. (More precisely, it won’t generally let you run
through an epoch faster. It might make your training
converge more quickly or more slowly.)
I don’t have any advice on how to make your training run
faster. There might be some tuning you can do, but my
working assumption is that, under the hood, pytorch does
a pretty good job of using the GPUs efficiently. It may be
that you need faster hardware – a faster GPU with more
pipelines, or a machine with multiple GPUs. (Or maybe
more time and patience …)