Increase my speed with 8 GPUs

My model trains with a packet size of 128 on 1 GPU. I want to speed up my work and use 8 GPUs, but I cannot increase the packet size. Using a packet size of 128 with 8 GPUs will not increase my speed. Is there a tutorial on speeding things up on 8 GPUs without increasing the batch size?

Why would you want to keep the global batch size constant and split the computation across different devices?
Note that GPUs usually achieve higher utilization with a larger workload (which would, e.g., correspond to a larger batch size).
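That said, if the goal is just to split the existing global batch of 128 across 8 GPUs, each rank would process 128 / 8 = 16 samples. Here is a minimal sketch of that setup using DistributedDataParallel, assuming a launch via torchrun; the toy linear model and random dataset are just placeholders to make it runnable:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets LOCAL_RANK for each spawned process (assumption: torchrun launch)
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    world_size = dist.get_world_size()                    # 8 GPUs -> 8 processes
    global_batch_size = 128
    per_gpu_batch_size = global_batch_size // world_size  # 128 / 8 = 16 samples per GPU

    # toy model and dataset, only to keep the sketch self-contained
    model = nn.Linear(20, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(4096, 20), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(dataset)                 # each rank sees a disjoint shard
    loader = DataLoader(dataset, batch_size=per_gpu_batch_size, sampler=sampler)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(2):
        sampler.set_epoch(epoch)                          # reshuffle the shards each epoch
        for data, target in loader:
            data, target = data.cuda(local_rank), target.cuda(local_rank)
            optimizer.zero_grad()
            loss = criterion(model(data), target)
            loss.backward()                               # DDP averages gradients across ranks
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Whether this actually gives a speedup depends on how much work each GPU gets: with only 16 samples per device, the communication overhead can eat most of the gain, which is why the batch size question matters.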

Sorry, but I didn’t understand your answer. I’ll try to ask again. In my model with one GPU, the packet size is 128. Accordingly, I call loss.backward() in increments of 128. If I use a packet size of 128 * 8 = 1024, then my model cannot learn.

Do you mean the batch size by “packet size”? If so, could you explain what you mean by calling loss.backward() “in increments of 128”?

The loss criterion will calculate the average loss over the input and target batch by default (reduction='mean'), and loss.backward() will thus compute the gradients of all parameters w.r.t. this averaged loss.
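To make that concrete, here is a small check with a toy linear layer (an illustration, not part of any tutorial) showing that the gradients under the default reduction='mean' are exactly the summed-loss gradients divided by the batch size of 128:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)
x = torch.randn(128, 10)               # a "packet" / batch of 128 samples
y = torch.randint(0, 2, (128,))

# default reduction='mean': the loss is averaged over the 128 samples
loss_mean = nn.CrossEntropyLoss(reduction="mean")(model(x), y)
loss_mean.backward()
grad_mean = model.weight.grad.clone()

model.zero_grad()

# reduction='sum': the loss, and hence the gradients, are 128x larger
loss_sum = nn.CrossEntropyLoss(reduction="sum")(model(x), y)
loss_sum.backward()
grad_sum = model.weight.grad.clone()

print(torch.allclose(grad_sum, grad_mean * 128, rtol=1e-4))  # True
```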

If you are increasing the batch size in a data parallel setup, you might need to adapt some hyperparameters, such as the learning rate, if your model is sensitive to the larger batch size.
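One common starting point is the linear scaling rule with a short warmup; this is only a hedged sketch, and the base learning rate of 0.1 and the warmup length are placeholder values you would tune for your model:

```python
# Learning rate tuned for the original global batch size of 128
base_lr = 0.1
base_batch_size = 128
new_batch_size = 1024          # e.g. 128 samples per GPU on 8 GPUs

# Linear scaling rule: scale the lr by the same factor as the batch size
scaled_lr = base_lr * new_batch_size / base_batch_size   # 0.8

def lr_for_epoch(epoch, warmup_epochs=5):
    """Ramp the lr linearly from base_lr to scaled_lr over the first epochs,
    then keep it at the scaled value."""
    if epoch < warmup_epochs:
        return base_lr + (scaled_lr - base_lr) * epoch / warmup_epochs
    return scaled_lr
```

Whether this particular schedule works depends on the model; the general point is just that the optimizer settings that worked for a batch size of 128 are not guaranteed to work unchanged at 1024.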