Multi-GPU Dataloader and multi-GPU Batch?


I’m trying to load data in separate GPUs, and then run multi-GPU batch training.
I’ve managed to balance data loaded across 8 GPUs, but once I start training, I trigger an assertion:

RuntimeError: Assertion `THCTensor_(checkGPU)(state, 5, input, target, weights, output, total_weight)' failed. Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /pytorch/aten/src/THCUNN/generic/

This is understandable: the data is already spread across various devices.
Is this not supported? I understand that multi-GPU batch training will scatter the data during iteration over the batch,
so is the only way to achieve this to copy tensors across devices, onto the device where the batch is being used?

How can I find out which device the batch will be placed on? I know I can query the device the current input/sample tensor is on and copy it from one device to another (although I’m not sure this is ideal), provided that forces the batch and the tensors to reside on the same device.
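For context, the querying/copying I have in mind looks roughly like this (a minimal sketch; the devices are placeholders, and it falls back to CPU when no GPU is present):

```python
import torch

# Stand-in for a batch that, in my real setup, might already live on e.g. cuda:1.
batch = torch.randn(8, 3)

# Every tensor knows its device, so it can be queried directly...
print(batch.device)

# ...and .to() copies it to the device the model expects
# (a no-op if it is already there).
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
batch = batch.to(device)
```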

Also, would that create a bottleneck?

I just realized that there is a method called nn.parallel.scatter (and other similar ones), which I assume achieves the same functionality?

The parallel methods are used in e.g. nn.DataParallel to scatter and gather the tensors and parameters to and from multiple GPUs.

Generally speaking, the data and model have to be on the same device, if you want to execute an operation on both of them.
I’m not sure I understand your use case completely, but you could have a look at nn.DistributedDataParallel and see if this implementation would work for you.
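Here is a minimal sketch of this rule (toy model and shapes, just for illustration):

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 5).to(device)  # parameters now live on `device`
x = torch.randn(20, 10).to(device)   # the input must live on the same device

out = model(x)  # works, since weight and input are on the same device
```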

Hi @ptrblck thanks for the prompt reply.

I am trying to distribute data across multiple GPUs, because I’ve found that training time decreases dramatically when the data is already allocated on the GPU before training starts.

As you can imagine, with large data-sets that’s very hard to do, since GPU memory is also used by the model and the batch samples. So I am looking into spreading parts of the data-set across multiple GPUs.

According to the error produced, and as you explained, that’s a problem because the weights, batch samples and other tensors (e.g. the CE loss tensors) need to be on the same device?

It is my understanding (please correct me if I’m wrong) that I need copies of the network/model across multiple GPUs (which is where nn.DataParallel comes into play), and I also need to ensure that the batch tensors are on the same GPU?
It is the last part I don’t know how to do.
I’ve been looking into nn.DistributedDataParallel, but the tutorial doesn’t explicitly state how to distribute the DataLoader across multiple GPUs, and even more importantly, how to ensure the device id is the same for the model and the batch during training. I am sure it is something very simple.

Alternatively, I was wondering if I should have multiple dataloaders each associated with a GPU, and then just train a model across those different dataloaders/GPUs.


To be more clear, I’ve been looking at the tutorial here:

The nn.DistributedDataParallel example explains how to send the model and other parameters across seamlessly:

    model = some_model().to(device_ids[0])
    ddp_model = DDP(model, device_ids=device_ids)

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_ids[0])
    loss_fn(outputs, labels).backward()
    optimizer.step()

What I don’t know, however, is how to ensure that the batch input (in this case the torch.randn) is on the same CUDA device. For example, the nn.DataParallel tutorial shows that within the epoch/batch training loop, I copy (or upload) to a device:

for data in rand_loader:
    input = data.to(device)
    output = model(input)

Suppose the above (^^^) is my training loop, and my tensors (inputs and associated outputs/classes) have already been loaded onto different GPUs; how can I ensure that they are on the correct GPU device?

The closest I’ve seen to an explanation is this tutorial, but it doesn’t actually answer that question.

Many thanks!

The usual work flow would be to try to load the next batch into the RAM while the GPUs are busy training the model to hide the latency of the data loading step.
In each iteration you would push a batch to the device and start loading the next batch.
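As a rough sketch of that workflow (random tensors stand in for a real dataset; the shapes are arbitrary):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Toy dataset standing in for the real one.
dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 5))

# Workers prepare the *next* batches on the CPU while the GPU is busy;
# pinned memory enables asynchronous host-to-device copies.
loader = DataLoader(dataset, batch_size=64, num_workers=2, pin_memory=True)

seen = 0
for data, target in loader:
    # non_blocking=True lets the copy overlap with the GPU compute
    # (it only takes effect for pinned CPU memory).
    data = data.to(device, non_blocking=True)
    target = target.to(device, non_blocking=True)
    seen += data.size(0)
    # ... forward / backward / optimizer step would go here ...
```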

Yes, that’s correct. In order to execute the wanted operation (e.g. a matrix multiplication), the parameters (data and weight) need to be on the same device.

Generally yes. However, you usually don’t need to care about the actual data scattering and gathering as DDP will do this all for you.
You would have to specify a default device (e.g. GPU0), push the data onto this default device, and execute your training. DDP should automatically scatter the provided data in dim0 to all devices, as well as take care of all synchronization.

Multiple DataLoaders might make sense for specific use cases, but you would have to check your system resources and see if this would create a bottleneck (loading data from the SSD, preprocessing on the CPU, etc.).
I would recommend using a single DataLoader and DDP first, as shown in the linked tutorial.

nn.DataParallel and DDP will split the batch in dim0 and scatter each chunk to the corresponding device.
This tutorial gives you a good example.
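For example (toy model; on a CPU-only or single-GPU machine the DataParallel wrapper is skipped, but with e.g. 4 GPUs and a batch size of 128 each replica would see 32 samples):

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 5)
if torch.cuda.device_count() > 1:
    # Replicates the model on each GPU; the forward pass then scatters
    # the input in dim0, runs the replicas in parallel, and gathers
    # the outputs back on the default device.
    model = nn.DataParallel(model)
model = model.to(device)

x = torch.randn(128, 10).to(device)  # full batch on the default device
out = model(x)
```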

Let me know, if something is unclear. :wink:


@ptrblck Thank you for the precise answer! :slight_smile:

One last question: is there a way to get the current device context (e.g., as in the current thread’s context)? For my purposes I can use DDP, but I still have to move data across devices, due to my approach of uploading to the GPUs early on. I can see this being done by moving tensors from one device to another; I just don’t know whether PyTorch currently supports tensor move semantics across devices.
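Something like this is what I have in mind (a rough sketch, with a CPU fallback):

```python
import torch

# Query the index of the current CUDA device for this process.
if torch.cuda.is_available():
    device = torch.device("cuda", torch.cuda.current_device())
else:
    device = torch.device("cpu")

# .to() performs a copy whenever source and destination differ, so a
# cross-GPU move would just be e.g. t.to("cuda:1") from a cuda:0 tensor.
t = torch.randn(4, 4)
t = t.to(device)
```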

The usual work flow would be to try to load the next batch into the RAM while the GPUs are busy training the model to hide the latency of the data loading step.
In each iteration you would push a batch to the device and start loading the next batch.

Is there an example of how to do this? E.g., how can I know the tensors of the next batch beforehand, so I can start uploading them before the current batch runs through the ops?

I’ve followed the nn.DataParallel tutorial and placed my pre-processed dataset in CPU RAM, which is then uploaded during training to a generic CUDA device, letting the mechanism assign the tensors.

I’m using an AWS p2.8xlarge with 8x K80 GPUs, and what has shocked me is how slow it is. The equivalent run on my 2080Ti is magnitudes faster. I’ve pinned the memory, but I suppose the copy/upload is expensive. The K80s also have to use cuda().float() rather than half().

I’ve also tried a p3.8xlarge, which has 4x V100; whilst it is considerably faster because I’m using half precision, to my surprise the GPUs are only utilised up to 40%, and usually average less than that.

Using a p3.xlarge with 1x V100 GPU, I took the same approach: I uploaded all tensors into CPU RAM (pinned) and then ran the exact same code (only now there’s one GPU). The GPU was used at 86% on average, with about 2/5 of its memory occupied by the model and batch.

Finally, I compared CPU-to-GPU against GPU-only using my own 2080Ti, only I can’t fit the entire data-set on the GPU (hence why I first started looking into multi-GPU allocated data-loaders).

My approach is simple: I load from the hard disk at construction, transform the images to tensors, and then upload all of them to the GPU rather than the CPU. This requires that the model, batch and data-set fit in GPU memory.
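As a rough sketch, assuming everything fits in GPU memory (the class name and shapes are made up):

```python
import torch
from torch.utils.data import Dataset

class GPUTensorDataset(Dataset):
    """Keeps all samples resident on a single device after one upload."""

    def __init__(self, images, labels, device):
        # One-time upload at construction; every __getitem__ afterwards
        # is device-local indexing, with no CPU-to-GPU copy per batch.
        self.images = images.to(device)
        self.labels = labels.to(device)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
ds = GPUTensorDataset(torch.randn(100, 3, 32, 32),
                      torch.randint(0, 10, (100,)), device)
x, y = ds[0]
```

Note that a DataLoader over such a dataset has to use num_workers=0, since CUDA tensors can’t be shared with worker processes.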

The comparison isn’t fair: here I’m using fewer samples per class, whereas before I was using 5,000 samples per class. On AWS I had 50,000 total samples, whereas on the 2080Ti I could only upload 20,000 samples before running out of GPU RAM.

However, the GPU is used at 100% at all times, and the GPU RAM at 95%-98%. Compared with the 4x V100 (or even the 1x V100 fed from CPU RAM), there is a stark difference in how fast training proceeds. There are no CPU-to-GPU copies, and this offers a huge advantage. I understand that if I were to use larger models I’d have issues, or with larger batches I might run out of memory.

I also understand that the mechanisms involved in data synchronisation are complex (GPU-to-GPU copies are synchronous, compared to CPU-to-GPU copies, which can be asynchronous).

Finally some results:
4x V100 took: 0:32:51 to run 50 epochs at 128 batch size (50,000 samples in total) from CPU-to-GPU
1x V100 took: 0:36:44 to run 50 epochs at 128 batch size (50,000 samples in total) from CPU-to-GPU
1x 2080Ti took: 0:19:44 to run 50 epochs at 128 batch size (20,000 samples in total) from GPU-only

So I am wondering if there is something I can do to orchestrate data split across multiple GPUs. I imagine one solution would be to extend nn.DataParallel so that, if multiple GPUs are present, each batch (or multiple batches) runs in parallel, with model copies across all GPUs. This would require some form of fusing the models at the end of the batches.

Another approach would be to do GPU-to-GPU copies to ensure that the batch tensors and model are on the same device, though this may become a bottleneck.
If this is not the place to have this conversation, I am willing to contribute to PyTorch in order to work on this, as I believe it would be very beneficial for people who work with smaller data-sets (smaller being relative, as 4x V100 can hold 64 GB of data, excluding the model size and batch data).