Data parallelism on a single GPU

According to the PyTorch tutorial at https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html,
data parallelism works across multiple GPUs by creating a replica of the model on each GPU.

Is it possible to use data parallelism on a single GPU device by using more memory on the same device to create replicas of the model and parallelizing the training of different batches on these replicas of the model?

My model is only three convolutional layers deep, but I use a large batch size. So the work could, in principle, be parallelized by processing smaller sub-batches on different replicas of the model.

Thanks in advance.

You could try to execute the same script multiple times and check whether the overall performance really increases.
Besides the GPU memory, the actual computation uses compute resources on the device and can saturate it completely, in which case multiple scripts will just enqueue their kernels on the device.

The GPU-Util output in nvidia-smi gives the percentage of time the compute engines were busy during the last sampling interval, which you could use as an indicator.
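
To make that comparison concrete, here is a minimal throughput sketch (the model definition, input shape, and batch size are placeholder assumptions, not the original code). You could run it once, then launch it from two terminals at the same time while watching nvidia-smi; if the per-process images/s roughly halves, a single process was already saturating the device.

```python
import time
import torch
import torch.nn as nn

# Hypothetical stand-in for the three-conv-layer model; shapes are assumptions.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
).cuda()
x = torch.randn(256, 3, 64, 64, device="cuda")  # assumed batch size / input size

# Warm-up so context creation and cudnn autotuning don't skew the measurement.
for _ in range(10):
    model(x)
torch.cuda.synchronize()

iters = 200
start = time.perf_counter()
for _ in range(iters):
    model(x)
torch.cuda.synchronize()  # wait for all queued kernels before stopping the clock
elapsed = time.perf_counter() - start
print(f"{iters * x.size(0) / elapsed:.1f} images/s")
```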


Thanks for your response.
In general, doesn't PyTorch automatically parallelize the forward pass of a model?

Could you please check the following results from the autograd profiler?
They show the computation time of the forward pass through the network for different batch sizes. If the forward pass were parallelized, larger batch sizes should reduce the computation time per sample.
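
(A minimal sketch of how such forward-pass timings across batch sizes could be collected with the legacy autograd profiler; the model definition and input shapes below are placeholder assumptions, not the actual code.)

```python
import torch
import torch.nn as nn

# Placeholder three-conv-layer model; channel counts and input size are assumptions.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
).cuda()

for batch_size in [16, 64, 256]:
    x = torch.randn(batch_size, 3, 64, 64, device="cuda")
    model(x)  # warm-up so one-time initialization is not profiled
    torch.cuda.synchronize()
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        model(x)
    print(f"batch_size={batch_size}")
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))
```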

Also, I don’t understand why the CPU time avg and CUDA time avg values are almost identical.

Thanks.

PyTorch tries to max out the GPU utilization for the executed operations, so multiple script executions might not be able to run in parallel.
The CPU time might include the accumulated GPU execution time, but I’m not completely sure, as I usually either profile operations in isolation (using torch.cuda.synchronize() and a timer) or use nvprof/nsys.
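
A minimal sketch of the synchronize-and-timer approach mentioned above (the convolution and tensor shapes are arbitrary examples):

```python
import time
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda()  # arbitrary example op
x = torch.randn(128, 3, 64, 64, device="cuda")            # assumed input shape

# Warm-up iterations so one-time costs are excluded from the timing.
for _ in range(10):
    conv(x)
torch.cuda.synchronize()

iters = 100
start = time.perf_counter()
for _ in range(iters):
    conv(x)
# CUDA kernels are launched asynchronously; synchronize before reading the clock.
torch.cuda.synchronize()
print(f"avg forward time: {(time.perf_counter() - start) / iters * 1e3:.3f} ms")
```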

I haven’t used either nvprof/nsys or torch.cuda.synchronize().

I have just used PyTorch’s autograd profiler with use_cuda=True and wrapped these functions in torch.autograd.profiler.record_function().
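
For reference, that usage roughly follows this pattern (the model and the region labels here are placeholders, not the actual code):

```python
import torch
import torch.nn as nn
from torch.autograd.profiler import profile, record_function

model = nn.Conv2d(3, 64, 3, padding=1).cuda()   # placeholder for the actual model
x = torch.randn(64, 3, 64, 64, device="cuda")

with profile(use_cuda=True) as prof:
    with record_function("forward"):    # names this region in the profiler table
        out = model(x)
    with record_function("backward"):
        out.sum().backward()

print(prof.key_averages().table(sort_by="cuda_time_total"))
```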

Is that not the right way to profile in PyTorch? Also, I don’t understand how nvprof is used with CUDA events. Could you explain it?

It might be the right way to profile the code, but as I said, I’m not familiar enough with its output and think that the CPU time might accumulate the GPU runtime.

I usually use markers and nvprof/nsys, as this gives me finer-grained control over the methods I would like to profile.

You could use nvtx markers and have a look at this code snippet for an example use case.
You can find more information about Nsight Systems here.
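
As a rough illustration of the nvtx-marker approach (the model below is a placeholder, and the shown nsys command is just one common way to capture a trace):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, 3, padding=1).cuda()  # placeholder for the actual model
x = torch.randn(64, 3, 64, 64, device="cuda")

# Warm-up outside the annotated region.
for _ in range(5):
    model(x)
torch.cuda.synchronize()

torch.cuda.nvtx.range_push("iteration")
torch.cuda.nvtx.range_push("forward")
out = model(x)
torch.cuda.nvtx.range_pop()   # forward
torch.cuda.nvtx.range_push("backward")
out.sum().backward()
torch.cuda.nvtx.range_pop()   # backward
torch.cuda.nvtx.range_pop()   # iteration
torch.cuda.synchronize()
```

Running the script under Nsight Systems, e.g. with something like `nsys profile -o trace python script.py`, should show the labeled ranges on the timeline next to the launched kernels.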


Is it possible to use data parallelism on a single GPU device by using more memory on the same device to create replicas of the model and parallelizing the training of different batches on these replicas of the model?

Why not just use a larger batch size? DDP is essentially equivalent to training with a larger global batch.