Data Parallel Experiments

ankur6ue · February 1, 2019, 4:32pm

Hello, I conducted some experiments to understand the different components of batch processing time and how they can be lowered by increasing parallelism. Two types of parallelism can be exploited: data can be loaded in parallel using multiple processes, and data can be processed in parallel on multiple GPUs using data parallelism. I didn’t consider distributed data parallelism in these experiments (yet).

The dataset used is imagenet-200, consisting of 500 images for each of 200 classes. Thus, there are 100,000 total images to be processed in an epoch of training. For a batch size of 256, this results in roughly 390 batches per epoch (390 full batches plus one partial batch of 160 images).
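The batch count above can be checked with a few lines of arithmetic. This is only a sketch; the variable names are illustrative and not taken from the experiment code.

```python
import math

num_classes = 200
images_per_class = 500
batch_size = 256

num_images = num_classes * images_per_class

# Number of complete batches and the size of the leftover partial batch.
full_batches, remainder = divmod(num_images, batch_size)

# Number of iterations per epoch if the partial batch is kept.
total_batches = math.ceil(num_images / batch_size)

print(num_images, full_batches, remainder, total_batches)
```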

The time to process a batch can be split into data loading time, transfer time and processing time, as defined below.

• Data loading time: this is the time taken by the dataloader to load data from the disk into memory
• Transfer time: this is the time to transfer data from CPU RAM to GPU global memory (tensor.cuda())
• Processing time: the time taken to run the forward pass, loss calculation, backward pass and parameter updates. It doesn’t include the time taken to transfer data from CPU RAM to GPU global memory
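One way to separate the three components defined above is to read a wall-clock timer at the boundaries of each stage and to synchronize with the GPU before each reading, since CUDA copies and kernels are queued asynchronously. The following is a minimal sketch under that assumption, not the code used for the report. The helper names `sync` and `time_epoch`, the tiny synthetic dataset and the linear model are all hypothetical stand-ins, and the code falls back to the CPU when no GPU is present.

```python
import time

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def sync(device):
    # CUDA work is queued asynchronously; wait for it before reading the clock.
    if device.type == "cuda":
        torch.cuda.synchronize(device)


def time_epoch(model, loader, optimizer, criterion, device):
    """Accumulate data loading, transfer and processing time over one epoch."""
    totals = {"load": 0.0, "transfer": 0.0, "process": 0.0}
    model.train()
    t0 = time.perf_counter()
    for inputs, targets in loader:
        t1 = time.perf_counter()  # the batch is now available in CPU memory

        inputs = inputs.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)
        sync(device)
        t2 = time.perf_counter()  # the copy to the device has finished

        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        sync(device)
        t3 = time.perf_counter()  # forward, backward and update have finished

        totals["load"] += t1 - t0
        totals["transfer"] += t2 - t1
        totals["process"] += t3 - t2
        t0 = t3
    return totals


# Tiny synthetic stand-in for the real dataset and network.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 4, (64,)))
loader = DataLoader(dataset, batch_size=16, num_workers=0)
model = nn.Linear(8, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
totals = time_epoch(model, loader, optimizer, nn.CrossEntropyLoss(), device)
print(totals)
```

Note that the synchronization after the copy removes any overlap between transfer and computation, so this kind of measurement trades some throughput for a clean split between the stages.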

I considered two networks – Resnet 18 and Resnet 50. I analyzed three cases:

1- base case with batch size of 64, num_workers = 4, num_gpus = 1
2- Data parallel with a batch size of 256, num_workers = 4, num_gpus = 4
3- Data parallel as in case 2, but with pin_memory = False in the dataloader and non_blocking = False in the .cuda() calls. Thus, loaded data is not copied to pinned memory by the dataloader. This allows for analyzing the importance of asynchronous data transfers.
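The difference between cases 2 and 3 comes down to two flags that are normally switched together. A minimal sketch of how such a configuration might be set up is shown below; `make_loader`, `move_batch` and `wrap_model` are hypothetical helper names, and the small synthetic dataset stands in for the real one.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def make_loader(dataset, batch_size, num_workers, pinned):
    # pin_memory=True asks the loader to place each batch in page-locked memory.
    return DataLoader(dataset, batch_size=batch_size, shuffle=True,
                      num_workers=num_workers, pin_memory=pinned)


def move_batch(inputs, targets, device, pinned):
    # Request an asynchronous copy only when the source batch is pinned.
    return (inputs.to(device, non_blocking=pinned),
            targets.to(device, non_blocking=pinned))


def wrap_model(model, device):
    model = model.to(device)
    # Replicate the model across GPUs only when more than one is available.
    if device.type == "cuda" and torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    return model


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dataset = TensorDataset(torch.randn(32, 8), torch.randint(0, 4, (32,)))

# Case 3 style configuration: no pinned memory, blocking copies.
loader = make_loader(dataset, batch_size=16, num_workers=0, pinned=False)
model = wrap_model(nn.Linear(8, 4), device)
inputs, targets = next(iter(loader))
inputs, targets = move_batch(inputs, targets, device, pinned=False)
outputs = model(inputs)
```

Passing `pinned=True` to both helpers (on a machine with a GPU) gives the case 2 style configuration.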

Report describing the experiments is here:

https://drive.google.com/open?id=1hy39b5PwimJfT3fTKhgTa7FwGhXDQE_7

Most of the results were as expected, but there were a few surprises. I’d really appreciate it if someone from the PyTorch team could shed some light:

1- As shown in Table 5, the processing time per GPU for 4 GPUs for a batch size of 1024 turns out to be lower than the processing time for 1 GPU for a batch size of 256. This doesn’t make sense: the processing time in data parallel mode should be strictly higher due to the data parallel overhead (transferring the gradients from the slave GPUs to the parameter server, calculating parameter updates and then updating the model on each slave GPU).
2- The PyTorch documentation says: "Once you pin a tensor or storage, you can use asynchronous GPU copies. Just pass an additional non_blocking=True argument to a cuda() call. This can be used to overlap data transfers with computation." Question: is the converse true, i.e., when we set pin_memory = False, do we need to set non_blocking = False in the .cuda() calls?
3- With pin_memory = False and non_blocking = False, I expected the data loading time to be lower, as the data loader now only needs to copy data to local memory, not pinned memory. However, this time is now higher. Furthermore, I expected the processing time on the GPU to be unaffected, but it is now lower. The transfer time is now non-negligible, which is as expected. Can anyone shed light on this? I think what may be going on is that the transfer to pinned memory is itself asynchronous, so with pin_memory = False the data is actually being written to local memory, adding to the loading time, whereas before the latency of the asynchronous write to pinned memory was hidden.

Appreciate any thoughts/feedback!
-Ankur

ankur6ue · February 1, 2019, 9:15pm

Upon reading the docs, the answer to the second question is clear: if pin_memory is set to False, then setting non_blocking = True has no effect.

Looking at the code for the data loader, setting pin_memory = True sets up an additional thread which reads from the output queue of the worker processes, copies each batch into pinned memory and puts the pinned batch into another queue, which is then read on the next call for a batch.
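The queue arrangement described above can be illustrated with a simplified, pure-Python sketch. This is not the actual DataLoader implementation: the names `pin_loop` and `iterate_pinned` and the `pin_fn` argument are hypothetical, and the worker side is reduced to a plain loop that fills the first queue. With real tensors on a machine with a GPU, `pin_fn` could be something like `lambda batch: batch.pin_memory()`.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the data


def pin_loop(worker_result_queue, data_queue, pin_fn):
    # Runs in a background thread: take each loaded batch, "pin" it,
    # and hand it to the queue that the training loop reads from.
    while True:
        batch = worker_result_queue.get()
        if batch is _DONE:
            data_queue.put(_DONE)
            return
        data_queue.put(pin_fn(batch))


def iterate_pinned(batches, pin_fn):
    worker_result_queue = queue.Queue()
    data_queue = queue.Queue()
    pin_thread = threading.Thread(
        target=pin_loop,
        args=(worker_result_queue, data_queue, pin_fn),
        daemon=True,
    )
    pin_thread.start()

    # Stand-in for the workers: push loaded batches into the first queue.
    for batch in batches:
        worker_result_queue.put(batch)
    worker_result_queue.put(_DONE)

    # The consumer only ever sees batches that have passed through pin_fn.
    while True:
        item = data_queue.get()
        if item is _DONE:
            break
        yield item
    pin_thread.join()


pinned = list(iterate_pinned([1, 2, 3], lambda b: ("pinned", b)))
print(pinned)
```

Because the pinning step runs in its own thread, its cost can overlap with the consumer's work, which is consistent with the hidden latency suspected in the third question.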

Also, I believe that with asynchronous data transfer, the measured processing time includes the data transfer time as well. When I put a torch.cuda.synchronize() after the .cuda() calls, the transfer time becomes non-negligible and the processing time drops by the same amount. The two add up to the processing time measured with synchronous data transfer.
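The effect of the synchronize call on the measurement can be sketched as follows. This is an illustration rather than the code from the experiments: `timed_copy` is a hypothetical helper, the batch shape is arbitrary, and on a machine without a GPU both measurements are simply trivial host-side timings.

```python
import time

import torch


def timed_copy(batch, device, wait_for_copy):
    """Copy a batch to the device and return it with the elapsed wall time."""
    start = time.perf_counter()
    out = batch.to(device, non_blocking=True)
    if wait_for_copy and device.type == "cuda":
        # Without this, the timer may stop before the copy has completed,
        # and the remaining transfer time shows up in whatever is timed next.
        torch.cuda.synchronize(device)
    return out, time.perf_counter() - start


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch = torch.randn(64, 3, 32, 32)
if device.type == "cuda":
    batch = batch.pin_memory()  # asynchronous copies need a pinned source

_, enqueue_time = timed_copy(batch, device, wait_for_copy=False)
copied, full_time = timed_copy(batch, device, wait_for_copy=True)
print(enqueue_time, full_time)
```

On a GPU, the first CUDA call also pays for context initialization, so a warm-up copy before timing gives more representative numbers.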