Data Parallelism Doesn't speed up training

Hello, I heard about the super simple API for data parallelism in PyTorch, so I decided to give it a try. After profiling, though, I found almost identical results with and without the parallelism feature (despite seeing all 4 GPUs active during training). In each case I get roughly:
duration: 56.92420029640198, loss: 2.6403932571411133

Code Comments:
I’m using the Transformer XL pretrained model from Hugging Face and a custom training loop.
I tried to keep the info as minimal as possible. The background to my training loop is that I was getting memory leaks unless I allocated a reusable mini_batch tensor. I’m concerned this could be related, as I’m not sure how (or whether) I need to distribute this tensor manually to each GPU as well…


# ...
model = AutoModelWithLMHead.from_pretrained('xlnet-base-cased').to('cpu')
# ...

import torch as pt
from torch import optim

# NOTE: this is the 'memory efficient version' for CUDA
# pad_len := max number of tokens (input sequences padded to this length)
def train_loop(model, input_output_data, epochs=5, batch_size=256, pad_len=200):
    input, output = input_output_data
    n_examples = len(input)
    n_batches = int(n_examples/batch_size+0.99999)
    model.train() # turn on training
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    # helpers for tensor dict manipulation
    slice_inputs = lambda x, a, b: {k:x[k][a:b] for k in x}
    cast_inputs = lambda x, device='cuda': {k:x[k].to(device) for k in x}
    def assign_dict(a,b):
        for k in b:
            a[k][:] = b[k]
    # they must be padded to the same size for batching to work...
    all_inputs = tokenizer(input, return_tensors='pt', padding='max_length',
                           truncation=True, max_length=pad_len)
    all_outputs = tokenizer(output, return_tensors='pt', padding='max_length',
                           truncation=True, max_length=pad_len)
    all_inputs = cast_inputs(all_inputs, 'cpu')
    all_outputs = cast_inputs(all_outputs, 'cpu')
    # The idea was to have a reusable mini-batch tensor to avoid memory leaks...
    inputs = slice_inputs(all_inputs, 0, batch_size)
    outputs = slice_inputs(all_outputs, 0, batch_size)
    inputs = cast_inputs(inputs, 'cuda')
    outputs = cast_inputs(outputs, 'cuda')

    last_loss = None
    for i in range(epochs):
        print(f'epoch: {i+1}/{epochs}')
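        for j in range(n_batches):
            try:
                a, b = (j*batch_size, (j+1)*batch_size)
                assign_dict(inputs, slice_inputs(all_inputs, a, b))
                assign_dict(outputs, slice_inputs(all_outputs, a, b))
                # optimizer calls are assumed; they appear to be missing from the posted snippet
                optimizer.zero_grad()
                loss = model(**inputs, labels=outputs['input_ids'])[0].mean()
                loss.backward()
                optimizer.step()
            except Exception as e:
                # the handler body is missing from the post; reporting and skipping the batch is an assumption
                print(f'batch: {j+1}/{n_batches} failed: {e}')
                continue
            pt.cuda.empty_cache()  # run every batch (exact placement assumed), see the discussion below
            print(f'batch: {j+1}/{n_batches}, loss: {loss}')
            last_loss = loss.item()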
    return last_loss

Here is the code I use to actually perform the training (I switch model=serial_model to model=fast_model for comparison):

# Train and Profile
import time
import torch as pt

fast_model = pt.nn.DataParallel(model.to('cuda'))
serial_model = model.to('cuda:0')
model = serial_model # switch to fast_model for comparison

start = time.time()
final_loss = train_loop(model, (input, output), batch_size=10, epochs=1)
duration = time.time() - start
print(f'duration: {duration}, loss: {final_loss}')

P.S. LMK if you want full code, I tried to keep it minimal.
Thanks for your help!

What is the amount of time spent loading data vs. doing computation on each of the GPUs?

Additionally, the use of torch.cuda.empty_cache() can add overhead if it is done for every batch. Is there some kind of issue where other processes are using the same GPU? Otherwise, this seems like it would be unnecessary.

I’m not sure exactly. Judging by nvidia-smi, I’d estimate 50% of the time is spent loading and 50% doing computation (this estimate is based on the % of time I saw each GPU being utilized).

You’re right, it is probably not necessary. It is just a remnant of a previous solution I tried in order to deal with a memory leak (which is now gone).

I will retest without clearing the cache and get back to you.

In that case you might see more improvement if you parallelize the data loading. Have you tried using the data loaders in torch.utils.data (see the PyTorch 1.9.0 documentation) for this part?
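To make the data loader suggestion concrete, here is a minimal sketch. It assumes the tokenized inputs and labels are already plain tensors; the shapes, the vocabulary size of 1000, and the batch size of 16 are made-up stand-ins, not values from the thread.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 64 sequences of 200 token ids (hypothetical shapes).
input_ids = torch.randint(0, 1000, (64, 200))
labels = torch.randint(0, 1000, (64, 200))
dataset = TensorDataset(input_ids, labels)

loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    # 0 loads batches in the main process; raising it moves loading into
    # background worker processes, which is the parallel part.
    num_workers=0,
    # Pinned host memory allows faster, asynchronous host-to-GPU copies.
    pin_memory=torch.cuda.is_available(),
)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
n_seen = 0
for batch_inputs, batch_labels in loader:
    # non_blocking only has an effect for pinned memory copied to a GPU.
    batch_inputs = batch_inputs.to(device, non_blocking=True)
    batch_labels = batch_labels.to(device, non_blocking=True)
    n_seen += batch_inputs.shape[0]
```

num_workers is left at 0 here so the snippet runs anywhere as-is. With a value above 0 in a script, the loop should sit under an `if __name__ == '__main__':` guard on platforms that spawn worker processes.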

No, I haven’t. I will try that next, thanks!

That said, I have already tested preloading the entire dataset onto the GPUs. So aside from data loading (and I know PyTorch recommends using DistributedDataParallel), what else could be causing such small speed gains from 4x parallelism?

There may indeed be a bottleneck from using one process instead of one process per GPU, but seeing no speedup suggests that it is indeed something else that is bottlenecking the GPUs… Generally it is preferable to get close to 100% utilization on a single GPU before moving to multiple GPUs.
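One thing that may be worth checking on the utilization point is how much work each GPU actually receives. DataParallel splits the input by chunking along the batch dimension, which the sketch below approximates with torch.chunk (an approximation of the split, not DataParallel's internal code path). The batch size of 10 and the 4 GPUs come from the posted profiling run; the sequence length of 200 is the default pad_len.

```python
import torch

batch_size, n_gpus = 10, 4

# A dummy batch shaped like the padded token ids: (batch, pad_len).
batch = torch.zeros(batch_size, 200, dtype=torch.long)

# Split along dim 0, the way the batch would be divided across devices.
chunks = torch.chunk(batch, n_gpus, dim=0)
per_gpu = [c.shape[0] for c in chunks]
print(per_gpu)  # [3, 3, 3, 1]
```

With only one to three sequences per device, the per-step overhead of replicating the model and gathering outputs can easily outweigh the compute, so a larger batch size may be needed before the extra GPUs pay off.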

I am actually seeing some speedup, just not as much as I expected…
Currently, when testing with the data preloaded onto the GPU (slightly faster than parallel data loading, I imagine, but less scalable), I am getting a time reduction from 42 secs → 25 secs per epoch (thanks to your help!).

That’s almost a 2x speedup with 4 GPUs. It’s not bad, but I’m wondering if I should be seeing more?
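For reference, the numbers above work out as follows. The Amdahl's law step is an added assumption rather than something measured in this thread: it treats the epoch as a fixed serial part plus a part that scales perfectly across GPUs, which is only a rough model.

```python
t_before, t_after, n_gpus = 42.0, 25.0, 4

speedup = t_before / t_after    # 42 / 25 = 1.68
efficiency = speedup / n_gpus   # 1.68 / 4 = 0.42

# Amdahl's law: speedup = 1 / ((1 - p) + p / n), solved for the
# parallelizable fraction p given the observed speedup.
p = (1 - 1 / speedup) / (1 - 1 / n_gpus)

print(f'speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}, parallel fraction: {p:.0%}')
```

Under that model roughly half of each epoch is still spent in work that does not scale with the number of GPUs, which is consistent with utilization arriving in spurts.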

P.S. I am seeing 100% GPU utilization on each GPU in spurts but not consistently.

At this stage it might be useful to drill down on how time is spent in the training loop by adding time.time() calls around each step and working out the relative cost of each operation. Note that GPU operations run asynchronously, so for any part containing GPU operations you would want to call torch.cuda.synchronize() before starting the timer and again before stopping it, to ensure that the timing information is accurate.

Once you’ve optimized the bottlenecks, you would then want to remove these synchronize calls to reduce the overhead.
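A minimal sketch of that kind of timing is below. The helper name `timed` and the matrix multiply used as the workload are illustrative only; in the training loop you would wrap the forward pass, backward pass, optimizer step, and data copies separately.

```python
import time
import torch

def timed(fn, *args, **kwargs):
    """Run fn and return (result, seconds), synchronizing CUDA around the call."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # finish earlier queued GPU work first
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for this call's GPU work to complete
    return result, time.perf_counter() - start

device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.randn(512, 512, device=device)
y, matmul_seconds = timed(torch.matmul, x, x)
print(f'matmul: {matmul_seconds:.6f}s')
```

Without the second synchronize, the timer would mostly measure how long it takes to queue the kernel rather than how long it takes to run.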

Thanks! I’ll do that. You’ve been super helpful!
