Multi GPU Training is out of sync

Hello,

I am in the process of training a PyTorch model across multiple GPUs (using DDP).

I had been under the impression that synchronisation happened automatically, and epochs did appear to occur at approximately the same time on each GPU.

However, upon running longer jobs, I have found that the two GPUs gradually drift out of sync, with one GPU still processing much earlier epochs than the other.

I have tried adding torch.cuda.synchronize() at the end of each epoch, but this doesn’t appear to make any difference.

Do you have any advice on how to move forward with this problem?

Thanks,
Oliver

DDP synchronizes the gradients in the backward pass (via buckets), and the optimizer.step() call is performed on all ranks before the next training iteration starts, to make sure the models stay in the same state. Do you see a difference in model parameters during the forward pass?
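For example, one quick way to compare the parameter state across ranks is to all-reduce a parameter checksum and look at the spread (a minimal sketch, assuming the default process group is already initialised and the model lives on the local GPU; the helper name is just for illustration):

```python
import torch
import torch.distributed as dist

def check_param_sync(model):
    # Sum all parameters into a single scalar on this rank...
    with torch.no_grad():
        local = sum(p.double().sum() for p in model.parameters())
    # ...then compare the max and min of that scalar across all ranks.
    hi, lo = local.clone(), local.clone()
    dist.all_reduce(hi, op=dist.ReduceOp.MAX)
    dist.all_reduce(lo, op=dist.ReduceOp.MIN)
    if dist.get_rank() == 0:
        print(f"param-sum spread across ranks: {(hi - lo).item():.6e}")
```

A spread close to zero means the replicas hold (almost) the same parameters.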

The model is rather large.

I am not entirely sure how to best monitor the parameters.

Is there a simple way to check if the state has changed?

As a quick check I have printed the sum of the parameters before and after optimiser.step() on each GPU:

[19:08:02] [GPU 1] (Epoch 0) BEFORE 1256.03098
[19:08:03] [GPU 1] (Epoch 0) AFTER 1256.06498
[19:08:03] [GPU 0] (Epoch 0) BEFORE 1256.03098
[19:08:03] [GPU 0] (Epoch 0) AFTER 1256.06498

[19:08:05] [GPU 1] (Epoch 1) BEFORE 1256.06498
[19:08:05] [GPU 1] (Epoch 1) AFTER 1256.12020
[19:08:07] [GPU 0] (Epoch 1) BEFORE 1256.06498
[19:08:07] [GPU 0] (Epoch 1) AFTER 1256.11947

[19:08:09] [GPU 1] (Epoch 2) BEFORE 1256.12020
[19:08:09] [GPU 1] (Epoch 2) AFTER 1256.28199
[19:08:11] [GPU 0] (Epoch 2) BEFORE 1256.11947
[19:08:11] [GPU 0] (Epoch 2) AFTER 1256.27338
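For reference, the check itself is roughly this (a simplified fragment; `rank`, `epoch`, `model`, and `optimiser` come from my training loop, and the timestamps are added by my logger):

```python
import torch

def param_sum(model):
    # Sum every parameter into one scalar as a crude fingerprint of the model state.
    with torch.no_grad():
        return sum(p.double().sum().item() for p in model.parameters())

print(f"[GPU {rank}] (Epoch {epoch}) BEFORE {param_sum(model):.5f}")
optimiser.step()
print(f"[GPU {rank}] (Epoch {epoch}) AFTER {param_sum(model):.5f}")
```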

Thanks,
Oliver

“and the optimizer.step() call is performed on all ranks before the next training iteration starts”

This doesn’t appear to be the case for me.

I have been following the approach found here: Multi GPU training with DDP — PyTorch Tutorials 2.3.0+cu121 documentation

optimizer.step() is called at different times on each GPU, depending on when that GPU happens to reach that line in the code, and the parameter updates don’t seem to be communicated to other GPUs.

Parameter updates won’t be communicated, but the gradients will be all-reduced before the optimizer.step() call is performed as explained before.

Your “sum-check” indicates the parameters are almost equal. So far nothing points to the GPUs being out of sync.
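To make the mechanism concrete, each iteration effectively does the following (a conceptual sketch only, not DDP’s actual bucketed implementation; `model`, `criterion`, `optimizer`, `inputs`, and `targets` stand in for your training code):

```python
import torch.distributed as dist

# Forward/backward on this rank's local batch. During backward, DDP
# all-reduces (averages) the gradients across ranks, overlapping the
# communication with the backward computation via its gradient buckets.
loss = criterion(model(inputs), targets)
loss.backward()

# Conceptually equivalent to doing the averaging by hand after backward:
# for p in model.parameters():
#     dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
#     p.grad /= dist.get_world_size()

# Every rank now holds the same (averaged) gradients and the same parameters,
# so the purely local optimizer step produces the same new parameters on
# every rank, up to floating point non-determinism.
optimizer.step()
```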

Sorry, I seem to have a fundamental misunderstanding about how this process works.

I was under the impression that the GPUs communicated so that each local copy of the model has the same gradients.

Then, when the optimizer step is performed, the local copies of the model complete the epoch together, reaching the same end state.

If each GPU produces an entirely distinct model with different parameter values, then what is the value of distributed training?

Currently, I am adding a while loop at the end of each epoch that checks whether a checkpoint file exists for that epoch, because without it in place the GPUs drift many epochs apart.
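Roughly like this (a simplified sketch; the checkpoint file name is a placeholder for my actual naming scheme):

```python
import os
import time

# Crude end-of-epoch workaround: block until the other rank has written its
# checkpoint for this epoch, so neither GPU can run ahead.
other_rank = 1 - rank  # two-GPU case
while not os.path.exists(f"checkpoint_epoch_{epoch}_rank_{other_rank}.pt"):
    time.sleep(5)
```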

Thanks,
Oliver

A more pronounced example is seen here:

[14:22:36] [GPU 1] (Epoch 13) ----End of Epoch | 1061.04428,

[14:22:39] [GPU 0] (Epoch 13) ----End of Epoch | 1200.90550,

The sums of the parameters are much further apart, and GPU 1 starts the next epoch 3 seconds before GPU 0.

If I ended training at this point (the end of epoch 13), which model would I save? If I save the model on GPU 0, then would I not be losing all the training that took place on GPU 1?

Thanks,
Oliver

“I was under the impression that the GPUs communicated so that each local copy of the model has the same gradients.”

This is correct and is also explained in the design docs.

“Then, when the optimizer step is performed, the local copies of the model complete the epoch together, reaching the same end state.”

This is also correct, as explained before and shown by the output in your previous post.

“If each GPU produces an entirely distinct model with different parameter values, then what is the value of distributed training?”

That’s not the case. Since the parameters and gradients are (almost) equal, the (almost) same models will be created, up to floating point precision errors.

If you have a reproduction showing this is not the case, please post it.

Unfortunately, my code base has grown rather complicated, so I don’t have a convenient reproduction of this strange behaviour.

I use “model.module.forward” in place of “model.forward”; is it possible that this is the reason why the synchronisation is not taking place?

If jobs are run for long enough, one GPU will get many epochs ahead of the other, so I know that they aren’t performing these optimisation steps together.

Is there perhaps some way to monitor these “buckets”, to check that the gradients are actually being broadcast and received?

Thanks,
Oliver

Yes, if you access the internal .module directly, you will skip the DDP logic, as it depends on the direct model call via output = model(input).
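In other words (a minimal sketch, with `net` as the local model and `rank` as the local device index):

```python
import torch

model = torch.nn.parallel.DistributedDataParallel(net, device_ids=[rank])

output = model(inputs)            # correct: runs through the DDP wrapper,
                                  # so its gradient-sync hooks are triggered
# output = model.module(inputs)   # bypasses the wrapper: no bucketed
                                  # all-reduce, so the ranks silently drift apart
```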


I have been using “model.module.forward”, because when I use “model.forward” I encounter an error.

The error is raised when I reach the backward pass and always states that the “version” of one of the tensors is one more than it should be (see the example below):

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.DoubleTensor [87, 4]], which is output 0 of AddBackward0, is at version 1; expected version 0 instead.

This error is avoided if I am using a single GPU or if I am using “model.module.forward”, so I guess the unexpected modification is somehow linked to DDP?

EDIT:

In particular, the error is not raised when the first GPU reaches the backward pass; it is raised when the second GPU reaches the backward pass.

Thanks,
Oliver

You would need to narrow down which tensor is modified inplace and how it correlates with the usage of DDP. From the error message I wouldn’t know which operation is causing it and why it’s specific to DDP.
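One way to narrow it down is autograd anomaly detection: it records the forward-pass traceback of each operation, so when the inplace error fires in backward() PyTorch also prints where the offending tensor was produced (a debugging sketch; `model`, `criterion`, `inputs`, and `targets` are placeholders, and it slows training down, so only enable it while hunting the bug):

```python
import torch

# Re-run one training step with anomaly detection enabled to get the
# forward-pass traceback of the tensor that is being modified in-place.
with torch.autograd.set_detect_anomaly(True):
    output = model(inputs)
    loss = criterion(output, targets)
    loss.backward()
```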

I suggest you use ‘Accelerate’; it automatically distributes torch processing over the available GPUs and has features for synchronising the parallel processing. See Methods and tools for efficient training on a single GPU.
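The basic pattern looks roughly like this (a minimal sketch, not tailored to your code; `loss_fn` and the batch keys are placeholders):

```python
from accelerate import Accelerator

accelerator = Accelerator()  # detects the available GPUs / launch configuration
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    optimizer.zero_grad()
    loss = loss_fn(model(batch["x"]), batch["y"])
    accelerator.backward(loss)   # handles gradient synchronisation across devices
    optimizer.step()
```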
Nick
