Help Needed with DDP Training on Multiple Nodes: Unexpectedly Low Accuracy with ResNet50 and MoCo v3

Hello PyTorch Community,

I am a beginner in distributed training using PyTorch and have been facing some issues with Distributed Data Parallel (DDP) training. Despite my efforts, I am unable to achieve the expected results and would greatly appreciate any guidance or recommendations.

System Setup:
I have 7 nodes managed by a PBS script, which is used to submit jobs to the scheduler.
I am using torchrun to execute the training on each node with the following command:

torchrun --nproc_per_node=1 --node_rank=$RANK --nnodes=$NNODES --master_addr=$MASTER --master_port=9977 --max-restarts=3 main.py

Problem Description:

The distributed training runs without errors, and I do observe a speedup. However, the resulting accuracy is significantly lower than expected.

ResNet50 on ImageNet:

  • Training ResNet50 on the ImageNet dataset in a supervised setting.
  • The best top-1 accuracy I can achieve is 30%.

MoCo v3 on Full ImageNet:

  • Training MoCo v3 on the full ImageNet dataset.
  • After 100 epochs, the best top-1 accuracy I achieve is 40%, whereas the authors report 69%.

Key Details:
I am using the exact parameters recommended by the authors of MoCo v3.
The only difference in my setup is the use of torchrun instead of mp.spawn.

Request for Help:
I have been stuck on this issue for almost two months and am unable to understand the discrepancies in the results. Here are my specific questions:

  • Could the use of torchrun instead of mp.spawn be causing this issue? If so, how can I mitigate this?

  • Are there any common pitfalls or configuration issues in DDP that might lead to such a significant drop in accuracy?

  • Is there any additional debugging or logging I should enable to better understand the problem?

  • Any other suggestions or recommendations that could help me align my results with the expected benchmarks?

I would greatly appreciate any insights or advice from those who have experience with distributed training and similar setups. Thank you in advance for your time and assistance.

If you are unsure whether your distributed training setup is correct, one thing you can do is choose a small global batch size and compare convergence when running that global batch size on a single GPU vs. on multiple GPUs with DDP.

For example, if you have 4 GPUs, you can choose a global batch size of 4. On a single GPU, you run with a batch size of 4 and record the loss curve. On four GPUs with DDP, you run with a local batch size of 1 and record the curve. If the curves do not match well, then there is probably something wrong in the setup.
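
As a rough sketch of how that comparison is set up (plain Python, no PyTorch required): `local_batch_size` and `shard_indices` are hypothetical helper names, and the strided split only imitates how `DistributedSampler` divides an unshuffled dataset whose length is divisible by the world size. The point is just that the per-rank batch size shrinks so the global batch stays fixed, and that every sample is seen by exactly one rank per epoch.

```python
from typing import List


def local_batch_size(global_batch_size: int, world_size: int) -> int:
    """Per-rank batch size that keeps the global batch size fixed."""
    if global_batch_size % world_size != 0:
        raise ValueError("global batch size must be divisible by world size")
    return global_batch_size // world_size


def shard_indices(num_samples: int, world_size: int, rank: int) -> List[int]:
    """Strided split of sample indices across ranks (illustrative only)."""
    return list(range(num_samples))[rank::world_size]


world_size = 4
global_bs = 4
per_rank_bs = local_batch_size(global_bs, world_size)

# Toy dataset of 8 samples, split across the 4 ranks.
shards = [shard_indices(8, world_size, r) for r in range(world_size)]

# Every sample should be seen by exactly one rank per epoch.
covered = sorted(i for shard in shards for i in shard)
```

If the shards overlapped or missed samples in your real setup (for example, because the sampler was not given the right rank or world size), the two curves would not be expected to match.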

Hello,

Thanks for the reply. I did the experiment and I am unsure if the curves match very well. Here are the details and results:

Model and Dataset:

  • Model: ResNet-18
  • Dataset: Custom image dataset
  • Number of Classes: 30

Hyperparameters:

What is your loss function?

Depending on the loss function, you may need to reduce (e.g. average) the loss across DDP workers to make it comparable to the single-GPU value. The validation curve seems to match reasonably.

I am using two different loss functions depending on the training scenario:

  1. Supervised Training:
  • Loss Function: CrossEntropyLoss.
  • Optimizer: optim.Adam.
  2. Contrastive Learning:
  • Loss Function: the contrastive loss function implemented by the MoCo research team.
  • Optimizer: optim.SGD.

Could you clarify what you mean when you say reduce the loss across DDP workers?

I mean that it helps to think about data-parallel training mathematically. You are sharding your logical (global) batch along the batch dimension, so if your forward/backward pass is not directly parallelizable along that dimension, you may need some communication (e.g. an all-reduce) to recover the needed information across data-parallel workers.
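
To make this concrete, here is a small framework-free sketch with a made-up one-parameter regression problem (the data and the helper `mse_loss_and_grad` are purely illustrative). For a mean-reduced, per-sample loss and equal-sized shards, averaging the per-worker gradients reproduces the full-batch gradient, which is the averaging DDP performs during the backward pass. The loss value on each rank, however, is still local, so a curve logged from one rank alone is not the global loss unless you average it yourself (in PyTorch, for instance, with `torch.distributed.all_reduce`).

```python
def mse_loss_and_grad(w, xs, ys):
    """Mean squared error of y ~ w * x and its derivative with respect to w."""
    n = len(xs)
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    return loss, grad


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Single process, global batch of 4.
full_loss, full_grad = mse_loss_and_grad(w, xs, ys)

# Two simulated workers, local batch of 2 each.
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]
per_worker = [mse_loss_and_grad(w, sx, sy) for sx, sy in shards]

# Averaging across workers plays the role of the all-reduce.
avg_loss = sum(loss for loss, _ in per_worker) / len(per_worker)
avg_grad = sum(grad for _, grad in per_worker) / len(per_worker)

# The loss seen by worker 0 alone, i.e. what gets logged without a reduction.
rank0_loss = per_worker[0][0]
```

Here `avg_loss` and `avg_grad` agree with the single-process values, while `rank0_loss` does not, so an unreduced training curve can look different even when the training itself is correct.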

I am not familiar with the MoCo contrastive loss, but you can check whether it is parallelizable over the batch dimension. For example, if your loss function depends on the number of negative examples in the batch, and data parallelism now partitions the negative examples of the global batch across workers, then you need to think about what that implies for the loss.
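
A toy illustration of that concern, using a generic InfoNCE-style loss for a single query. This is not the MoCo implementation, and the similarity values and temperature are made up. With only the negatives that happen to live on one worker, the loss is systematically lower than with all negatives from the global batch, and no averaging of per-worker losses recovers the global value. One common remedy is to gather the keys from all workers before computing the loss (in PyTorch, for instance, with `torch.distributed.all_gather`).

```python
import math


def info_nce(pos_sim, neg_sims, temperature=0.2):
    """Generic InfoNCE-style loss for one query (illustrative, not the MoCo code)."""
    pos = math.exp(pos_sim / temperature)
    negs = sum(math.exp(s / temperature) for s in neg_sims)
    return -math.log(pos / (pos + negs))


pos_sim = 0.8
all_negs = [0.1, 0.3, -0.2, 0.5, 0.0, 0.4]  # negatives in the global batch
local_negs = all_negs[:2]                   # negatives visible to one worker

loss_global = info_nce(pos_sim, all_negs)
loss_local = info_nce(pos_sim, local_negs)
```

Whether this applies to your run depends on how the loss you are using collects its negatives, so it is worth checking that step in the code specifically.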