mp.spawn() doesn't parallelize DDP on the HPC

Hi,

I was following the PyTorch tutorial on DDP and ran into what looks like a logging issue on the HPC. My code is the same as the GitHub example here: examples/distributed/ddp-tutorial-series/multigpu.py at main · pytorch/examples · GitHub

The job was submitted to the HPC with sbatch using the following configuration:

#!/bin/bash
#SBATCH --job-name=testing           # Job name
#SBATCH --output=output.log          # Output file name
#SBATCH --error=error.log            # Error file name
#SBATCH --partition=gpu              # Partition 
#SBATCH --gres=gpu:4                 # GPU resources, requesting 4 GPUs
#SBATCH --ntasks=1                   # Number of tasks or processes
#SBATCH --nodes=1                    # Number of nodes

# Load necessary modules or activate virtual environment
module load cuda-toolkit/11.6.2
module load python/3.8.6

# Actual command to be executed
python torch_testing.py
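
For context, the launch pattern in the tutorial script boils down to a single Python process that spawns one worker per GPU (a trimmed-down sketch; the real multigpu.py also passes the epoch/checkpoint arguments through args):

import os
import torch
import torch.multiprocessing as mp
from torch.distributed import init_process_group, destroy_process_group

def main(rank, world_size):
    # mp.spawn passes the process index as the first argument; one process per GPU
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # ... build the model, wrap it in DDP, and run the training loop as in the tutorial ...
    destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()   # 4 with --gres=gpu:4
    mp.spawn(main, args=(world_size,), nprocs=world_size)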

The problem shows up in output.log: mp.spawn() appears to run all the epochs on one GPU before moving on to the next GPU.

[GPU2] Epoch 0 | Batchsize: 64 | Steps: 8
[GPU2] Epoch 1 | Batchsize: 64 | Steps: 8
[GPU2] Epoch 2 | Batchsize: 64 | Steps: 8
[GPU2] Epoch 3 | Batchsize: 64 | Steps: 8
[GPU2] Epoch 4 | Batchsize: 64 | Steps: 8
[GPU2] Epoch 5 | Batchsize: 64 | Steps: 8
[GPU2] Epoch 6 | Batchsize: 64 | Steps: 8
[GPU2] Epoch 7 | Batchsize: 64 | Steps: 8
[GPU2] Epoch 8 | Batchsize: 64 | Steps: 8
[GPU2] Epoch 9 | Batchsize: 64 | Steps: 8
[GPU3] Epoch 0 | Batchsize: 64 | Steps: 8
[GPU3] Epoch 1 | Batchsize: 64 | Steps: 8
[GPU3] Epoch 2 | Batchsize: 64 | Steps: 8
[GPU3] Epoch 3 | Batchsize: 64 | Steps: 8
[GPU3] Epoch 4 | Batchsize: 64 | Steps: 8
[GPU3] Epoch 5 | Batchsize: 64 | Steps: 8
[GPU3] Epoch 6 | Batchsize: 64 | Steps: 8
[GPU3] Epoch 7 | Batchsize: 64 | Steps: 8
[GPU3] Epoch 8 | Batchsize: 64 | Steps: 8
[GPU3] Epoch 9 | Batchsize: 64 | Steps: 8
[GPU1] Epoch 0 | Batchsize: 64 | Steps: 8
[GPU1] Epoch 1 | Batchsize: 64 | Steps: 8
[GPU1] Epoch 2 | Batchsize: 64 | Steps: 8
[GPU1] Epoch 3 | Batchsize: 64 | Steps: 8
[GPU1] Epoch 4 | Batchsize: 64 | Steps: 8
[GPU1] Epoch 5 | Batchsize: 64 | Steps: 8
[GPU1] Epoch 6 | Batchsize: 64 | Steps: 8
[GPU1] Epoch 7 | Batchsize: 64 | Steps: 8
[GPU1] Epoch 8 | Batchsize: 64 | Steps: 8
[GPU1] Epoch 9 | Batchsize: 64 | Steps: 8
[GPU0] Epoch 0 | Batchsize: 64 | Steps: 8
Epoch 0 | Training checkpoint saved at checkpoint.pt
[GPU0] Epoch 1 | Batchsize: 64 | Steps: 8
[GPU0] Epoch 2 | Batchsize: 64 | Steps: 8
Epoch 2 | Training checkpoint saved at checkpoint.pt
[GPU0] Epoch 3 | Batchsize: 64 | Steps: 8
[GPU0] Epoch 4 | Batchsize: 64 | Steps: 8
Epoch 4 | Training checkpoint saved at checkpoint.pt
[GPU0] Epoch 5 | Batchsize: 64 | Steps: 8
[GPU0] Epoch 6 | Batchsize: 64 | Steps: 8
Epoch 6 | Training checkpoint saved at checkpoint.pt
[GPU0] Epoch 7 | Batchsize: 64 | Steps: 8
[GPU0] Epoch 8 | Batchsize: 64 | Steps: 8
Epoch 8 | Training checkpoint saved at checkpoint.pt
[GPU0] Epoch 9 | Batchsize: 64 | Steps: 8

I set the code from the GitHub example to save a checkpoint every 2 epochs and to run for a total of 10 epochs.

Any insight into why mp.spawn() behaves this way, or a way to fix it on the HPC, would be greatly appreciated.

I would guess the logs are just being serialized for some reason and that DDP itself is still working. You could write the system time into each log line to check whether your setup is merely serializing the output while the GPUs actually run concurrently.
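
For example, something like this in place of the plain print inside the training loop (assuming the names from the tutorial's Trainer class, e.g. self.gpu_id, epoch, and b_sz):

import time

# timestamp each line and flush immediately, so it lands in output.log
# as it happens rather than when the process exits
print(f"{time.strftime('%H:%M:%S')} [GPU{self.gpu_id}] Epoch {epoch} "
      f"| Batchsize: {b_sz} | Steps: {len(self.train_data)}", flush=True)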


Will try that and update.


Update: Indeed the output.log seems to be serialized.

Thanks for checking. In this case you might want to sort the logs by timestamp and GPU ID to follow the real progress (or reduce the losses from all GPUs if you want a single loss value per epoch).
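
Something along these lines inside the epoch loop would give one averaged value (just a sketch; loss_sum stands for whatever per-rank scalar you accumulate, and gpu_id is the rank as in the tutorial):

import torch
import torch.distributed as dist

# sum the per-rank value over all processes, then divide by the number of GPUs
loss_tensor = torch.tensor([loss_sum], device=f"cuda:{gpu_id}")
dist.all_reduce(loss_tensor, op=dist.ReduceOp.SUM)
loss_tensor /= dist.get_world_size()
if gpu_id == 0:
    print(f"Epoch {epoch} | mean loss over all GPUs: {loss_tensor.item():.4f}", flush=True)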
