Distributed training issue with the PyTorch estimator in AWS SageMaker

I'm implementing distributed training on AWS SageMaker using the PyTorch estimator.
My code is based on the example from this link.

Here is the PyTorch estimator configuration:
estimator = PyTorch(
    ...,  # other arguments omitted
    framework_version='1.13.1',
    instance_count=1,
    py_version='py39',
    instance_type='ml.g5.12xlarge',
    distribution={'torch_distributed': {'enabled': True}},
)
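As far as I understand, enabling torch_distributed makes SageMaker launch the entry script with torchrun, one process per GPU, so the usual torchrun environment variables should be set in each process. A quick sanity check at the top of the script would be something like this (a sketch):

import os

# Assumption: torchrun sets these per process when torch_distributed is enabled
print('RANK:', os.environ.get('RANK'),
      'LOCAL_RANK:', os.environ.get('LOCAL_RANK'),
      'WORLD_SIZE:', os.environ.get('WORLD_SIZE'))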

The training hangs on the line that wraps generator in DistributedDataParallel.
Here is the relevant snippet:

if args.distributed:
    generator = nn.parallel.DistributedDataParallel(
        generator,
        device_ids=[args.local_rank],
        output_device=args.local_rank,
        broadcast_buffers=False,
        find_unused_parameters=True,
    )

    discriminator = nn.parallel.DistributedDataParallel(
        discriminator,
        device_ids=[args.local_rank],
        output_device=args.local_rank,
        broadcast_buffers=False,
        find_unused_parameters=True,
    )
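For reference, the setup I believe is expected before this wrapping (process group initialization and device placement) looks roughly like the following; this is only a sketch, not my exact script:

import torch
import torch.distributed as dist

# Sketch of the usual pre-DDP setup; assumes args.local_rank identifies this process's GPU
dist.init_process_group(backend='nccl')         # must run in every process before wrapping
torch.cuda.set_device(args.local_rank)          # bind this process to its own GPU
device = torch.device('cuda', args.local_rank)

generator = generator.to(device)                # move both models to that GPU
discriminator = discriminator.to(device)        # before wrapping them in DDP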

On a 4-GPU instance, local_rank can range from 0 to 3 (the number of GPUs minus 1):

args.local_rank = 2  # between 0 and 3
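In case it is relevant: my understanding is that torchrun gives every process its own value via the LOCAL_RANK environment variable, so reading it per process would look like this (sketch):

import os

# Sketch: each torchrun-launched process sees its own LOCAL_RANK (0-3 on ml.g5.12xlarge)
args.local_rank = int(os.environ['LOCAL_RANK'])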

The following runtime error is generated:

RuntimeError: DDP expects same model across all ranks, but rank 2 has 256 params, while rank 0 has inconsistent 0 params.

Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

I'm using image data stored in LMDB (via the Python lmdb library) for a generative model,
and I load it in the same way as I do for non-distributed training.
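If it helps, I have not added anything distributed-specific to the data loading. A typical DistributedSampler setup would look roughly like this (a sketch; lmdb_dataset, the batch size, and num_epochs are placeholders for my actual values):

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Sketch only: 'lmdb_dataset' is a placeholder for my LMDB-backed Dataset
sampler = DistributedSampler(lmdb_dataset, shuffle=True)
loader = DataLoader(lmdb_dataset, batch_size=16, sampler=sampler, num_workers=4)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffle across ranks each epoch
    for images in loader:
        pass  # training step goes here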

I would greatly appreciate any insights or suggestions on using nn.parallel.DistributedDataParallel() with a generative model.
How can I resolve this?