RuntimeError: params[6] in this process with sizes [128, 64, 1, 1] appears not to match strides of the same param in process 0

I am using torch.nn.parallel.DistributedDataParallel to train a vision model, but I get the following error:

Using native Torch DistributedDataParallel.
Traceback (most recent call last):
  File "/afs/crc.nd.edu/user/y/ypeng4/Neighborhood-Attention-Transformer/classification/train_original.py", line 1032, in <module>
    main(args)
  File "/afs/crc.nd.edu/user/y/ypeng4/Neighborhood-Attention-Transformer/classification/train_original.py", line 521, in main
    model = NativeDDP(model, device_ids=[args.local_rank], broadcast_buffers=not args.no_ddp_bb)
  File "/scratch365/ypeng4/software/bin/anaconda/envs/python310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 674, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/scratch365/ypeng4/software/bin/anaconda/envs/python310/lib/python3.10/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: params[6] in this process with sizes [128, 64, 1, 1] appears not to match strides of the same param in process 0.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1833 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 1834) of binary: /scratch365/ypeng4/software/bin/anaconda/envs/python310/bin/python
Traceback (most recent call last):
  File "/scratch365/ypeng4/software/bin/anaconda/envs/python310/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/scratch365/ypeng4/software/bin/anaconda/envs/python310/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/scratch365/ypeng4/software/bin/anaconda/envs/python310/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/scratch365/ypeng4/software/bin/anaconda/envs/python310/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/scratch365/ypeng4/software/bin/anaconda/envs/python310/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/scratch365/ypeng4/software/bin/anaconda/envs/python310/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/scratch365/ypeng4/software/bin/anaconda/envs/python310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/scratch365/ypeng4/software/bin/anaconda/envs/python310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

However, when I switched from native PyTorch DDP to

from apex.parallel import DistributedDataParallel as ApexDDP

everything works fine.

My model uses grouped convolutions; when I replace the grouped convolutions with regular convolutions, native PyTorch DDP also works fine.

How can I fix this error so that I can train with native PyTorch DDP?

I am using PyTorch 2.0.0+cu117.
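One thing I noticed: the verification step that fails (`_verify_params_across_processes`) compares parameter strides as well as sizes, so a mismatch could come from a parameter being non-contiguous on some ranks only (e.g. a grouped-conv weight after a channels_last conversion). Assuming that is the cause here, would it be a reasonable workaround to force every parameter contiguous before wrapping the model in DDP? A minimal sketch (`make_params_contiguous` is my own hypothetical helper, not a PyTorch API):

```python
import torch

def make_params_contiguous(model: torch.nn.Module) -> torch.nn.Module:
    # DDP verifies that parameter shapes *and strides* match across ranks.
    # If a weight ends up non-contiguous on some ranks only (e.g. after a
    # channels_last conversion), the stride check fails. Forcing every
    # parameter to a contiguous layout before wrapping the model in DDP
    # makes the strides identical on all ranks.
    with torch.no_grad():
        for p in model.parameters():
            if not p.is_contiguous():
                p.data = p.data.contiguous()
    return model
```

This would be applied right before the `NativeDDP(model, ...)` call.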

DDP from apex is deprecated, as it landed in PyTorch as a native utility a long time ago, so don't use or depend on it.
Could you post a minimal and executable code snippet reproducing the issue?
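Here is a minimal sketch of the relevant structure (hypothetical, not my actual training script; the layer sizes are chosen so the second conv's weight has sizes [128, 64, 1, 1], matching params[6] in the error message):

```python
# repro.py -- launch with: torchrun --nproc_per_node=2 repro.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def build_model() -> torch.nn.Module:
    # A grouped (depthwise) conv followed by a 1x1 conv whose weight
    # has sizes [128, 64, 1, 1], like params[6] in the error message.
    return torch.nn.Sequential(
        torch.nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64),
        torch.nn.Conv2d(64, 128, kernel_size=1),
    )


def main() -> None:
    dist.init_process_group("nccl")
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)
    model = build_model().cuda(rank)
    # The shape/stride verification runs inside the DDP constructor.
    model = DDP(model, device_ids=[rank])
    out = model(torch.randn(2, 64, 32, 32, device=f"cuda:{rank}"))
    out.sum().backward()
    dist.destroy_process_group()


if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    main()
```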