Hi, I followed this post: Distributed PyTorch Tutorial in order to run my model on two GPUs. However, when I run it I get the following message: "fused_weight_gradient_mlp_cuda module not found. gradient accumulation fusion with weight gradient computation disabled." It does not say whether this is a warning or an error, and the script does nothing afterwards, i.e. the model never starts training. Does anyone have an idea of what the issue might be?
fused_weight_gradient_mlp_cuda is a custom extension used in apex/Megatron and is optional when you build the apex repository in your setup.
I couldn't quickly find any reference to Megatron in the blog post, but it seems the author was using apex in 2019 for its automatic mixed-precision utilities (this is no longer necessary, as you should use torch.cuda.amp in newer PyTorch releases) and might be using these custom model/pipeline-parallel utilities now.
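For reference, a minimal torch.cuda.amp training step looks roughly like this. The autocast/GradScaler calls are the current PyTorch API; the toy model and data are purely illustrative, and the scaler/autocast are disabled automatically on CPU-only machines:

```python
import torch

# Toy model and optimizer purely for illustration
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
model.to(device)

# GradScaler rescales the loss to avoid fp16 gradient underflow;
# enabled=False makes it a no-op on CPU-only machines
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

x = torch.randn(4, 8, device=device)

optimizer.zero_grad()
# autocast runs the forward pass in mixed precision on CUDA;
# with enabled=False it is a plain no-op context manager
with torch.cuda.amp.autocast(enabled=use_cuda):
    loss = model(x).pow(2).mean()
scaler.scale(loss).backward()  # backward on the (possibly scaled) loss
scaler.step(optimizer)         # unscales gradients, then optimizer.step()
scaler.update()                # adjusts the scale factor for the next step
```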
@ptrblck thank you for clarifying this. Actually, I realized the reason my program is not running is the following. Before spawning the training code to the different nodes, there is this piece of code:
args.world_size = args.gpus * args.nodes
os.environ['MASTER_ADDR'] = args.MASTER_ADDRESS
os.environ['MASTER_PORT'] = '8088'
mp.spawn(training, nprocs=args.gpus, args=(args,))
which uses port 8088. When my code then initializes a visdom server on port 8097, it suddenly refuses to continue and stalls. Is there any workaround for this? I tried changing the port, but nothing happens.
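One workaround worth trying (a sketch, not from the tutorial itself) is to let the OS pick a free MASTER_PORT before spawning, so it cannot collide with the visdom port or any other service already bound on the machine. The find_free_port helper below is illustrative:

```python
import socket

def find_free_port() -> int:
    # Bind to port 0 so the OS assigns an unused ephemeral port,
    # then release the socket and reuse that port number for
    # MASTER_PORT. (There is a small race window in which another
    # process could grab the port, but in practice this is rare.)
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

if __name__ == "__main__":
    port = find_free_port()
    print(port)
    # Before spawning, instead of the hard-coded '8088':
    # os.environ['MASTER_PORT'] = str(port)
    # mp.spawn(training, nprocs=args.gpus, args=(args,))
```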