Hi, I followed this post: Distributed PyTorch Tutorial in order to run my model on two GPUs. However, when I run it I get the following message: "fused_weight_gradient_mlp_cuda module not found. gradient accumulation fusion with weight gradient computation disabled." It does not say whether this is a warning or an error, and the script does nothing afterwards, i.e. the model never starts training. Does anyone have an idea of what the issue might be?
fused_weight_gradient_mlp_cuda is a custom extension used in apex/Megatron and is optional when you build the apex repository in your setup.
I couldn't quickly find any reference to Megatron in the blog post, but it seems the author was using apex in 2019 for its automatic mixed-precision utilities (this is no longer necessary, as you should use torch.cuda.amp in newer PyTorch releases) and might be using these custom model/pipeline-parallel utilities now.
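For reference, a minimal torch.cuda.amp training step looks roughly like this. The autocast/GradScaler calls are the current PyTorch API; the toy model and data are purely illustrative, and the scaler/autocast are disabled automatically on CPU-only machines:

```python
import torch

# Toy model and optimizer purely for illustration
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
model.to(device)

# GradScaler rescales the loss to avoid fp16 gradient underflow;
# enabled=False makes it a no-op on CPU-only machines
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

x = torch.randn(4, 8, device=device)

optimizer.zero_grad()
# autocast runs the forward pass in mixed precision on CUDA;
# with enabled=False it is a plain no-op context manager
with torch.cuda.amp.autocast(enabled=use_cuda):
    loss = model(x).pow(2).mean()
scaler.scale(loss).backward()  # backward on the (possibly scaled) loss
scaler.step(optimizer)         # unscales gradients, then optimizer.step()
scaler.update()                # adjusts the scale factor for the next step
```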
@ptrblck thank you for clarifying this. Actually, I realized the reason my program is not running is the following. Before spawning the training code to the different nodes, there is this piece of code:
args.world_size = args.gpus * args.nodes
os.environ['MASTER_ADDR'] = args.MASTER_ADDRESS
os.environ['MASTER_PORT'] = '8088'
mp.spawn(training, nprocs=args.gpus, args=(args,))
which uses port 8088. When my code then initializes a visdom server on port 8097, it suddenly refuses to continue and stalls. Is there any workaround for this? I tried changing the port, but nothing happens.
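One workaround worth trying (a sketch, not from the tutorial itself) is to let the OS pick a free MASTER_PORT before spawning, so it cannot collide with the visdom port or any other service already bound on the machine. The find_free_port helper below is illustrative:

```python
import socket

def find_free_port() -> int:
    # Bind to port 0 so the OS assigns an unused ephemeral port,
    # then release the socket and reuse that port number for
    # MASTER_PORT. (There is a small race window in which another
    # process could grab the port, but in practice this is rare.)
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

if __name__ == "__main__":
    port = find_free_port()
    print(port)
    # Before spawning, instead of the hard-coded '8088':
    # os.environ['MASTER_PORT'] = str(port)
    # mp.spawn(training, nprocs=args.gpus, args=(args,))
```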