sangyx
(Yunxin Sang)
September 24, 2020, 2:36pm
1
Hi, my code is here: https://github.com/sangyx/dgl/tree/master/examples/pytorch/GATNE-T/src
Running main_sparse.py gives an accuracy of 0.94, but with main_sparse_multi_gpu.py the accuracy never gets above 0.85, even when I set gpu=0.
Is there an error in my code?
My environment is PyTorch 1.6 and dgl-cu10.2 0.52. You can find example test data at https://github.com/sangyx/dgl/tree/master/examples/pytorch/GATNE-T
mrshenli
(Shen Li)
October 5, 2020, 5:26pm
2
Hey @sangyx, when using DDP, you might need to tune the batch size and learning rate a bit. See the discussion below:
Assume we have two nodes, node-A and node-B, each with 4 GPUs (i.e. ngpu_per_node=4). We set args.batch_size = 256 on each node, meaning that we want each node to process 256 images in each forward pass.
(1) If we use DistributedDataParallel in 1-GPU-per-process mode, should we manually divide the batch size by ngpu_per_node in torch.utils.data.DataLoader: torch.utils.data.DataLoader(batch_size = args.batch_size / 4) (the approach used in the official PyTorch ImageNet example)? In my original opinion, I think Distrib…
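To make the batch-size arithmetic in the quoted question concrete, here is a minimal sketch in plain Python. The function names are hypothetical and not taken from the linked repository. The linear learning-rate scaling shown is only a common heuristic and an assumption here, not something the quoted discussion prescribes, so treat it as a starting point for tuning.

```python
def per_process_batch_size(node_batch_size: int, ngpu_per_node: int) -> int:
    """Batch size each DDP process should pass to its DataLoader
    when running one process per GPU (hypothetical helper)."""
    if node_batch_size % ngpu_per_node != 0:
        raise ValueError("node_batch_size must be divisible by ngpu_per_node")
    # Integer division: DataLoader expects an int, so avoid `/`.
    return node_batch_size // ngpu_per_node


def global_batch_size(per_proc: int, ngpu_per_node: int, num_nodes: int) -> int:
    """Number of samples contributing to one optimizer step across all processes."""
    return per_proc * ngpu_per_node * num_nodes


def linearly_scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling heuristic (an assumption, not a rule):
    scale the learning rate in proportion to the effective batch size."""
    return base_lr * new_batch / base_batch


if __name__ == "__main__":
    # Values from the quoted example: 2 nodes, 4 GPUs each, 256 images per node.
    per_proc = per_process_batch_size(256, 4)
    total = global_batch_size(per_proc, ngpu_per_node=4, num_nodes=2)
    print(per_proc)  # 64
    print(total)     # 512
    # Hypothetical single-GPU baseline: lr=0.1 at batch size 256.
    print(linearly_scaled_lr(0.1, base_batch=256, new_batch=total))
```

In a real script, each process would pass `per_proc` as `batch_size` to `torch.utils.data.DataLoader`, typically together with a `torch.utils.data.distributed.DistributedSampler` so that the processes see disjoint shards of the data.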