The accuracy of Hogwild! on multiple GPUs drops dramatically

The Hogwild! example gives 99% accuracy, but when I extend it to a multi-GPU version, it gives 11% accuracy.

The difference is as follows:

# main.py

    model = Net()  # not moved to a specific device any more
    model.share_memory()

    processes = []
    for rank in range(args.num_processes):
        local_device = torch.device(rank%2)
        p = mp.Process(target=train, args=(rank, args, model, local_device,
                                           dataset1, kwargs))
        
        p.start()
        processes.append(p)

And I move the model to its device in the subprocesses.

# train.py

    def train(rank, args, model, device, dataset, dataloader_kwargs):
        model = model.to(device)  # move to specific device in the sub-process
        torch.manual_seed(args.seed + rank)

        train_loader = torch.utils.data.DataLoader(dataset, **dataloader_kwargs)

        optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)
        for epoch in range(1, args.epochs + 1):
            train_epoch(epoch, args, model, device, train_loader, optimizer)

It seems the model is not shared any more.

  1. Where are the mistakes?
  2. What are the correct steps?

Answers to your questions:

  1. Your model is indeed put on multiple devices, but there is no synchronization of gradients during training, which is likely what causes the accuracy loss.
  2. Look into DDP (Distributed Data Parallel — PyTorch 1.9.1 documentation), which provides a framework for distributed training across multiple GPUs and multiple machines.
    Example modifications of your code to fit DDP (I did not test locally, may have typos):
# main.py

    model = Net()  # not moved to a specific device any more

    processes = []
    for rank in range(args.num_processes):
        p = mp.Process(target=train, args=(rank, args, model, dataset1, kwargs))
        p.start()
        processes.append(p)

# train.py

    def train(rank, args, model, dataset, dataloader_kwargs):
        # each worker joins the process group itself; world_size must match the number of processes
        dist.init_process_group("gloo", rank=rank, world_size=2)
        ddp_model = DDP(model.to(rank), device_ids=[rank])
        torch.manual_seed(args.seed + rank)

        train_loader = torch.utils.data.DataLoader(dataset, **dataloader_kwargs)

        optimizer = optim.SGD(ddp_model.parameters(), lr=args.lr, momentum=args.momentum)
        for epoch in range(1, args.epochs + 1):
            train_epoch(epoch, args, ddp_model, rank, train_loader, optimizer)
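
One detail the sketch above glosses over: with the default env:// rendezvous, dist.init_process_group needs MASTER_ADDR and MASTER_PORT to be set in every process before the call, and it blocks until world_size processes have joined. An untested helper (the address and port are arbitrary choices):

    import os
    import torch.distributed as dist

    def setup_process_group(rank, world_size):
        # the default "env://" init method reads these two variables in each worker
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")  # any free port
        # "gloo" works on CPU and GPU; "nccl" is the usual choice for GPU-only training
        dist.init_process_group("gloo", rank=rank, world_size=world_size)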

@H-Huang Hi, thank you for your kind help. However, it got stuck at dist.init_process_group('nccl', rank=rank, world_size=2).

CMIIW, but I think DDP is not suitable for Hogwild training?

Because DDP synchronizes the gradients across all nodes/processes during the backward pass using all-reduce, all of the models in the different processes will have the same gradients and end up with the same weights after every optimizer step. If you do want to use DDP, I think you should create the model inside the train function rather than pass it in when creating the processes in the main function; in the latter case the model tensors are treated as shared memory, and it is a bit pointless for all processes to update shared memory with the same values.
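
A rough, untested sketch of that suggestion, reusing Net, train_epoch, args and the other names from the snippets above:

    import torch
    import torch.distributed as dist
    import torch.optim as optim
    from torch.nn.parallel import DistributedDataParallel as DDP

    # each worker builds its own model; DDP keeps the replicas in sync via all-reduce,
    # so there is no need for share_memory() or for passing a model from the main process
    def train(rank, args, dataset, dataloader_kwargs):
        dist.init_process_group("nccl", rank=rank, world_size=args.num_processes)
        model = Net().to(rank)                      # local, per-process model
        ddp_model = DDP(model, device_ids=[rank])
        train_loader = torch.utils.data.DataLoader(dataset, **dataloader_kwargs)
        optimizer = optim.SGD(ddp_model.parameters(), lr=args.lr, momentum=args.momentum)
        for epoch in range(1, args.epochs + 1):
            train_epoch(epoch, args, ddp_model, rank, train_loader, optimizer)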

Meanwhile, the idea of Hogwild! is that sparse, unsynchronized gradient updates (with multiple processes potentially reading and writing the shared memory at the same time) can still converge and improve performance. Hence, it is expected that different processes update the model with different gradients (unlike what happens when using DDP).
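
For contrast, the classic CPU Hogwild! pattern (roughly what the original example does) keeps a single model in shared memory and lets every process step its own optimizer directly on the shared parameters, with no synchronization at all. A simplified, untested sketch, reusing Net, dataset1, args and kwargs from your snippets:

    import torch
    import torch.multiprocessing as mp
    import torch.nn.functional as F
    import torch.optim as optim

    def hogwild_worker(rank, args, model, dataset, dataloader_kwargs):
        torch.manual_seed(args.seed + rank)
        loader = torch.utils.data.DataLoader(dataset, **dataloader_kwargs)
        # every worker builds its own optimizer, but they all step the *same* shared parameters
        optimizer = optim.SGD(model.parameters(), lr=args.lr)
        for data, target in loader:
            optimizer.zero_grad()
            loss = F.nll_loss(model(data), target)
            loss.backward()
            optimizer.step()  # unsynchronized, possibly overlapping updates -- that is Hogwild!

    if __name__ == "__main__":
        model = Net()          # stays on the CPU
        model.share_memory()   # put the parameters into shared memory
        processes = [mp.Process(target=hogwild_worker, args=(rank, args, model, dataset1, kwargs))
                     for rank in range(args.num_processes)]
        for p in processes:
            p.start()
        for p in processes:
            p.join()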

To reply to the main question:

  • I don’t think you can just write model = model.to(device), because all of the processes are supposed to be sharing the same model?
  • What dataset are you using? And how do you implement the train_epoch function? I’m not sure why there’s such a huge drop in accuracy, but do you think it roughly follows the experiments in the original Hogwild! paper?
  • I believe they only parallelize across multiple CPU cores in the original Hogwild! paper (but I may misremember this). If you want to use multiple GPUs, I think you should try another workaround, like using a parameter server, or keeping multiple local copies of the model (one per process) that occasionally get synced with the one in shared memory (see the sketch after this list).
  • If you only want to increase the accuracy, it may be better to go with DDP, since synchronous training in general has a better track record (it is easier to train) compared to Hogwild and other styles of async training.
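
A rough, untested sketch of the “multiple local copies occasionally synced with the shared-memory model” idea from the list above (train_hogwild_gpu and sync_every are made-up names, not from any library):

    import copy
    import torch
    import torch.nn.functional as F
    import torch.optim as optim

    def train_hogwild_gpu(rank, args, shared_model, device, train_loader, sync_every=1):
        local_model = copy.deepcopy(shared_model).to(device)           # private replica on this GPU
        shared_opt = optim.SGD(shared_model.parameters(), lr=args.lr)  # steps the shared CPU weights
        for step, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)
            local_model.zero_grad()
            loss = F.nll_loss(local_model(data), target)
            loss.backward()
            # write this worker's gradients onto the shared parameters (no locks, Hogwild-style)
            for shared_p, local_p in zip(shared_model.parameters(), local_model.parameters()):
                shared_p.grad = local_p.grad.detach().cpu()
            shared_opt.step()
            if step % sync_every == 0:                                 # occasionally refresh the replica
                with torch.no_grad():
                    for shared_p, local_p in zip(shared_model.parameters(), local_model.parameters()):
                        local_p.copy_(shared_p)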

@stevenwjy Hi, thank you so much for your kind reply. The code is from this repo. I am trying to extend it to a multi-GPU version. I think my modification is incorrect too (as @H-Huang pointed out). However, I don’t know what the proper way is.

    model = model.to(device)  # move to specific device in the sub-process

Hmm… I’m actually not too familiar with this, but I assume that in the beginning your model was in shared memory, and then you moved it to the “device” – which I guess means the new model is no longer shared across processes?
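
A tiny experiment (assuming a CUDA device is available) that seems to confirm this: module.to(device) swaps the parameters over to new CUDA storage, so updates made on the GPU side never reach the shared CPU tensors that the other processes see:

    import torch
    import torch.nn as nn

    model = nn.Linear(4, 2)
    model.share_memory()
    shared_weight = model.weight.data           # the CPU tensor that actually sits in shared memory
    print(shared_weight.is_shared())            # True

    model.to(torch.device("cuda:0"))            # what model = model.to(device) does in the subprocess
    with torch.no_grad():
        model.weight.add_(1.0)                  # "training" now only touches the new CUDA storage

    print(torch.allclose(shared_weight, model.weight.cpu()))  # False: the shared copy is untouched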