Seeking distributed async GPU guidance

I would like to train A3C or distributed DQN on the GPU with the new torch.distributed API. These algorithms essentially boil down to writing GPU Hogwild training correctly. Papers like Elastic SGD also need async GPU code to reproduce.

I used to work with TensorFlow's distributed mode, which has a whole collection of abstractions and wrappers for implementing async training. https://www.tensorflow.org/deploy/distributed
A decent implementation includes parameter servers and high-level managers that take care of gradient communication, parameter syncing, and shared Adam/Adagrad optimizers, for example.
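
Concretely, the kind of TF 1.x between-graph setup I mean looks roughly like this (just a sketch; the tiny linear model, host lists, and random batches are placeholders, not anything from a real project):

```python
import sys
import numpy as np
import tensorflow as tf

# placeholder cluster config; in a real run these come from flags or the environment
ps_hosts = ["localhost:2222"]
worker_hosts = ["localhost:2223", "localhost:2224"]
job_name, task_index = sys.argv[1], int(sys.argv[2])   # e.g. "worker 0"

cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()                                       # the PS just hosts the variables
else:
    # variables are placed on the PS task(s); compute ops stay on this worker
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 10])
        y = tf.placeholder(tf.float32, [None, 1])
        w = tf.get_variable("w", [10, 1])
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
        step = tf.train.get_or_create_global_step()
        train_op = tf.train.AdagradOptimizer(0.01).minimize(loss, global_step=step)

    # each worker applies its own gradients to the shared variables, asynchronously
    with tf.train.MonitoredTrainingSession(
            master=server.target, is_chief=(task_index == 0),
            hooks=[tf.train.StopAtStepHook(last_step=1000)]) as sess:
        while not sess.should_stop():
            xb = np.random.randn(32, 10).astype(np.float32)   # dummy batch
            yb = np.random.randn(32, 1).astype(np.float32)
            sess.run(train_op, feed_dict={x: xb, y: yb})
```

This is the level of "it just works" I'd like to approximate with torch.distributed.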

Unfortunately, I cannot find any official tutorials or example code that show how to write a basic GPU Hogwild setup with parameter servers. The torch.distributed primitives are too low-level for me to use correctly on my own. I looked at the source code of torch.nn.parallel.DistributedDataParallel, hoping to get some inspiration, but it's also too involved for me to understand and rewrite for my use case.

I understand that the distributed mode is very new. I'd really appreciate it if anyone could give some guidance on how to emulate TF's distributed semantics in PyTorch. For example, what are the main steps, and which MPI primitives should be used in each step? Ideally, I'd love to see some skeleton code. I can figure out the rest of the details by myself, but I need something to start with. Thanks in advance!


I don't believe it is possible to do async Hogwild training on the GPU. The point of that type of training is to exploit some of the advantages the CPU has over the GPU. You could do A3C-like batch training where the work is parallelized but the global model is updated synchronously, not asynchronously.

PS: there is no A3C GPU Hogwild training in TensorFlow or PyTorch on GitHub, to my knowledge.

Why isn't it possible? Each worker can send its gradient to the central parameter server without first waiting for the other workers and doing an all_reduce. The TensorFlow example code shows just that.

Ah, but see, what you just described is a queue, and the parameters are updated synchronously, not asynchronously.

Hogwild is lock-free training.

I think what you want, if you want to use the GPU, is something like batch-A3C, for which there are good examples here in PyTorch:

And in TensorFlow:

Maybe Hogwild isn’t the right term, but here’s what I want to achieve:

Take distributed DQN as an example. Each worker does the following repeatedly:

  1. Maintain a local copy of the global parameters.

  2. Interact with the Atari simulator, sample experience from the replay memory, and compute gradients on the GPU. The policy network can potentially be a big convnet, so the GPU accelerates this a lot.

  3. Send the gradients to the parameter server. The PS uses a shared Adagrad (or similar) optimizer to update the central parameter copy.

  4. Pull from the PS to update the local parameter copy.

This is the async GPU training I want to implement; a rough sketch of what I have in mind is below. There are many use cases outside RL too, like elastic SGD.
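
Purely as a sketch of that loop (assuming rank 0 plays the parameter server, Gloo send/recv over flattened CPU staging buffers for transport, and `make_model` / `compute_loss` as hypothetical placeholders for the Q-network and the simulator/replay step), something like:

```python
import torch
import torch.distributed as dist

def flatten(tensors):
    # concatenate a list of (possibly GPU) tensors into one flat CPU vector for transport
    return torch.cat([t.detach().cpu().contiguous().view(-1) for t in tensors])

def unflatten_into(flat, params):
    # copy a flat CPU vector back into the (possibly GPU) parameter tensors
    offset = 0
    for p in params:
        n = p.numel()
        p.data.copy_(flat[offset:offset + n].view_as(p))
        offset += n

def write_grads(flat, params):
    # copy a flat CPU gradient vector into each parameter's .grad
    offset = 0
    for p in params:
        n = p.numel()
        p.grad = flat[offset:offset + n].view_as(p).clone()
        offset += n

def run_ps(model, optimizer):
    # rank 0: apply gradient pushes one at a time, in arrival order
    params = list(model.parameters())
    grad_buf = torch.zeros(sum(p.numel() for p in params))
    while True:
        src = dist.recv(grad_buf)               # blocking; returns the sender's rank
        write_grads(grad_buf, params)
        optimizer.step()                        # the "shared Adagrad" lives only here
        dist.send(flatten(params), dst=src)     # reply with the fresh parameters

def run_worker(model):
    # ranks 1..N-1: compute gradients on the GPU, push them, pull fresh weights
    params = list(model.parameters())
    param_buf = torch.zeros(sum(p.numel() for p in params))
    while True:
        loss = compute_loss(model)              # placeholder: env step / replay batch
        model.zero_grad()
        loss.backward()
        dist.send(flatten([p.grad for p in params]), dst=0)  # push gradients
        dist.recv(param_buf, src=0)                          # pull parameters
        unflatten_into(param_buf, params)

def main(rank, world_size):
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:23456",
                            rank=rank, world_size=world_size)
    model = make_model()                        # placeholder: the Q-network / convnet
    flat = flatten(list(model.parameters()))
    dist.broadcast(flat, src=0)                 # start every replica from PS weights
    unflatten_into(flat, list(model.parameters()))
    if rank == 0:
        run_ps(model, torch.optim.Adagrad(model.parameters(), lr=1e-2))
    else:
        run_worker(model.cuda())
```

No worker ever waits for the other workers, only for its own turn at the PS, and the shared optimizer state lives on rank 0 alone. Whether this is the idiomatic way to compose the primitives is exactly what I'd like feedback on.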

I think the repos you mentioned solve a different problem. I don't want to batch the experience collected from the game simulators and do the computation on only one GPU. I want to send the gradients over to parameter servers asynchronously, like what TensorFlow's code does conceptually.

The project I’m working on is not exactly A3C or DQN, so I need a more general async GPU skeleton code to work with. But thank you all the same for the links!

I believe ELF does this with the GPU as well. I believe it's the same model; the same people made both. But when updating a shared global model on the GPU, I believe there need to be locks, so the updates end up synchronous, because only atomic operations can be done lock-free on the GPU.

Thanks for the links. ELF is not written in PyTorch and is quite heavyweight. It'd be much more illuminating to see minimal example code with torch.distributed that reproduces at least part of the TF parameter server + async training logic. Furthermore, such async GPU skeleton code could be reused over and over again in many cases.

There is a PyTorch model in the RL PyTorch folder. The reason I say you want batch-A3C is that if you update the parameters individually on the GPU, the lock acquiring and releasing will slow things down so much that it's no faster than doing it on the CPU. Only if you do a whole bunch of updates together, and get the benefit of the GPU's much faster matrix computation, will it be beneficial at all.

Doing individual updates on the GPU will be slower in almost all cases, unless your model is extremely large.

For DQN that's not the case, because the updates are already batched. The GPU makes a big speed difference over the CPU on a single thread in my experiments.

I'm just looking for a way to reproduce TF's async distributed semantics in torch.distributed rather than coming up with workarounds. It's good to understand how to put those primitives together correctly, so that I have more control over the communication. There isn't a single tutorial showing how to use the torch.distributed primitives in real settings. The ImageNet example (released with v0.2) uses the nn.parallel.DistributedDataParallel wrapper, whose internals are quite obscure.

But when training these models the bottleneck is data. The model needs to perform an action to get the next values and then update. That is a small amount of computation, so the speed gained by doing it on the GPU is lost to the slower sharing of updates on the GPU compared to the CPU. Only by collecting a bunch of these updates and applying them all at once, avoiding all the slow individual shared updates, will it be beneficial. So yes, you can do that, but why do it if it will be no faster is what I'm trying to express. :wink:


I think if each DQN learner's batch is big enough, the overall speed will be much faster even if the learners lose some cycles sending gradients to and downloading updates from the parameter server. That's what DeepMind did in their GORILA paper ("Algorithm 1" on page 5).

I can imagine how the code would be written in TensorFlow's async framework, but it's not obvious how to translate that into PyTorch's distributed primitives.

Hey, did you ever write the distributed PyTorch code? I am also trying to do the same, and apart from the ImageNet example, I am not able to find an example of distributed machine learning in PyTorch.

Not yet. Even if I had, my code would very likely be suboptimal, since I'm unfamiliar with distributed computing. That's why an official tutorial or repo of examples would really help.


Not sure if this is what you want to do, but there is a torch.distributed example here:
https://ptorch.com/news/40.html
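
In case it helps, a minimal torch.distributed point-to-point example along those lines (just a sketch, assuming the Gloo backend, a hard-coded TCP init address, and the rank passed through a RANK environment variable):

```python
import os
import torch
import torch.distributed as dist

def main():
    # launch this script twice, e.g. with RANK=0 and RANK=1
    rank = int(os.environ["RANK"])
    dist.init_process_group(backend="gloo",
                            init_method="tcp://127.0.0.1:23456",
                            rank=rank, world_size=2)
    tensor = torch.zeros(4)
    if rank == 0:
        tensor += 1.0
        dist.send(tensor, dst=1)    # blocking send to rank 1
    else:
        dist.recv(tensor, src=0)    # blocking receive from rank 0
    print("rank", rank, "has", tensor)

if __name__ == "__main__":
    main()
```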