Asynchronous parameter updating?

The background is the A3C algorithm, where many worker threads share common network parameters and common RMSProp state, with each thread holding its own gradParameters. Periodically, each worker thread updates the shared parameters using the shared RMSProp state and its own gradParameters in a lock-free, asynchronous way.

Previously, in Torch 7, this was rather easy to do with the threads and optim libraries:

-- in main thread: shared parameters
params, _ = sharedNet:getParameters()

-- in worker thread: its own gradParameters
tNet = sharedNet:clone()
_, gradParams = tNet:getParameters()

-- in worker thread: stuff

-- in worker thread: updating shared parameters with its own gradParameters
function feval() return nil, gradParams end
optim.rmsprop(feval, params, sharedStates)

But I don’t see an obvious way to do the same thing with PyTorch, because now the parameters and their gradients are tied together under the nn.Parameter class… Any suggestions? (I found the mnist_hogwild.py example, but its updating details differ from what I described above.)

Thanks in advance!


I have already implemented A3C in PyTorch, and it works just fine. When you get a copy with everything shared in the subprocess, just do this to break the gradient sharing, and then use the optimizer as you normally would:

# break the gradient sharing: give this process its own grad storage
for param in model.parameters():
    param.grad.data = param.grad.data.clone()

This is also covered in the notes, which I encourage you to read.
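Put together, the whole pattern looks roughly like this (a minimal sketch with a placeholder model and loss, not actual A3C code):

import torch
import torch.nn as nn
import torch.optim as optim
import torch.multiprocessing as mp

def train(shared_model):
    # break the gradient sharing: give this process its own grad storage
    for param in shared_model.parameters():
        if param.grad is None:
            param.grad = torch.zeros_like(param.data)
        param.grad.data = param.grad.data.clone()

    # the optimizer steps directly on the shared parameters (Hogwild-style)
    optimizer = optim.RMSprop(shared_model.parameters(), lr=1e-3)
    for _ in range(100):
        optimizer.zero_grad()
        loss = shared_model(torch.randn(8, 10)).pow(2).mean()  # placeholder loss
        loss.backward()
        optimizer.step()  # lock-free update of the shared parameters

if __name__ == '__main__':
    model = nn.Linear(10, 1)   # placeholder model
    model.share_memory()       # move the parameters into shared memory
    workers = [mp.Process(target=train, args=(model,)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()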

Hi @apaszke thanks for the code!

As you mentioned your private implementation of A3C, I have one more simple question: according to the paper, A3C only periodically synchronizes the local network parameters with the shared network parameters (in contrast to asynchronous n-step Q-learning, where the network parameters are always synchronized). To do this, did you write code like:

# after every several steps (e.g., 5 or 20)
for t_param, shared_param in zip(t_model.parameters(), shared_model.parameters()):
    t_param.data.copy_(shared_param.data)

Or did you find it not critical to accuracy in your implementation? (Previously, I strictly followed the paper and could reproduce the scoring curve for Breakout shown in the paper’s figure.)

Yes, exactly. I used to have such code, but then I started distributing only state_dicts and now it looks more like this:

t_model.load_state_dict(shared_state_dict)

As said before, this doesn’t make the t_model parameters shared, but only copies the content of shared_state_dict.

But your solution is valid too. Of course you have to apply updates to shared_model as well.
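In context, the periodic sync might look roughly like this (a sketch; the models and step count are placeholders):

import torch.nn as nn

shared_model = nn.Linear(4, 2)   # placeholder models
t_model = nn.Linear(4, 2)

SYNC_EVERY = 20                  # e.g. every 5 or 20 steps, as in the paper
for step in range(1000):
    if step % SYNC_EVERY == 0:
        # copies the shared parameters into t_model; the storages stay separate
        t_model.load_state_dict(shared_model.state_dict())
    # ... roll out with t_model, backprop, then apply the grads to shared_model ...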

Great, thanks so much @apaszke! load_state_dict looks much better!

@apaszke adding an A3C implementation to the examples repo would be of great help. Right now, I couldn’t find any examples using state_dicts to share parameters.

You don’t need to specifically use state_dicts; sharing models is fine too. Did you look in the notes? We’ll probably add A3C to the examples sooner or later, but I can’t promise when.

I implemented A3C following the Hogwild example:

It converges on PongDeterministic-v3 in 10 minutes with 16 threads. However, it works poorly on Breakout.

Could you please take a look? I wonder whether I got asynchronous updates in PyTorch right.

I don’t think I’d recommend reusing the same grad tensors in multiple Variables, but apart from that I can’t see anything wrong at a glance.

The problem was in the architecture that I used initially; it seems to work on Breakout now.

Could reusing the same grad tensor cause some problems?

I don’t think there’s anything that could go wrong at the moment, but I can’t guarantee that won’t change. The official position is that you should never have two Variables sharing the same data, and var.grad is a Variable.

Hi Ilya, @Ilya_Kostrikov I’m experimenting with your wonderful A3C implementation - I’m porting over a custom environment (basically black-box optimization) I’ve got working in TensorFlow. I keep running into this error, though:

AttributeError: 'float' object has no attribute 'backward'

Any guesses what’s causing it? Apart from my custom environment, the only differences from your code are that I’ve removed the conv layers and used a 2-layer LSTM - which is the model I’m using in TensorFlow.

As a performance improvement, have you tried concatenating the previous action, the previous reward, and a timestep counter onto the end of the state as 3 scalars? I noticed a significant improvement in my TF implementation when I do this - see the sketch below.
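Concretely, the augmentation I mean looks roughly like this (the shapes and values are just for illustration):

import torch

state = torch.rand(16)                        # raw observation features
prev_action, prev_reward, t = 1.0, 0.5, 42.0  # the 3 extra scalars
extras = torch.tensor([prev_action, prev_reward, t])
augmented = torch.cat([state, extras])        # 19-dim input fed to the network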

@apaszke - would it be possible to have a look at your A3C implementation? I’ve spent nearly a whole day trying to debug mine.

Thanks a lot for your help 🙂


@AjayTalati my implementation is part of a larger project and is a bit specific to it, sorry. The problem you’re facing is that you think you’re working with a Variable, while you actually have a float object. If you upload the code somewhere and let me run it, I’ll tell you where that happens.
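One common way this arises (an illustrative guess, not a diagnosis of your actual code) is accumulating the loss from plain Python floats, so the final value never becomes a Variable:

# illustrative only: the accumulated "loss" here is a plain Python float
loss = 0.0
for reward in [1.0, -1.0, 0.5]:
    loss = loss + reward   # float + float stays a float

loss.backward()  # AttributeError: 'float' object has no attribute 'backward'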

Hi @apaszke, thanks for replying. Indeed, I got confused about which parts of the A3C algorithm should be Variables. There’s another open-source PyTorch A3C implementation,

rarilurelo/pytorch_a3c

which I’m working through; it seems closer to my TF implementation. Hopefully I’ll figure it out soon.

At the moment my environment code (I guess like yours) is stuck in the middle of a larger project and is not stand-alone yet. This is actually the reason why I’m moving away from TF.

Hopefully I’ll be able to wrap my environment into something like an OpenAI Gym env, open-source it on GitHub within a few days, and get back to you. It’s a nice example of graph optimisation/travelling salesman, so it should be of interest to quite a few people.

All the best,

Aj

@AjayTalati sounds good! Let me know once it’s public!

About Variables: you only need to use them for the parts that require differentiation, and it’s best not to use them for everything else. If you have any questions about specific cases, I’ll be happy to answer.
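For instance, a typical split might look like this (a minimal sketch with a placeholder network):

import torch
import torch.nn as nn
from torch.autograd import Variable

model = nn.Linear(4, 2)                        # placeholder policy network
state = torch.rand(4)                          # environment state: a plain tensor
logits = model(Variable(state.unsqueeze(0)))   # wrap only what needs gradients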

You mean:

for shared_param, local_param in zip(shared_net.parameters(), local_net.parameters()):
    shared_param.grad.data = local_param.grad.data.clone()

right?
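In context, the whole update might look like this (a sketch with placeholder nets and loss; the optimizer is built on the shared parameters):

import torch
import torch.nn as nn
import torch.optim as optim

shared_net = nn.Linear(4, 2)                 # placeholder networks
shared_net.share_memory()
local_net = nn.Linear(4, 2)
optimizer = optim.RMSprop(shared_net.parameters(), lr=1e-3)

local_net.load_state_dict(shared_net.state_dict())   # sync the local copy
loss = local_net(torch.rand(1, 4)).pow(2).mean()     # placeholder loss
loss.backward()

# push the local gradients to the shared net, then step its optimizer
for shared_param, local_param in zip(shared_net.parameters(), local_net.parameters()):
    shared_param.grad = local_param.grad.clone()     # copy, don't share storage
optimizer.step()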

@Ilya_Kostrikov - Quick question about your implementation:
Sorry in case it seems obvious to you, but I admittedly don’t have in-depth knowledge of (parallel) optimization algorithms. In your implementation, you moved the init of step, square_avg, etc. from step to __init__.

Is there another reason for this, apart from the obvious one that we can’t call share_memory before calling step for the first time?
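For reference, the pattern I mean looks roughly like this (my own paraphrase of the idea, not the actual code from the repo):

import torch
import torch.optim as optim

class SharedRMSprop(optim.RMSprop):
    def __init__(self, params, lr=7e-4, alpha=0.99, eps=1e-8):
        super(SharedRMSprop, self).__init__(params, lr=lr, alpha=alpha, eps=eps)
        # eagerly create the state in __init__ (the stock optimizer would
        # otherwise create it lazily inside step())
        for group in self.param_groups:
            for p in group['params']:
                state = self.state[p]
                state['step'] = torch.zeros(1)
                state['square_avg'] = torch.zeros_like(p.data)

    def share_memory(self):
        # the state now exists, so it can be moved to shared memory up front
        for group in self.param_groups:
            for p in group['params']:
                state = self.state[p]
                state['step'].share_memory_()
                state['square_avg'].share_memory_()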

Cheers,
Deniz

Hi @apaszke,

Sorry to bring this up again, but no one has answered my question about this.
When following the Hogwild example code, I found that the model’s params can be partially changed while I’m doing a forward pass on it.
Do you use any kind of semaphore or mutex around optim.step or the forward pass on the model?
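I.e., something like this (hypothetical - the lock is my own addition, and the Hogwild example itself uses no lock, which is exactly what I’m asking about):

import torch
import torch.nn as nn
import torch.optim as optim
import torch.multiprocessing as mp

model = nn.Linear(4, 2)       # placeholder shared model
model.share_memory()
optimizer = optim.SGD(model.parameters(), lr=0.1)
lock = mp.Lock()              # hypothetical lock shared by all workers

# inside a worker:
loss = model(torch.rand(1, 4)).sum()
loss.backward()
with lock:                    # serialize updates so a forward never sees a half-written step
    optimizer.step()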