I have a list of tensors, formed from the last layer of a convnet, from which I am sampling with the `torch.multinomial` function. I forward the sampled feature cube through an LSTM, which solves a regression problem by fitting the sampled feature cube to a labeled data set that I provide.
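For context, the sampling step looks roughly like this. This is a minimal sketch, not my actual code: the shapes, the uniform `weights`, and the names `features`/`sampled` are all illustrative, since the real dimensions come from the convnet.

```python
import torch
import torch.nn as nn

# Hypothetical sizes -- the real ones depend on the convnet's last layer.
batch, num_maps, feat_dim, hidden = 4, 10, 32, 16

# "List" of feature vectors from the convnet's last layer,
# stacked into a (batch, num_maps, feat_dim) tensor.
features = torch.randn(batch, num_maps, feat_dim)

# Sample one feature-map index per example with torch.multinomial.
weights = torch.ones(batch, num_maps)            # uniform sampling weights
idx = torch.multinomial(weights, num_samples=1)  # (batch, 1) indices

# Gather the sampled feature vectors: (batch, 1, feat_dim).
sampled = features.gather(1, idx.unsqueeze(-1).expand(-1, -1, feat_dim))

# Forward the sampled "feature cube" through an LSTM regressor.
lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
out, _ = lstm(sampled)
```

Here `out` has shape `(batch, 1, hidden)` and is what the regression loss is computed on.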

When I call `backward()` on my loss, however, I get:

```
> /home/lex/anaconda2/envs/py27/lib/python2.7/site-packages/torch/autograd/stochastic_function.py(15)_do_backward()
     13     def _do_backward(self, grad_output, retain_variables):
     14         if self.reward is _NOT_PROVIDED:
---> 15             raise RuntimeError("differentiating stochastic functions requires "
     16                                "providing a reward")
     17         result = super(StochasticFunction, self)._do_backward((self.reward,), retain_variables)

RuntimeError: differentiating stochastic functions requires providing a reward
```

What does it expect as a reward, and how do I provide one?

FWIW, I am training like so:

```
clsfx_crit = nn.CrossEntropyLoss()
regress_crit = nn.MSELoss()
clsfx_optimizer = torch.optim.Adam(resnet.parameters(), clr)
rnn_optimizer = optim.SGD(regressor.parameters(), rlr)

# Train classifier
for epoch in range(maxIter):  # run through the images maxIter times
    for i, (train_X, train_Y) in enumerate(train_loader):
        images = Variable(train_X)
        labels = Variable(train_Y)
        # rnn input
        rtargets = targ_X[:, i:i+regressLength, :]
        # reshape targets for inputs
        rtargets = Variable(rtargets.view(regressLength, -1))

        # Forward + Backward + Optimize
        clsfx_optimizer.zero_grad()
        rnn_optimizer.zero_grad()

        # predict classifier and regressor outputs
        outputs = resnet(images)
        routputs = regressor(rtrain_X)

        # compute losses
        loss = clsfx_crit(outputs, labels)
        rloss = regress_crit(routputs, rtargets)

        # backward pass
        loss.backward()
        rloss.backward()

        # step optimizers
        clsfx_optimizer.step()
        rnn_optimizer.step()
```