Backpropagate on a stochastic variable

I have a list of tensors from which I am sampling using the torch.multinomial function. I should mention that the list is formed from the last layer of a convnet. I am sampling from this list and forwarding the sampled feature cube through an LSTM. The LSTM solves a regression problem by fitting the sampled feature cube to a labeled data set that I provide.

When I call backward() on my loss, however, I get:

    /home/lex/anaconda2/envs/py27/lib/python2.7/site-packages/torch/autograd/stochastic_function.py in _do_backward(self, grad_output, retain_variables)
         13     def _do_backward(self, grad_output, retain_variables):
         14         if self.reward is _NOT_PROVIDED:
    ---> 15             raise RuntimeError("differentiating stochastic functions requires "
         16                                "providing a reward")
         17         result = super(StochasticFunction, self)._do_backward((self.reward,), retain_variables)

    RuntimeError: differentiating stochastic functions requires providing a reward
What does it expect as a reward?

FWIW, I am training like so:


    clsfx_crit = nn.CrossEntropyLoss()
    regress_crit = nn.MSELoss()

    clsfx_optimizer = torch.optim.Adam(resnet.parameters(), clr)
    rnn_optimizer = optim.SGD(regressor.parameters(), rlr)

    # Train classifier
    for epoch in range(maxIter): #run through the images maxIter times
        for i, (train_X, train_Y) in enumerate(train_loader):

            images   = Variable(train_X)
            labels   = Variable(train_Y)

            #rnn input
            rtargets = targ_X[:,i:i+regressLength,:]
            #reshape targets for inputs
            rtargets = Variable(rtargets.view(regressLength, -1))

            # Forward + Backward + Optimize
            clsfx_optimizer.zero_grad()
            rnn_optimizer.zero_grad()

            #predict classifier outs and regressor outputs
            outputs  = resnet(images)
            routputs = regressor(rtrain_X)

            #compute loss
            loss     = clsfx_crit(outputs, labels)
            rloss    = regress_crit(routputs, rtargets)

            #backward pass
            loss.backward()
            rloss.backward()

            # step optimizer
            clsfx_optimizer.step()
            rnn_optimizer.step()

When using a stochastic function in the graph (such as torch.multinomial), you have to give it a reward before backpropagating through the graph.

Here’s an example: the reinforce.py script in pytorch/examples (reinforcement_learning/reinforce.py).
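Roughly, the pattern in that example looks like this (paraphrased sketch, not the full script):

    import torch.autograd as autograd

    # forward: sample an action from the policy's output distribution
    probs = policy(state)                  # ordinary Variable (e.g. softmax output)
    action = probs.multinomial()           # stochastic Variable
    policy.saved_actions.append(action)

    # later, once a reward is available for each sampled action:
    for action, r in zip(policy.saved_actions, rewards):
        action.reinforce(r)                # attach the reward to the stochastic node
    autograd.backward(policy.saved_actions, [None for _ in policy.saved_actions])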


Thank you!

I have not done a lot of work with gym. And to be honest, I do not understand why the concept of a reward should come up in a regression problem.

Granted, the input to the LSTM is sampled using a multinomial function, but the LSTM should not have to reason about long-term rewards as in the episodic REINFORCE example that you gave above.

If you are saying that I should cast the regression problem into the form of a reinforcement learning problem, what would you suggest my reward should be before I backprop? The targets I predicted during the forward pass? How exactly would you recommend I call rloss.backward(), as in the code snippet I gave earlier?

Please pardon my many questions, but I am a little confused about the introduction of a reward variable into the graph, and a little more clarification would help greatly.

Sampling from a multinomial distribution is a discrete stochastic operation that you can’t backpropagate through. In other words, if you get exact gradients for your LSTM parameters by backpropagating through the regression and LSTM parts of your network, you can’t continue this process and get exact gradients for your convnet parameters.
You can do either of two other things, though. You can set the parameters of the convnet to requires_grad=False, so they are fixed and don’t need gradients. Or you can use a strategy for estimating the gradients of the convnet parameters; in order to do this you need to provide the reward (or the negative of the loss) for each sample directly to the stochastic operation node. This is a process that’s typically used in policy gradient reinforcement learning, so the API uses terms and concepts from that world.
The math behind this is laid out in a paper called “Gradient Estimation Using Stochastic Computation Graphs” by Schulman et al.
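For the first option, a minimal sketch using the names from your training snippet:

    # Option 1: treat the convnet as a fixed feature extractor
    for p in resnet.parameters():
        p.requires_grad = False   # the convnet is fixed, so no gradients flow into it

    # and only optimize the regressor/LSTM parameters
    rnn_optimizer = torch.optim.SGD(regressor.parameters(), rlr)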


Thank you very much indeed for the explanation and the paper reference.

I just read through Schulman’s paper. I am, however, unclear about certain aspects of your answer:

(i) Why did you mention that I have to use the negative of the loss function as an argument when I am calling rloss.backward()?

(ii) What are best practices for choosing reward scalars if I decide against using the negative of the loss for estimating the gradients?

Would appreciate your response.

Thanks!

Loss goes down, reward goes up. Hence the reward can be the negative of the loss function.
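In code terms, something like this (sketch, using the names from your earlier snippet):

    reward = -rloss.data   # negated regression loss used as the reward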


Thanks!

And sorry for the bother. I am doing the backprop like this:

            #predict classifier outs and regressor outputs
            outputs  = resnet(images)
            routputs = regressor(rtrain_X)

            #compute loss
            loss     = clsfx_crit(outputs, labels)
            rloss    = regress_crit(routputs, rtargets)

            #backward pass
            loss.backward()
            rloss.backward(-rloss)

            # step optimizer
            clsfx_optimizer.step()
            rnn_optimizer.step()

            print ("Epoch [%d/%d], Iter [%d] cLoss: %.8f, rLoss: %.4f" %(epoch+1, maxIter, i+1,
                                                loss.data[0], rloss.data[0]))

My rloss is clearly not a tuple, but the stack trace gives this error:

    rloss = Variable containing:
    1.00000e+05 *
      4.8973
    [torch.cuda.DoubleTensor of size 1 (GPU 0)]

    /home/lex/anaconda2/envs/py27/lib/python2.7/site-packages/torch/autograd/variable.pyc in backward(self, gradient=Variable containing:
    1.00000e+05 *
     -4.8973
    [torch.cuda.DoubleTensor of size 1 (GPU 0)]
    , retain_variables=False)
        144                     'or with gradient w.r.t. the variable')
        145             gradient = self.data.new().resize_as_(self.data).fill_(1)
    --> 146         self._execution_engine.run_backward((self,), (gradient,), retain_variables)
        147
        148     def register_hook(self, hook):

    RuntimeError: element 0 of gradients tuple is not a Tensor or None

What type of data does backward really expect?

Calling backward with an explicitly specified tensor does not help either:

            rloss.backward(torch.Tensor([1]).cuda())

gives

    /home/lex/anaconda2/envs/py27/lib/python2.7/site-packages/torch/autograd/stochastic_function.pyc in _do_backward(self=<torch.autograd._functions.stochastic.Multinomial object>, grad_output=(), retain_variables=True)
         13     def _do_backward(self, grad_output, retain_variables):
         14         if self.reward is _NOT_PROVIDED:
    ---> 15             raise RuntimeError("differentiating stochastic functions requires "
         16                                "providing a reward")
         17         result = super(StochasticFunction, self)._do_backward((self.reward,), retain_variables)

    RuntimeError: differentiating stochastic functions requires providing a reward

See the example that I pointed you to; you are doing this wrong. You need to call .reinforce() on the stochastic outputs before calling backward().
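Adapted to your snippet, roughly (sketch; sampled stands for whatever Variable your multinomial call returns, and the reward’s dtype/shape has to match what .reinforce() expects for your samples):

    # inside the training loop (hypothetical names for the sampling step)
    sampled  = features.multinomial(numSamples)   # the stochastic Variable
    routputs = regressor(sampled)
    rloss    = regress_crit(routputs, rtargets)

    sampled.reinforce(-rloss.data)   # reward = negative loss, attached to the stochastic node
    rloss.backward()                 # plain backward(); no gradient argument for a scalar loss

    rnn_optimizer.step()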

Thank you very kindly. I indeed see my error.

Also, I noticed in the example that the saved_actions are drawn from a multinomial distribution (hence they are the stochastic variables). I am not all too familiar with gym’s API, but it seems the rewards variable is the result of a deterministic env.step(...) call. Am I correct?

I have a similar code structure. I sample from a list of tensors using the multinomial function and then save the results from the sampling into a list of actions in my network model class.

What should the reward variable in my situation be? The output I get when I forward my stochastic variable through my LSTM regressor? If I do that, it rightly raises a runtime error:

    raise RuntimeError("reinforce() can be only called on outputs "
RuntimeError: reinforce() can be only called on outputs of stochastic functions

So action in your example is a stochastic variable, I agree. But you called action.reinforce(r) and everything is hunky-dory. I guess my question is: what is the rationale for doing this

    R = 0
    rewards = []
    for r in policy.rewards[::-1]:
        R = r + args.gamma * R
        rewards.insert(0, R)
    rewards = torch.Tensor(rewards)
    rewards = (rewards - rewards.mean()) / (rewards.std() + np.finfo(np.float32).eps)

and is r in policy.rewards[::-1] itself stochastic?

In the RL example, action is the immediate result of multinomial (i.e., it’s a stochastic variable) and reward is an ordinary tensor.
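Schematically (paraphrasing that example):

    probs  = policy(state)          # ordinary Variable
    action = probs.multinomial()    # stochastic Variable: .reinforce() works here
    # ... later, after the reward r has been computed ...
    action.reinforce(r)             # r is a plain reward tensor/number, not stochastic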


Could you point me to where this reinforce function is defined and implemented? For some weird reason, I keep getting

RuntimeError: reinforce() can be only called on outputs of stochastic functions

when I call reinforce as follows:

"""
   regress input is the output of a multinomial sampling i.e. Variable containing: torch.DoubleTensor of size 4096x1
   r is a Variable containing: 0.2610 [torch.DoubleTensor of size 1]
"""
   regress_input.reinforce(r)

Looks like the problem is coming from me. r was a LongTensor in my case. Casting it to double seems to fix the issue.
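For the record, the fix was along these lines (sketch):

    r = r.double()   # cast the reward to match the sampled tensor's dtype (it was a LongTensor)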

Still, I do not understand how rewards are generated, as in the reinforce.py example that @smth earlier mentioned. In the gym example, they seem to be the output of the env.step() function. In my case, I am extracting the last layer of a convnet and sampling from this layer to generate the input to a regressor network. In order to call backward on the stochastic input, I have learnt that I have to provide the rewards by calling .reinforce(). I am not sure where the rewards should be populated in my framework. For now, I am doing something like this:

    R, rewards = 0,  []
    for r in policy.rewards[::-1]:   #my policy.rewards is initially an empty list as in the example
        R = r + args.gamma * R
        rewards.insert(0, R)
    rewards = torch.Tensor(rewards)
    #...   ...  ...   ....  ...

From Ronald Williams’ paper, the REINFORCE update rule is given by:

\Delta w_{ij} = \alpha_{ij} (r - b_{ij} ) e_{ij}

When we call the StochasticVariable.reinforce(someTensor) function in PyTorch, what is going on under the hood? How are the reinforcement baseline b_{ij}, the characteristic eligibility e_{ij}, and the learning rate factor \alpha_{ij} computed?
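My current reading is that the estimator behind rules of this form is the score-function identity (my own summary, not necessarily what PyTorch implements internally):

\nabla_\theta \, \mathbb{E}_{a \sim p_\theta}[r(a)] = \mathbb{E}_{a \sim p_\theta}\left[ r(a) \, \nabla_\theta \log p_\theta(a) \right] \approx r \, \nabla_\theta \log p_\theta(a), \quad a \sim p_\theta,

with the baseline b_{ij} and the learning rate \alpha_{ij} presumably handled outside the estimator itself (e.g. by normalizing the rewards, as in the example above, and by the optimizer, respectively).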

I would like to know, to be sure I am doing the correct thing. Sorry for the trouble, but thanks for your help so far. I seem to be getting there.