I am trying to sample from a Categorical distribution so I can apply the REINFORCE algorithm to a toy problem.
What I found is that sampling more than one scalar from the distribution raises an error when the log probability is computed.
Here is an example:
x = Variable(torch.Tensor([[0.1, 0.2, 0.1, 0.25, 0.25, 0.1]]), requires_grad=True)
print(x.size())                          # torch.Size([1, 6])
m = Categorical(x)
action = m.sample_n(5)                   # draw 5 samples
print('action: ', action.size())         # torch.Size([5, 1])
# next_state, reward = env.step(action)
loss = -m.log_prob(action.unsqueeze(0))  # * reward
print('loss: ', loss)
loss.backward()
We get:
torch.Size([1, 6])
action: torch.Size([5, 1])
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-115-d59bfdfba3ea> in <module>()
6 print('action: ', action.size())
7 # next_state, reward = env.step(action)
----> 8 loss = -m.log_prob(action.unsqueeze(0)) #* reward
9 print('loss: ', loss)
10 loss.backward()
~/anaconda3/envs/py35/lib/python3.5/site-packages/torch/distributions.py in log_prob(self, value)
151 return p.gather(-1, value).log()
152
--> 153 return p.gather(-1, value.unsqueeze(-1)).squeeze(-1).log()
154
155
RuntimeError: invalid argument 4: Index tensor must have same dimensions as input tensor at /opt/conda/conda-bld/pytorch_1512383260527/work/torch/lib/TH/generic/THTensorMath.c:503
I looked into the source code for Categorical and nothing seems out of place.
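From the error, it looks like the probs tensor (shape [1, 6]) and the index built from the samples (shape [5, 1]) end up with different numbers of dimensions inside gather. As a sanity check, a variant with a 1-D probability vector seems like it should avoid that mismatch; this is only a sketch based on the shapes (probs_1d is just my renaming), and I have not confirmed it is the intended usage:
import torch
from torch.autograd import Variable
from torch.distributions import Categorical

# Same probabilities as above, but as a 1-D vector instead of shape [1, 6]
probs_1d = Variable(torch.Tensor([0.1, 0.2, 0.1, 0.25, 0.25, 0.1]), requires_grad=True)

m = Categorical(probs_1d)
action = m.sample_n(5)         # shape [5]
log_p = m.log_prob(action)     # shape [5], one log-probability per sample
loss = -log_p                  # * reward would go here
loss.mean().backward()         # reduce to a scalar before calling backward()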
Computing the REINFORCE loss myself, bypassing log_prob, raises no errors:
m = Categorical(x)
action = m.sample_n(5)
p = x / x.sum(-1, keepdim=True)   # normalized probabilities, shape [1, 6]
reward = 1
loss = -p.log() * reward          # negative log-probabilities scaled by the reward
print('loss ', loss.size())
loss.mean().backward()
Output:
action: torch.Size([5, 1])
loss torch.Size([1, 6])
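Note that this computes the log of every probability rather than just the sampled actions. Picking out only the sampled entries by hand, mirroring the gather that log_prob does in the source above, also runs for me; again just a sketch, reusing the [1, 6] x, the [5, 1] action, and reward from the block above:
p = x / x.sum(-1, keepdim=True)                   # normalized probabilities, shape [1, 6]
idx = action.squeeze(-1)                          # sampled indices, shape [5]
log_p = p.squeeze(0).index_select(0, idx).log()   # log-probs of the sampled actions, shape [5]
loss = -log_p * reward
loss.mean().backward()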
Could someone please help?