How to convert a Variable [0.8, 0.1, 0.1, 0] to [1, 0, 0, 0] (0.8 is the maximum, so the maximum becomes 1 and everything else 0) inside a computational graph (with gradient backprop)?


#1

How can I convert a Variable [0.8, 0.2, 0] to [1, 0, 0] (0.8 is the maximum, so the maximum becomes 1 and everything else 0) inside a computational graph (with gradient backprop)?


(Alexis David Jacq) #2
v = torch.autograd.Variable(torch.Tensor([0.8,0.1,0.1,0]), requires_grad=True)
out = (v==torch.max(v)).float()

#3

RuntimeError: inconsistent tensor size at /home/jcc/pytorch/torch/lib/TH/generic/THTensorMath.c:2668


#4

The error is raised on the second line.


(Alexis David Jacq) #5

Which version of PyTorch are you using?

In older versions, you had to compare two tensors of the same size:

(v==torch.max(v).expand_as(v)).float()
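
For readers on a recent PyTorch release, here is a minimal sketch of the same mask. It assumes a version where `torch.tensor` exists and comparisons broadcast, so neither `Variable` nor `expand_as` is needed. It also shows that the result is not tracked by autograd.

```python
import torch

v = torch.tensor([0.8, 0.1, 0.1, 0.0], requires_grad=True)

# v.max() is a 0-dim tensor; broadcasting compares it with every element of v.
out = (v == v.max()).float()

print(out)                # mask with 1.0 at the maximum, 0.0 elsewhere
print(out.requires_grad)  # comparison results are not part of the autograd graph
```

Note that ties would produce more than one 1 in the mask.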

#6

My version is 0.1.11. Thanks, it works!


#7

Given m1 = nn.Linear(100, 50), m1 maps a Variable A (4 * 100) to a Variable B (4 * 50). Suppose the parameters
of m1 are W1 (a 100 * 50 tensor) and b1 (a 50 * 1 tensor).
If I treat W1 as a Variable and, given C (a 100 * 50 tensor), do something like:

B = m1(A)
D = W1+C
loss1 = loss_func1(D,target1)
loss2 = loss_func2(B,target2)
loss=loss1+loss2
loss.backward()

What does the gradient for W1 look like, given that it is a parameter of m1 rather than a plain Variable? Is there anything special about it?


(Alexis David Jacq) #8

Nothing special I suppose. For example with an MSELoss, the gradient of W1 should be something like (ignoring dimensions) :

W.grad = d(l1)/d(W) + d(l2)/d(W) = 2 (W+C-T1) + 2 (W*A+b1-T2)*A
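
Written out with dimensions, the formula above might look as follows. This is a sketch that assumes sum-of-squares losses and the PyTorch layout where the weight W of nn.Linear(100, 50) has shape 50 x 100, A is 4 x 100, and B = A W^T + b (bias broadcast over rows). With the default nn.MSELoss, which averages over elements, each term would additionally be divided by the number of elements in its loss.

```latex
\begin{aligned}
l_1 &= \lVert W + C - T_1 \rVert_F^2, &
\frac{\partial l_1}{\partial W} &= 2\,(W + C - T_1), \\
l_2 &= \lVert B - T_2 \rVert_F^2,\quad B = A W^{\top} + \mathbf{1} b^{\top}, &
\frac{\partial l_2}{\partial W} &= 2\,(B - T_2)^{\top} A, \\
& &
\frac{\partial (l_1 + l_2)}{\partial W} &= 2\,(W + C - T_1) + 2\,(B - T_2)^{\top} A .
\end{aligned}
```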

(Alexis David Jacq) #9

You can even check:

import torch
import torch.nn as nn
from torch.autograd import Variable

M = nn.Linear(100,50)
W = M.weight  # weight of the layer, shape (50, 100)

C = Variable(torch.rand(50,100),requires_grad=True)
A = Variable(torch.rand(4,100),requires_grad=True)


T1 = Variable(torch.rand(50,100))
T2 = Variable(torch.rand(4,50))

B = M(A)
D = W + C
l1 = torch.sum((D-T1)**2)  # sum-of-squares loss on the weights
l2 = torch.sum((B-T2)**2)  # sum-of-squares loss on the layer output
l = l1 + l2
l.backward()

# Gradient computed by hand
x = 2*(D - T1)                            # d(l1)/d(W)
y = (B - T2).transpose(0,1).unsqueeze(0)  # shape (1, 50, 4)
z = A.unsqueeze(0)                        # shape (1, 4, 100)
t = 2*torch.bmm(y,z).squeeze()            # d(l2)/d(W), shape (50, 100)

grad_test = x+t # 2 (W+C-T1) + 2 (W*A+b1-T2)*A

print(torch.sum((W.grad-grad_test)**2))

Variable containing:
0
[torch.FloatTensor of size 1]
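
The same point can be checked without deriving anything by hand. The sketch below assumes a recent PyTorch release (plain tensors instead of `Variable`) and uses `torch.autograd.grad` to get each loss's contribution separately, then compares their sum with what `backward()` stores in `W.grad`. The seed and tolerance are arbitrary choices for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

M = nn.Linear(100, 50)
W = M.weight                      # parameter of the layer, shape (50, 100)
C = torch.rand(50, 100)
A = torch.rand(4, 100)
T1 = torch.rand(50, 100)
T2 = torch.rand(4, 50)

l1 = ((W + C - T1) ** 2).sum()    # loss that uses W directly
l2 = ((M(A) - T2) ** 2).sum()     # loss that uses W through the layer

# Per-loss gradients; retain_graph keeps the graph alive for backward() below.
(g1,) = torch.autograd.grad(l1, W, retain_graph=True)
(g2,) = torch.autograd.grad(l2, W, retain_graph=True)

(l1 + l2).backward()

# W.grad should be the sum of the two contributions.
print(torch.allclose(W.grad, g1 + g2, atol=1e-5))
```

Being a module parameter does not change anything here: W is a leaf of the graph, and every path that uses it adds its contribution to `W.grad`.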


(Simon Wang) #10

The original thing you wanted is not differentiable


#12

what do you mean? What is the “original thing”?


(Simon Wang) #13

I meant this argmax-like operation is not differentiable.


#14

how about this operation:

  v = torch.autograd.Variable(torch.Tensor([0.8,0.1,0.1,0]), requires_grad=True)
  out = (v==torch.max(v).expand_as(v)).float()

Do the variables v and out have gradients?

@alexis-jacq
@SimonW


(Simon Wang) #15

Isn’t that the same thing? The expand_as call doesn’t do anything here.

And also, argmax operation is never differentiable.
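
One way to see what this means in practice: the mask is piecewise constant, so nudging any input slightly leaves the output unchanged, and wherever a derivative exists it is zero. A minimal sketch, assuming a recent PyTorch release; `one_hot_max` and `eps` are names introduced here for illustration only.

```python
import torch

def one_hot_max(x):
    # 1.0 where x equals its maximum, 0.0 elsewhere.
    return (x == x.max()).float()

v = torch.tensor([0.8, 0.1, 0.1, 0.0])
base = one_hot_max(v)
eps = 1e-3  # small nudge, chosen for illustration

unchanged = []
for i in range(v.numel()):
    bumped = v.clone()
    bumped[i] += eps
    # If the output does not move, the finite difference for input i is zero.
    unchanged.append(torch.equal(one_hot_max(bumped), base))

print(unchanged)
```

The output only changes when a nudge is large enough to change which element is the maximum, and at that point it jumps, so no useful gradient can reach v.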


#16

Thanks for this post, and for the code you provided in #9 above.

Besides, suppose I want to select (slice) one column of the W1 (100, 50) matrix, giving W1_c (a 100 * 1 Variable). Converting W1 to W1_c needs a (50 * 1) Variable with a single 1 and 49 zeros, right (like [0,0,0,0,1,0,0,…,0,0])? Let’s call this Variable Select_W. What if Select_W is created by something like:

v = torch.autograd.Variable(torch.Tensor([0.8,0.1,0.1,0.1,......,0.3]), requires_grad=True)
Select_W = (v==torch.max(v).expand_as(v)).float()
More_0 = W1 * Select_W
loss3 = loss_func3(More_0,target3)

How do gradients flow among Select_W, More_0, v, and the Variables A, B, etc.? Does Select_W get no gradient? Does a Variable created by a slicing operation have no gradient? I think this computational graph can actually be trained, but I am confused by the gradient flow.
@SimonW
@alexis-jacq


(Alexis David Jacq) #17

You can check that out.requires_grad is False. @SimonW is right to point out that argmax is not differentiable (although here it is an indicator function rather than a true “argmax”). However, it is possible to define the derivative of the function as f'(x)=0 for all x.

For instance,
f(x,t) = Softmax(x*t) --> argmax(x) when t --> +inf
and, correspondingly, f'(x,t) --> 0
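
A small sketch of this limit, assuming a recent PyTorch release where `F.softmax` takes a `dim` argument. The temperatures are arbitrary illustration values. For each t it records the softmax output and the largest absolute entry of the gradient of the first output with respect to the input.

```python
import torch
import torch.nn.functional as F

v = torch.tensor([0.8, 0.1, 0.1, 0.0])

outputs = {}
max_grads = {}
for t in (1.0, 10.0, 100.0):
    x = v.clone().requires_grad_(True)
    y = F.softmax(x * t, dim=0)
    # Gradient of the first output (the would-be "1") with respect to x.
    (g,) = torch.autograd.grad(y[0], x)
    outputs[t] = y.detach()
    max_grads[t] = g.abs().max().item()

for t in outputs:
    print(t, outputs[t], max_grads[t])
```

As t grows the output approaches the one-hot mask and the gradient shrinks toward zero, so a very large t behaves much like the non-differentiable version during training.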

One simple solution could be:

v = Variable(torch.Tensor([0.8,0.1,0.1,0]), requires_grad=True)
y = F.softmax(v*100)
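
To connect this with the column-selection question in #16, here is a hedged sketch comparing the hard mask with the softmax version. It assumes a recent PyTorch release; the small shapes (6 x 4 instead of 100 x 50), the sum used as a stand-in loss, and the temperature of 10 are illustration choices, not taken from the thread.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

W1 = torch.rand(6, 4, requires_grad=True)   # stand-in for the 100 x 50 matrix
v = torch.tensor([0.8, 0.1, 0.1, 0.3], requires_grad=True)

# Hard selection: the mask is a constant as far as autograd is concerned.
hard = (v == v.max()).float()
(W1 * hard).sum().backward()
w_grad_hard = W1.grad.clone()   # W1 still receives a gradient
v_grad_hard = v.grad            # stays None: no path from the loss back to v

# Soft selection: softmax keeps v inside the graph.
W1.grad = None
soft = F.softmax(v * 10, dim=0)
(W1 * soft).sum().backward()
v_grad_soft = v.grad            # now populated

print(v_grad_hard, v_grad_soft)
```

With the hard mask the graph can still be trained, but only W1 is updated; v is treated as fixed. A moderate temperature keeps the gradient to v from vanishing, at the price of a mask that is only approximately one-hot.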

#18

@alexis-jacq
Thanks. Besides, what about this question?