How to convert a Variable [0.8,0.1,0.1,0] to [1,0,0,0] (the maximum, 0.8, becomes 1 and everything else becomes 0) inside a computational graph (with gradient backprop)?

brisker · November 13, 2017, 9:08am

How can I convert a Variable [0.8,0.2,0] to [1,0,0] (the maximum, 0.8, becomes 1 and everything else becomes 0) inside a computational graph (with gradient backprop)?

alexis-jacq · November 13, 2017, 9:26am

v = torch.autograd.Variable(torch.Tensor([0.8,0.1,0.1,0]), requires_grad=True)
out = (v==torch.max(v)).float()

brisker · November 13, 2017, 9:29am

RuntimeError: inconsistent tensor size at /home/jcc/pytorch/torch/lib/TH/generic/THTensorMath.c:2668

brisker · November 13, 2017, 9:29am

error, in the second line

alexis-jacq · November 13, 2017, 9:30am

Which version of PyTorch are you using?

In older versions, you had to compare two tensors of the same size:

(v==torch.max(v).expand_as(v)).float()
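
For readers on a recent PyTorch release, here is a minimal sketch of the two forms side by side. It assumes plain tensors in place of Variable (the two were merged in later releases) and relies on broadcasting for the first comparison; the names out_broadcast and out_expanded are only illustrative.

```python
import torch

v = torch.tensor([0.8, 0.1, 0.1, 0.0], requires_grad=True)

# Broadcasting compares every element against the scalar maximum.
out_broadcast = (v == torch.max(v)).float()

# Explicit expansion, as older releases required; it yields the same mask.
out_expanded = (v == torch.max(v).expand_as(v)).float()

print(out_broadcast)
print(torch.equal(out_broadcast, out_expanded))
```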

brisker · November 13, 2017, 9:34am

my version is 0.1.11, thanks, it works!

brisker · December 6, 2017, 12:37pm

Given m1=nn.Linear(100,50), m1 converts a Variable A (4 * 100) to a Variable B (4 * 50). Suppose the parameters
of m1 are W1 (a 100 * 50 tensor) and b1 (a 50 * 1 tensor).
If I treat W1 as a Variable and, given C (a 100 * 50 tensor), do something like:

B = m1(A)
D = W1+C
loss1 = loss_func1(D,target1)
loss2 = loss_func2(B,target2)
loss=loss1+loss2
loss.backward()

What does the gradient of W1 look like, given that it is a parameter of m1 rather than a plain Variable? Is there anything special?

alexis-jacq · December 6, 2017, 1:57pm

Nothing special, I suppose. For example, with an MSELoss, the gradient of W1 should be something like this (ignoring dimensions and the averaging factor of MSELoss):

W.grad = d(l1)/d(W) + d(l2)/d(W) = 2 (W+C-T1) + 2 (W*A+b1-T2)*A

alexis-jacq · December 6, 2017, 2:29pm

You can even check:

import torch
import torch.nn as nn
from torch.autograd import Variable

M = nn.Linear(100,50)
W = M.weight  # stored with shape (50, 100)

C = Variable(torch.rand(50,100),requires_grad=True)
A = Variable(torch.rand(4,100),requires_grad=True)


T1 = Variable(torch.rand(50,100))
T2 = Variable(torch.rand(4,50))

B = M(A)
D = W + C
l1 = torch.sum((D-T1)**2)
l2 = torch.sum((B-T2)**2)
l = l1 + l2
l.backward()

x = 2*(D - T1) 
y = (B - T2).transpose(0,1).unsqueeze(0)
z = A.unsqueeze(0)
t = 2*torch.bmm(y,z).squeeze()

grad_test = x+t # 2 (W+C-T1) + 2 (W*A+b1-T2)*A

print(torch.sum((W.grad-grad_test)**2))

Variable containing:
0
[torch.FloatTensor of size 1]
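
The same check can be sketched for a recent PyTorch release. This version assumes plain tensors in place of Variable and uses torch.allclose with a small tolerance rather than expecting an exact zero, since floating-point round-off may leave a tiny residual. The names grad_expected and grads_match are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

M = nn.Linear(100, 50)
W = M.weight                      # shape (50, 100)

C = torch.rand(50, 100)
A = torch.rand(4, 100)
T1 = torch.rand(50, 100)
T2 = torch.rand(4, 50)

B = M(A)                          # shape (4, 50)
D = W + C                         # shape (50, 100)
loss = torch.sum((D - T1) ** 2) + torch.sum((B - T2) ** 2)
loss.backward()

# Analytic gradient: 2 (W + C - T1) + 2 (B - T2)^T A
with torch.no_grad():
    grad_expected = 2 * (D - T1) + 2 * (B - T2).t() @ A

grads_match = torch.allclose(W.grad, grad_expected, atol=1e-4)
print(grads_match)
```

The matrix product (B - T2)^T A plays the role of the bmm call above: it sums the per-sample outer products over the batch of 4.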

SimonW · December 6, 2017, 4:07pm

The original thing you wanted is not differentiable

brisker · December 6, 2017, 4:20pm

what do you mean? What is the “original thing”?

SimonW · December 6, 2017, 4:22pm

I meant this argmax-like operation is not differentiable.

brisker · December 6, 2017, 4:27pm

how about this operation:

  v = torch.autograd.Variable(torch.Tensor([0.8,0.1,0.1,0]), requires_grad=True)
  out = (v==torch.max(v).expand_as(v)).float()

Do the Variables v and out have gradients?

@alexis-jacq
@SimonW

SimonW · December 6, 2017, 4:29pm

Isn’t that the same thing? The expand_as call doesn’t do anything here.

And also, argmax operation is never differentiable.
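
A quick way to see this is to inspect the result of the comparison. The sketch below assumes a recent PyTorch release with plain tensors; the flag backward_failed is only illustrative.

```python
import torch

v = torch.tensor([0.8, 0.1, 0.1, 0.0], requires_grad=True)
out = (v == torch.max(v)).float()

# The comparison produces a boolean tensor, which cuts the autograd graph,
# so the float copy has no gradient history.
print(out.requires_grad)
print(out.grad_fn)

# Calling backward through the mask raises, because nothing requires grad.
try:
    out.sum().backward()
    backward_failed = False
except RuntimeError:
    backward_failed = True
print(backward_failed)
```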

brisker · December 6, 2017, 4:40pm

Thanks for this post (here) and for the code you provided above.

Besides, suppose I want to select (slice) one column of the W1 (100 * 50) matrix, giving W1_c (a 100 * 1 Variable). Converting W1 to W1_c obviously needs a (50 * 1) Variable with a single 1 and 49 zeros, right (like [0,0,0,0,1,0,0,…,0,0])? Let’s call this Variable Select_W. What if Select_W is created by something like:

v = torch.autograd.Variable(torch.Tensor([0.8,0.1,0.1,0.1,......,0.3]), requires_grad=True)
Select_W = (v==torch.max(v).expand_as(v)).float()
More_0 = W1 * Select_W
loss3 = loss_func3(More_0,target3)

So how do gradients flow among the Variables Select_W, More_0, v, A, B, etc.? Are there no gradients for Select_W? Does a Variable created by a slicing operation have no gradients? I think this computational graph can actually be trained, but I am confused by the gradient flow.
@SimonW
@alexis-jacq
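
One way to inspect this gradient flow is to run a small version of the graph. The sketch below assumes a recent PyTorch release, a randomly initialised W1 standing in for the real weight, a random v, and a hypothetical sum-of-squares loss in place of loss_func3; none of these come from the original setup.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for the weight and the selection scores.
W1 = torch.rand(100, 50, requires_grad=True)
v = torch.rand(50, requires_grad=True)

# One-hot mask built from a comparison; it carries no gradient history.
select_w = (v == torch.max(v)).float()

# Broadcasting zeroes every column of W1 except the selected one.
more_0 = W1 * select_w

# Hypothetical loss standing in for loss_func3(More_0, target3).
loss3 = torch.sum(more_0 ** 2)
loss3.backward()

idx = int(torch.argmax(v))
# Columns of W1 that received a non-zero gradient.
nonzero_cols = W1.grad.abs().sum(dim=0).nonzero().flatten().tolist()

print(nonzero_cols == [idx])
print(select_w.requires_grad)
print(v.grad)
```

Under these assumptions the graph is trainable with respect to W1, but only the selected column receives a gradient, and nothing flows back to v through the mask.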

alexis-jacq · December 6, 2017, 5:34pm

You can check that out.requires_grad is False. @SimonW is right to note that argmax (although here it is an indicator function rather than a true “argmax”) is not differentiable. However, it is possible to define the derivative of the function as f'(x)=0 for all x.

For instance,
f(x,t) = Softmax(x*t) --> argmax(x) when t --> +inf
and, respectively, f'(x,t) --> 0

One simple solution could be:

import torch.nn.functional as F

v = Variable(torch.Tensor([0.8,0.1,0.1,0]), requires_grad=True)
y = F.softmax(v*100, dim=0)  # close to [1,0,0,0], and still differentiable
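
The trade-off between sharpness and gradient size can be sketched as follows. This assumes a recent PyTorch release (plain tensors, dim passed explicitly); the scale factors 1, 10 and 100 and the results dictionary are only illustrative.

```python
import torch
import torch.nn.functional as F

v = torch.tensor([0.8, 0.1, 0.1, 0.0], requires_grad=True)

results = {}
for t in (1.0, 10.0, 100.0):
    y = F.softmax(v * t, dim=0)
    # Gradient of the largest output with respect to v.
    (grad,) = torch.autograd.grad(y[0], v)
    results[t] = (y.detach().tolist(), grad.abs().max().item())

for t, (probs, max_grad) in results.items():
    print(t, probs, max_grad)
```

As t grows the output approaches the one-hot vector, but the gradient shrinks toward zero, so a very large scale factor may leave little signal to train on.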

brisker · December 6, 2017, 5:53pm

@alexis-jacq
Thanks. Besides, what about this question?