# How to convert a Variable [0.8, 0.1, 0.1, 0] to [1, 0, 0, 0] (the maximum becomes 1, everything else 0) in a computational graph (with gradient backprop)?

**alexis-jacq**(Alexis David Jacq) #2

```
import torch

v = torch.autograd.Variable(torch.Tensor([0.8,0.1,0.1,0]), requires_grad=True)
out = (v==torch.max(v)).float()  # 1 where v equals its maximum, 0 elsewhere
```

**brisker**#3

RuntimeError: inconsistent tensor size at /home/jcc/pytorch/torch/lib/TH/generic/THTensorMath.c:2668

**alexis-jacq**(Alexis David Jacq) #5

Which version of PyTorch are you using?

In older versions, you had to compare two tensors of the same size:

```
(v==torch.max(v).expand_as(v)).float()
```

**brisker**#7

Given m1 = nn.Linear(100,50), where m1 converts Variable A (4 * 100) to Variable B (4 * 50), suppose the parameters of m1 are W1 (a 100 * 50 tensor) and b1 (a 50 * 1 tensor).

If I take W1 as a Variable and, given C (a 100 * 50 tensor), do something like:

```
B = m1(A)
D = W1+C
loss1 = loss_func1(D,target1)
loss2 = loss_func2(B,target2)
loss=loss1+loss2
loss.backward()
```

What does the gradient for W1 look like, given that it is a parameter of m1 and not purely a Variable? Is there anything special?

**alexis-jacq**(Alexis David Jacq) #8

Nothing special, I suppose. For example, with an MSELoss, the gradient of W1 should be something like (ignoring dimensions):

```
W.grad = d(l1)/d(W) + d(l2)/d(W) = 2 (W+C-T1) + 2 (W*A+b1-T2)*A
```
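With the dimensions written out, the same expression can be sketched as follows. This assumes sum-of-squares losses (no 1/N averaging, unlike the default `nn.MSELoss` reduction), that `W` is stored as a 50 × 100 matrix (the layout `nn.Linear(100,50)` uses for its weight), that `A` is 4 × 100, and that the layer computes `A Wᵀ + b1` with the bias broadcast over rows. The symbols `l_1`, `l_2` and the all-ones vector are notation introduced here for illustration.

```latex
\begin{aligned}
l_1 &= \lVert W + C - T_1 \rVert_F^2, \qquad
l_2 = \lVert A W^\top + \mathbf{1} b_1^\top - T_2 \rVert_F^2, \\
\frac{\partial l_1}{\partial W} &= 2\,(W + C - T_1), \\
\frac{\partial l_2}{\partial W} &= 2\,\bigl(A W^\top + \mathbf{1} b_1^\top - T_2\bigr)^\top A, \\
\frac{\partial (l_1 + l_2)}{\partial W} &= 2\,(W + C - T_1) + 2\,\bigl(A W^\top + \mathbf{1} b_1^\top - T_2\bigr)^\top A .
\end{aligned}
```

Both terms have shape 50 × 100, so they add directly into `W.grad`.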

**alexis-jacq**(Alexis David Jacq) #9

You can even check:

```
import torch
import torch.nn as nn
from torch.autograd import Variable

M = nn.Linear(100,50)
W = M.weight  # stored with shape (50, 100)
C = Variable(torch.rand(50,100),requires_grad=True)
A = Variable(torch.rand(4,100),requires_grad=True)
T1 = Variable(torch.rand(50,100))
T2 = Variable(torch.rand(4,50))
B = M(A)
D = W + C
# Sum-of-squares losses, so the analytic gradient has no 1/N factor
l1 = torch.sum((D-T1)**2)
l2 = torch.sum((B-T2)**2)
l = l1 + l2
l.backward()
# Analytic gradient of l with respect to W
x = 2*(D - T1)
y = (B - T2).transpose(0,1).unsqueeze(0)
z = A.unsqueeze(0)
t = 2*torch.bmm(y,z).squeeze()
grad_test = x+t # 2 (W+C-T1) + 2 (W*A+b1-T2)*A
print(torch.sum((W.grad-grad_test)**2))
```

Variable containing:

0

[torch.FloatTensor of size 1]

**brisker**#14

How about this operation:

```
v = torch.autograd.Variable(torch.Tensor([0.8,0.1,0.1,0]), requires_grad=True)
out = (v==torch.max(v).expand_as(v)).float()
```

Do the Variables v and out have gradients?

**SimonW**(Simon Wang) #15

Isn't that the same thing? The `expand_as` call doesn't do anything here.

Also, the argmax operation is never differentiable.
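This can be checked directly. The snippet below is a minimal sketch, assuming a recent PyTorch (0.4 or later) where a tensor created with `requires_grad=True` plays the role of a `Variable`; it only inspects the result of the comparison and does not try to backpropagate through it.

```python
import torch

# Assumes PyTorch >= 0.4, where tensors with requires_grad=True replace Variable.
v = torch.tensor([0.8, 0.1, 0.1, 0.0], requires_grad=True)

# The comparison returns a boolean mask; autograd records no history for it.
out = (v == torch.max(v)).float()

print(out)                # one-hot at the position of the maximum
print(out.requires_grad)  # False: there is nothing to backpropagate through
print(out.grad_fn)        # None: out is not connected to v in the graph
```

Calling `backward()` on something computed only from `out` would therefore fail, because no part of it requires a gradient.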

**brisker**#16

Thanks for this post: here

and the code you provided above.

Besides, if I want to **select (slice)** one column of the W1 (100 * 50) matrix, giving W1_c (a 100 * 1 Variable), then converting W1 to W1_c obviously needs a (50 * 1) Variable with a single 1 and 49 zeros, right (like [0,0,0,0,1,0,0,…,0,0])? Let's call this Variable Select_W. So what if Select_W is created by something like:

```
v = torch.autograd.Variable(torch.Tensor([0.8,0.1,0.1,0.1,......,0.3]), requires_grad=True)
Select_W = (v==torch.max(v).expand_as(v)).float()
More_0 = W1 * Select_W
loss3 = loss_func3(More_0,target3)
```

So what does the gradient flow look like among the Variables Select_W, More_0, v, A, B, etc.? Are there no gradients for Select_W? Does a Variable created by a slicing operation have no gradients? I think this computational graph can actually be trained, but I am confused by the gradient flow.

@SimonW

@alexis-jacq

**alexis-jacq**(Alexis David Jacq) #17

You can check that `out.requires_grad` is `False`. @SimonW is right to point out that argmax (though here it's an indicator function rather than a true "argmax") is not differentiable. However, it's possible to define the derivative of the function as `f'(x)=0` for all `x`.
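For the masking question specifically, the gradient flow can be inspected with a small experiment. This is only a sketch under some assumptions: it uses a recent PyTorch where tensors replace `Variable`, random data in place of the real weights, a plain sum as a stand-in for the unspecified `loss_func3`, and the lowercase names `select_w` and `more_0` mirror the ones in the question.

```python
import torch

# Stand-ins for the tensors in the question (random values, hypothetical sizes).
W1 = torch.rand(100, 50, requires_grad=True)
v = torch.rand(50, requires_grad=True)

# Hard one-hot mask: built from a comparison, so it carries no autograd history.
select_w = (v == v.max()).float()

# Broadcasting over rows zeroes every column of W1 except the selected one.
more_0 = W1 * select_w

# A plain sum stands in for loss_func3(More_0, target3).
loss3 = more_0.sum()
loss3.backward()

print(select_w.requires_grad)  # False: the mask is treated as a constant
print(v.grad)                  # None: no gradient reaches v through the mask
print(W1.grad.sum(dim=0))      # nonzero only in the selected column
```

So the graph is trainable with respect to `W1`, while the hard mask behaves like a function whose derivative with respect to `v` is zero everywhere.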

For instance, `f(x,t) = Softmax(x*t) --> argmax(x)` when `t --> +inf` and, respectively, `f'(x,t) --> 0`.

One simple solution could be:

```
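import torch
import torch.nn.functional as F
v = torch.autograd.Variable(torch.Tensor([0.8,0.1,0.1,0]), requires_grad=True)
y = F.softmax(v*100, dim=0)  # dim given explicitly; close to [1,0,0,0] but still differentiable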
```
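To see what this buys, the sketch below backpropagates through the scaled softmax. It assumes a recent PyTorch where tensors replace `Variable`; the name `t` for the scale factor is introduced here for illustration, and the value 100 is taken from the snippet above.

```python
import torch
import torch.nn.functional as F

v = torch.tensor([0.8, 0.1, 0.1, 0.0], requires_grad=True)
t = 100.0  # scale factor; larger values give a sharper, more one-hot output

y = F.softmax(v * t, dim=0)  # close to [1, 0, 0, 0], but still part of the graph
y[0].backward()              # backpropagate through the largest entry

print(y)
print(y.requires_grad)  # True: unlike the hard indicator, y is differentiable
print(v.grad)           # defined, but vanishingly small at this scale
```

The gradient exists, yet it shrinks toward zero as `t` grows, which matches the limit `f'(x,t) --> 0` stated above. A smaller `t` trades one-hot sharpness for a more useful gradient.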