I know that this topic has been discussed many times but among all the past posts I couldn’t find an exhaustive explanation.

I am trying to translate this TensorFlow code to PyTorch, where the model is a logistic regression.

```
for X, Y in train_data:
    with tf.GradientTape() as tape:
        X = X / 255.0
        y_hat = self.logistic_regression(X, W, b)
        one_hot = tf.one_hot(Y, 43)
        loss = self.func.cross_entropy(y_hat, one_hot)
    losses.append(tf.math.reduce_mean(loss))
    grads = tape.gradient(loss, [W, b])
    self.sgd([W, b], grads, lr, X.shape[0])
```

I know that the easiest way would be to create a class inheriting from nn.Module and use loss.backward() and optim.step(). Nevertheless, since I wanted to use torch.autograd.grad, I ended up with this PyTorch script:

```
for X, Y in train_data:
    X = X / 255
    y_hat = self.logistic_regression(X, W, b)
    one_hot = torch.nn.functional.one_hot(Y, 43).bool()
    loss = self.func.cross_entropy(y_hat, one_hot)
    losses.append(torch.mean(loss).item())
    grads = torch.autograd.grad(loss, [W, b], grad_outputs=torch.ones_like(loss))
    self.sgd([W, b], grads, lr, X.shape[0])
```

I noticed that tape.gradient() in TF accepts a multidimensional target (loss), while torch.autograd.grad by default expects a scalar. As far as I understand, this difference can be overcome by passing grad_outputs=torch.ones_like(loss) to torch.autograd.grad.
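To sanity-check that assumption, here is a minimal sketch on toy tensors. The names `w` and `x` and the squared "loss" are made up for illustration and are not the W, b, or cross-entropy from my script. It compares the gradient obtained with grad_outputs set to ones against the gradient of the summed loss:

```python
import torch

# Toy tensors (hypothetical; not the real W, b or data from my script).
torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
x = torch.randn(4, 3)

# Per-sample, non-scalar "loss" of shape (4,).
loss = (x @ w) ** 2

# Gradient with grad_outputs set to ones: a vector-Jacobian product
# with a vector of ones. retain_graph=True keeps the graph alive so it
# can be differentiated a second time below.
(g_ones,) = torch.autograd.grad(
    loss, [w], grad_outputs=torch.ones_like(loss), retain_graph=True
)

# Gradient of the summed loss, which is a scalar target.
(g_sum,) = torch.autograd.grad(loss.sum(), [w])

# Compare the two gradients.
same = torch.allclose(g_ones, g_sum)
print(same)
```

If the two gradients match, then grad_outputs=torch.ones_like(loss) gives the gradient of the sum of the per-sample losses, which, as far as I understand, is also what tape.gradient() computes for a non-scalar target. In that case the grad_outputs argument by itself would not explain the different behaviour.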

The problem, however, is that even though the two scripts pasted above should be equivalent, the results are very different. The TF one converges rapidly while the PyTorch one doesn't: over 10 epochs the training loss goes from 3.97 to 1.32 in TF, whereas in PyTorch it starts at 3.76 and only decreases to 3.74.

Do you know why?

Thanks a lot