Gradient Tape in TF vs Autograd in PyTorch

I know this topic has been discussed many times, but among all the past posts I couldn't find an exhaustive explanation.

I am trying to translate this TensorFlow code to PyTorch, where the model is a logistic regression.

    for X, Y in train_data:
        with tf.GradientTape() as tape:
            X = X / 255.0
            y_hat = self.logistic_regression(X, W, b)
            one_hot = tf.one_hot(Y, 43)
            loss = self.func.cross_entropy(y_hat, one_hot)
            losses.append(tf.math.reduce_mean(loss))
            grads = tape.gradient(loss, [W, b])
            self.sgd([W, b], grads, lr, X.shape[0])

I know that the easiest way would be to create a class inheriting from nn.Module and use loss.backward() and optim.step() (a rough sketch of that route is included after the next code block); nevertheless, since I wanted to use torch.autograd.grad, I ended up with this PyTorch script:

    for X, Y in train_data:
        X = X / 255
        y_hat = self.logistic_regression(X, W, b)
        one_hot = torch.nn.functional.one_hot(Y, 43).bool()
        loss = self.func.cross_entropy(y_hat, one_hot)
        losses.append(torch.mean(loss).item())
        grads = torch.autograd.grad(loss, [W, b], grad_outputs=torch.ones_like(loss))
        self.sgd([W, b], grads, lr, X.shape[0])
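
For completeness, the standard route I mentioned above would look roughly like this (just a sketch, not the code I'm actually using; the input size of 3072 is illustrative):

    import torch
    import torch.nn as nn

    # sketch of the nn.Module / loss.backward() / optimizer.step() route
    model = nn.Linear(in_features=3072, out_features=43)   # input size is illustrative
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.CrossEntropyLoss()

    for X, Y in train_data:                  # same DataLoader as in the loop above
        X = X / 255.0
        logits = model(X.flatten(start_dim=1))
        loss = criterion(logits, Y)          # takes integer class labels, no one-hot needed
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()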

I noticed that tape.gradient() in TF accepts a non-scalar target (the per-sample loss) and effectively sums over it, while torch.autograd.grad by default expects a scalar. As far as I understood, this difference can be overcome by passing grad_outputs=torch.ones_like(loss) to torch.autograd.grad.
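
For instance, on a toy non-scalar output the two approaches give the same gradient (a minimal sketch to illustrate, not taken from my script):

    import torch

    # toy non-scalar "loss": one value per sample
    w = torch.randn(3, requires_grad=True)
    x = torch.randn(5, 3)
    loss = (x @ w) ** 2                       # shape (5,)

    # vector-Jacobian product with all-ones ...
    g_ones, = torch.autograd.grad(loss, w, grad_outputs=torch.ones_like(loss),
                                  retain_graph=True)
    # ... matches differentiating the summed loss
    g_sum, = torch.autograd.grad(loss.sum(), w)
    print(torch.allclose(g_ones, g_sum))      # True
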
The problem, however, is that even though the two scripts pasted above should be equivalent, the results are very different: the TF one converges rapidly while the PyTorch one doesn't. Over 10 epochs the training loss in TF goes from 3.97 down to 1.32, whereas in PyTorch it only goes from 3.76 to 3.74.

Do you know why?
Thanks a lot

Would you be able to share the script that you are working with?
The parts that you have shown look fine to me.

Moreover, instead of looking at the loss reduction empirically, you could initialize the parameters and inputs to the same values in both TensorFlow and PyTorch, and verify whether the loss and gradients match.
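
Something along these lines (just a sketch; the shapes are illustrative, and the forward/backward passes go through your own logistic_regression / cross_entropy helpers):

    import numpy as np
    import tensorflow as tf
    import torch

    # seed both frameworks from the same NumPy arrays (shapes are illustrative)
    rng = np.random.default_rng(0)
    W_init = rng.normal(scale=0.01, size=(3072, 43)).astype(np.float32)
    b_init = np.zeros(43, dtype=np.float32)
    X_init = rng.uniform(size=(8, 3072)).astype(np.float32)

    W_tf, b_tf = tf.Variable(W_init), tf.Variable(b_init)
    W_pt = torch.tensor(W_init, requires_grad=True)
    b_pt = torch.tensor(b_init, requires_grad=True)

    # run one forward/backward pass through each implementation on X_init,
    # then compare the losses and gradients element-wise, e.g.
    # np.allclose(grad_tf.numpy(), grad_pt.detach().numpy(), atol=1e-5)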

Hello, here is the script: TF_vs_PT/training.py at master · AlessandroMondin/TF_vs_PT · GitHub

In the project that you have shared, the data transformation in PyTorch was not equivalent to the one in TensorFlow. Specifically, the image data in TensorFlow was in the range [0, 1], while in PyTorch it was not.
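
A quick way to catch this kind of mismatch is to print the value range of one batch from each pipeline; something like this (a sketch, where tf_train_data and pt_train_data stand in for your two data loaders):

    import tensorflow as tf

    # value range of the first batch from each pipeline
    X_tf, _ = next(iter(tf_train_data))
    print(float(tf.reduce_min(X_tf)), float(tf.reduce_max(X_tf)))   # expect ~0.0 and ~1.0

    X_pt, _ = next(iter(pt_train_data))
    print(X_pt.min().item(), X_pt.max().item())                     # should match the TF range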

Thanks a lot, it works now and the results are consistent between the two libraries.