Question about autograd and leaf tensors

I have always been confused about this behaviour.

Let us say I am implementing an autoencoder. My source and target are going to be the same, i.e. I want to reconstruct some vector x.

So something like
z = Encoder(x)
y = Decoder(z)

  1. loss = LOSS(y, x)

  2. loss = LOSS(y, x.clone())

  3. loss = LOSS(y, x.clone().detach())

What is the difference in behavior in each of these cases? x is of course a leaf tensor, but I don’t understand the gradient aspect: is case 1 going to form a cycle? (A minimal sketch of what I mean is below.)
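For concreteness, a minimal sketch of the setup I mean (the tiny linear Encoder/Decoder are just placeholders):

import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(8, 2)   # placeholder encoder
decoder = nn.Linear(2, 8)   # placeholder decoder

x = torch.randn(4, 8)       # batch of vectors to reconstruct
z = encoder(x)
y = decoder(z)

loss1 = F.mse_loss(y, x)                   # 1. target is x itself
loss2 = F.mse_loss(y, x.clone())           # 2. target is a clone of x
loss3 = F.mse_loss(y, x.clone().detach())  # 3. target detached from the graph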

It shouldn’t be a problem, as usually x won’t require gradients in an autoencoder setup.
The first two approaches should yield the same result, while the last one should be different.
Here is a small dummy code:

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, requires_grad=True)
param = nn.Parameter(torch.randn(1))
output = x * param

# 1. target is x itself (still part of the graph)
loss = F.mse_loss(output, x)
loss.backward()
print(param.grad)
print(x.grad)

x.grad.zero_()
param.grad.zero_()

# 2. target is a clone of x (the clone is still attached to the graph)
output = x * param
loss = F.mse_loss(output, x.clone())
loss.backward()
print(param.grad)
print(x.grad)

x.grad.zero_()
param.grad.zero_()

# 3. target is detached, so it is treated as a constant
output = x * param
loss = F.mse_loss(output, x.clone().detach())
loss.backward()
print(param.grad)
print(x.grad)

> tensor([-3.3104])
> tensor([0.6608])

> tensor([-3.3104])
> tensor([0.6608])

> tensor([-3.3104])
> tensor([-0.9747])
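The difference in x.grad comes from whether the target still depends on x. In this scalar example the loss is (x * param - target)**2, so in cases 1 and 2 (target is x) we get dL/dx = 2 * (x * param - x) * (param - 1), while in case 3 (detached target, treated as a constant) we get dL/dx = 2 * (x * param - x) * param. In all three cases dL/dparam = 2 * (x * param - x) * x, which is why param.grad is identical in every run. Here is a small sketch checking these hand-derived formulas against autograd (fresh random tensors, so the printed values will differ from the output above):

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, requires_grad=True)
param = nn.Parameter(torch.randn(1))
diff = (x * param - x).detach()  # value of (output - x), reused below

# Cases 1/2: the target is x, so the target branch also contributes to x.grad
F.mse_loss(x * param, x).backward()
print(torch.allclose(x.grad, 2 * diff * (param - 1)))  # True
print(torch.allclose(param.grad, 2 * diff * x))        # True

x.grad.zero_()
param.grad.zero_()

# Case 3: the detached target is a constant, so x.grad only flows through the output
F.mse_loss(x * param, x.clone().detach()).backward()
print(torch.allclose(x.grad, 2 * diff * param))        # True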

Thanks, this answers my question. But since x is a leaf variable, its grad value serves no purpose, right?

Usually you don’t need gradients in your input tensor.
However there are some use cases, e.g. in adversarial training, where these gradients can be used for small perturbation of the input.
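For example, a rough FGSM-style sketch (the model, data, and epsilon below are made up just for illustration) would use x.grad like this:

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 2)                    # hypothetical classifier
x = torch.randn(1, 10, requires_grad=True)  # input we want to perturb
target = torch.tensor([1])                  # hypothetical label

loss = F.cross_entropy(model(x), target)
loss.backward()

epsilon = 0.01                              # made-up step size
x_adv = x + epsilon * x.grad.sign()         # nudge the input in the direction that increases the loss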

It’s an old thread, but I’m asking because it’s similar to my situation!

Is the last case wrong?
I’ve only seen the first case.

I wouldn’t claim it’s wrong, as it depends on your use case.
