Is the type conversion differentiable?

I am using a mask based on comparison:

def combine_images(target, x, y):
    diff1 = torch.abs(target - x)
    diff2 = torch.abs(target - y)
    mask = (diff1 < diff2).float()
    target_new = x * mask + y * (1 - mask)
    return target_new

My question: is this function differentiable, given that it includes the type conversion .float()? Thanks.


I just tried it with random tensors, and it works.
Of all these tensors, only mask has requires_grad=False.
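
A minimal sketch of that check, assuming small random 1-D tensors in place of real images. The helper name combine_images_debug is hypothetical; it is the function from the question, changed only to also return the mask so it can be inspected:

```python
import torch

def combine_images_debug(target, x, y):
    # Same logic as combine_images above, but also returns the mask (illustrative variant).
    diff1 = torch.abs(target - x)
    diff2 = torch.abs(target - y)
    mask = (diff1 < diff2).float()  # the comparison yields a bool tensor, so no graph is recorded
    return x * mask + y * (1 - mask), mask

torch.manual_seed(0)
target = torch.randn(4, requires_grad=True)
x = torch.randn(4, requires_grad=True)
y = torch.randn(4, requires_grad=True)

out, mask = combine_images_debug(target, x, y)
out.sum().backward()

print(mask.requires_grad)  # the mask is treated as a constant
print(x.grad)              # equals mask
print(y.grad)              # equals 1 - mask
print(target.grad)         # target only enters through the comparison, so it gets no gradient
```

So gradients reach x and y through the multiplications, while nothing flows back through the comparison itself.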

Thanks. I have tried with

import torch

x = torch.tensor([0.0, 1.0], requires_grad=True)
y = torch.tensor([0.5, 0.5], requires_grad=True)
mask = (x < y).float()
z = (x * mask + y * (1 - mask) * 10).sum()
z.backward()  # populates x.grad and y.grad

In [18]: x.grad
Out[18]: tensor([1., 0.])

In [19]: y.grad
Out[19]: tensor([ 0., 10.])

It seems to work, and the result looks reasonable for this trivial case. I am not sure whether it carries over to the general case.
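
One way to gain some confidence beyond the trivial case is to compare the mask-multiplication form with torch.where, which selects elementwise from x or y. The following is only a sketch on random data, not a proof; the tensor sizes and the seed are arbitrary assumptions:

```python
import torch

torch.manual_seed(0)
target = torch.randn(5)
x = torch.randn(5, requires_grad=True)
y = torch.randn(5, requires_grad=True)

# Boolean condition; it is not part of the autograd graph.
cond = torch.abs(target - x) < torch.abs(target - y)

# Mask-multiplication form, as in the question.
mask = cond.float()
out_mask = x * mask + y * (1 - mask)
gx_mask, gy_mask = torch.autograd.grad(out_mask.sum(), (x, y))

# Selection with torch.where.
out_where = torch.where(cond, x, y)
gx_where, gy_where = torch.autograd.grad(out_where.sum(), (x, y))

print(torch.allclose(out_mask, out_where))
print(torch.equal(gx_mask, gx_where), torch.equal(gy_mask, gy_where))
```

In both forms each output element receives its gradient from whichever input was selected, and the other input gets zero at that position.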

Type conversions are “differentiable” as can be seen in this dummy mixed-precision example:

import torch

# Requires a CUDA device.
x = torch.randn(1, 10, dtype=torch.float16, device='cuda')
w1 = torch.randn(10, 1, requires_grad=True, dtype=torch.float16, device='cuda')
w2 = torch.randn(1, 1, requires_grad=True, dtype=torch.float32, device='cuda')

output = torch.matmul(x, w1)
output = output.float()  # float16 -> float32 conversion inside the graph
output = torch.matmul(output, w2)

loss = (output - torch.randn(1, 1, dtype=torch.float32, device='cuda'))**2
loss.backward()
print(w1.grad.dtype, w2.grad.dtype)  # each gradient has the dtype of its parameter


As you can see, I start with a float16 input and float16 parameters and convert the intermediate result to float32 later. This works fine, because Autograd casts the gradients back to the appropriate dtype.
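
For readers without a GPU, the same behaviour can be sketched on the CPU with a float32 to float64 conversion. This is an illustrative variant under that assumption, not the mixed-precision setup itself:

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, dtype=torch.float32, requires_grad=True)

out = w.double()  # float32 -> float64 conversion, recorded by Autograd
print(out.grad_fn is not None)

(out ** 2).sum().backward()
print(w.grad.dtype)  # the gradient comes back in the dtype of w
```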


For a conversion to long, the result does not seem to have a grad_fn. Is there any way to still use Autograd when the type conversion is to long?

No, integral tensors cannot have a gradient; only floating-point tensors can.
The underlying reason is that integral-valued functions are not differentiable in any mathematically meaningful way (i.e. they might be locally constant, with zero derivative there, but that's it).
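
A small sketch of both points, using arbitrary example values: the conversion to long detaches the result from the graph, and torch.round stays in the graph but contributes a zero gradient.

```python
import torch

x = torch.tensor([0.4, 1.6], requires_grad=True)

# Conversion to an integer dtype: the result is not tracked by Autograd.
as_long = x.long()
print(as_long.requires_grad, as_long.grad_fn)

# Rounding keeps the float dtype and a grad_fn, but is locally constant.
rounded = torch.round(x)
rounded.sum().backward()
print(x.grad)  # zeros
```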


Right, I see! Thanks for the clarification.

Slightly off-topic question then: inside a training loss, I need to access the values of a tensor [y_true] by indices. The other tensor [y_pred], which holds the indices, is of type float and has fractional values. Since I need to compute the gradient, is there any way to access the values of y_true without rounding y_pred (which I would like to avoid, since its gradient is zero almost everywhere) and then converting it to long? Please note that it is not possible to do any interpolation in y_true in this context. A minimal example of this loss function is as follows:

import torch

def trainloss(y_pred, y_true):
    # y_pred has shape [100, 2]; y_true has shape [64, 64]
    idx = torch.round(y_pred)  # gradient of round is zero almost everywhere
    idx = idx.long()           # integer tensors carry no gradient
    loss = y_true[idx[:, 0], idx[:, 1]]
    loss = torch.max(loss)
    return loss

We should think about what the derivative ought to be mathematically before looking at the implementation. How should a change in y_pred affect loss? You write that no interpolation is possible. But suppose an update to y_pred causes the rounding to decrement idx by one from one step to the next: you then end up with y_true[old_idx[:, 0] - 1, old_idx[:, 1]]. If that value is smaller than the current one, this would suggest that y_pred[:, 0] should have a positive gradient. (Leaving aside the batch thing…)

One variant that avoids having to differentiate through y_pred is the REINFORCE algorithm from RL. Essentially, when y_pred is a sample from a probability distribution, you can still say that the probability of y_pred should go up if the loss is small (“success”) or down if it is large (“failure”).
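
A rough sketch of that idea follows. Everything about the parameterization is an assumption made for illustration: the model is taken to output logits over all 64 * 64 cells for each of the 100 points, an index is sampled per point, and the per-point value of y_true is used as the loss signal (the original loss takes the max over points, which is simplified away here). The mean baseline is a common variance-reduction choice, not something from the thread.

```python
import torch

torch.manual_seed(0)
y_true = torch.rand(64, 64)

# Hypothetical model output: one categorical distribution over all cells per point.
logits = torch.zeros(100, 64 * 64, requires_grad=True)
dist = torch.distributions.Categorical(logits=logits)

# Sampling is not differentiable; the indices are plain long tensors.
flat_idx = dist.sample()                                  # shape [100]
rows = torch.div(flat_idx, 64, rounding_mode='floor')
cols = flat_idx % 64
loss_per_point = y_true[rows, cols]                       # carries no gradient

# Score-function (REINFORCE) surrogate: minimizing it lowers the log-probability
# of samples whose loss is above the baseline and raises it for those below.
baseline = loss_per_point.mean()
surrogate = ((loss_per_point - baseline) * dist.log_prob(flat_idx)).mean()
surrogate.backward()

print(logits.grad.shape)
```

The gradient reaches logits only through log_prob, so no rounding or integer conversion has to be differentiated.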

Thanks for the reply. Sorry, I should have been more precise. By saying 'it is not possible to do any interpolation in y_true', I meant that it is not possible to generate values of y_true at fractional indices by interpolation, such as y_true[3.5, 4.5]. That is why the rounding is needed. So in a nutshell, I was wondering how to compute the loss with a list (y_pred) of fractional indices. But I am guessing this is drifting away from the topic of this thread, so I might make a separate post.