What are alternative implementations for tensor norm and interpolation?

I am using our internal compiler to convert an existing PyTorch model. However, some operations are not supported on our side, so we need alternative implementations of the same operations.

For example, the original PyTorch implementation of one layer looks like this:

x = x / x.norm(dim=-1, keepdim=True)

I changed it to:

x = x / torch.sqrt(torch.sum(x**2, dim=-1, keepdim=True))
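
As a quick sanity check (just a sketch; the input shape here is made up for testing), the two expressions agree on random inputs:

import torch

x = torch.randn(4, 8, 512)  # arbitrary test shape
a = x / x.norm(dim=-1, keepdim=True)
b = x / torch.sqrt(torch.sum(x**2, dim=-1, keepdim=True))
print(torch.allclose(a, b, atol=1e-6))  # should print True for well-scaled inputs like these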

The original implementation of the final layer (it is a segmentation network similar to U-Net) is like this:

x = F.interpolate(x, size=(256, 256), mode='bilinear')

and I changed it into multiple smaller interpolation steps, stacked like this:

x = F.interpolate(x, size=(64, 64), mode='bilinear')
x = F.interpolate(x, size=(128, 128), mode='bilinear')
x = F.interpolate(x, size=(256, 256), mode='bilinear')
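
One thing I noticed while checking: the staged version is not numerically identical to a single resize (quick sketch, the input size is made up):

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)  # made-up feature map size, just for comparison
direct = F.interpolate(x, size=(256, 256), mode='bilinear')
staged = F.interpolate(x, size=(64, 64), mode='bilinear')
staged = F.interpolate(staged, size=(128, 128), mode='bilinear')
staged = F.interpolate(staged, size=(256, 256), mode='bilinear')
print((direct - staged).abs().max())  # typically non-zero, so the two are not equivalent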

This kind of worked: training runs and I get decent results. However, the new results seem slightly worse than before, and I am not sure whether that is due to the above modifications.
Could someone please comment on whether the above modifications are correct? Thanks.

I’ll be very blunt, sorry about this. :slight_smile: Take all of it with a lot of salt because you’ll know your setup and challenges much better than I will.
That said, and in all honesty, I think you should ask yourself why you need these modifications. We all know that training networks can be finicky but at some point it stops being a sound engineering exercise. At the very least you should know what the exact difference in computation (both forward and backward) is. Sometimes there are differences and then one makes more sense than the other. But really, to evaluate the performance impact, you should probably do a stochastic trial (i.e. run both options with 10 or 50 or so different initializations and do the statistics to see if the difference you are seeing can be attributed to noise).
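
For the statistics part, I mean something along these lines (purely a sketch; train_and_eval here is a placeholder for your own training plus validation run returning a metric):

import statistics

# train_and_eval is hypothetical: one full training run for the given model
# variant and random seed, returning e.g. the validation score
scores_orig = [train_and_eval(variant="original", seed=s) for s in range(10)]
scores_mod = [train_and_eval(variant="modified", seed=s) for s in range(10)]

for name, scores in (("original", scores_orig), ("modified", scores_mod)):
    print(name, statistics.mean(scores), statistics.stdev(scores))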
The first change might have a stabilization term that behaves differently for norms very close to 0. The second one I don’t have that much intuition about, but it seems very strange, too.
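
To get a feel for whether that matters in your case, you could probe both expressions with tiny inputs and compare the forward values and the gradients, something like this (rough sketch, numbers made up):

import torch

x1 = torch.full((1, 4), 1e-20, requires_grad=True)  # made-up input with a very small norm
x2 = x1.detach().clone().requires_grad_()

y1 = x1 / x1.norm(dim=-1, keepdim=True)
y2 = x2 / torch.sqrt(torch.sum(x2**2, dim=-1, keepdim=True))
y1.sum().backward()
y2.sum().backward()

print(y1, y2)            # forward values
print(x1.grad, x2.grad)  # gradients; any mismatch points to a real difference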

Best regards

Thomas