Questions about torch.matmul function

Hi, I ran into a problem with the torch.matmul function. The linear operation in a neural network, as defined in the functional module, was output = input.matmul(weight.t()). I tried playing around with this and got confused.

First, I tried to modify it to output = input.detach().matmul(weight.t().detach()), and to a second variant, output = ... Both gave wrong training results (i.e. the loss did not decrease).

I realized that the output generated in these ways has requires_grad set to False. However, even when I manually set it to True using output.requires_grad_(True) after the multiplication, training still produced the wrong result. Why did this happen?

Also, I noticed that the output computed by the modified multiplication is a leaf variable, while the output of the original version is not. Why is that?



This has nothing to do with matmul.
.detach() or .data are used to explicitly ask to break the computational graph, and thus prevent gradients from flowing. So if you add these in the middle of your network, no gradients will flow back and your network won’t be able to learn.
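To make this concrete, here is a minimal sketch of both cases from the question. The tensor shapes and the names inp, out and out_detached are made up for illustration; the point is only to show where the gradient ends up. Calling requires_grad_(True) on the detached result turns it into a new leaf with no history, so backward stops there and never reaches weight.

```python
import torch

torch.manual_seed(0)
inp = torch.randn(4, 3)                         # illustrative batch of inputs
weight = torch.randn(2, 3, requires_grad=True)  # illustrative parameter

# Original form: the result is recorded in the graph, so it is NOT a leaf
# and gradients flow back to weight.
out = inp.matmul(weight.t())
out_is_leaf = out.is_leaf
out.sum().backward()
grad_connected = weight.grad.clone()

weight.grad = None  # reset before the second experiment

# Detached form: the graph is cut before the multiplication.
out_detached = inp.detach().matmul(weight.t().detach())
detached_requires_grad = out_detached.requires_grad

# Turning requires_grad on afterwards makes out_detached a fresh leaf;
# it does not reconnect it to weight.
out_detached.requires_grad_(True)
detached_is_leaf = out_detached.is_leaf
out_detached.sum().backward()

print(out_is_leaf, detached_is_leaf)        # False True
print(out_detached.grad is not None)        # True: gradient stops here
print(weight.grad)                          # None: nothing reached weight
```

This is also why the modified output is a leaf: a tensor with no recorded history is a leaf by definition, whereas the result of a tracked operation is not.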

Thanks for your reply. In this case, if I want to use numpy to perform some computations in the middle of my network, is there a way to do it? I noticed that the only way to convert a tensor to numpy inside a network is to detach it first.

If you use non-pytorch operations (like numpy) then the autograd engine will not work.
If you really need these operations, you will have to create a custom autograd.Function for which you implement both the forward and the backward pass. Inside the Function, you will get Tensors that do not require gradient, so you can use numpy on them. But you will need to implement the backward pass yourself. You can read the doc here on how to do this.
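As a rough sketch of what such a Function can look like, the example below wraps numpy's sin. The class name NumpySin and the choice of sin are only for illustration; any numpy computation works the same way as long as you can write its derivative by hand. It assumes an input with at least one dimension.

```python
import numpy as np
import torch


class NumpySin(torch.autograd.Function):
    """Illustrative Function: forward and backward both computed with numpy."""

    @staticmethod
    def forward(ctx, x):
        # Keep the input around; backward needs it to compute cos(x).
        ctx.save_for_backward(x)
        x_np = x.detach().cpu().numpy()
        return torch.from_numpy(np.sin(x_np)).to(x.device)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Hand-written derivative: d/dx sin(x) = cos(x), chained with grad_output.
        cos_np = np.cos(x.detach().cpu().numpy())
        return grad_output * torch.from_numpy(cos_np).to(grad_output.device)


x = torch.randn(5, dtype=torch.double, requires_grad=True)
y = NumpySin.apply(x)   # call through .apply, not by instantiating the class
y.sum().backward()
print(x.grad)           # should match torch.cos(x)
```

A hand-written backward is easy to get wrong, so it is worth comparing it against numerical gradients with torch.autograd.gradcheck on double-precision inputs.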

Problem solved! Thanks! Just to make sure: if I want to define new operations using either PyTorch or numpy, I need to work at the autograd.Function level; when I want to add features to existing functions (like adding noise to the linear operation), working at the nn.Module level should be enough. Is that correct?

As long as you just want to change what your function outputs and you can implement everything with pytorch methods, then you can stick with nn.Module.
If it’s not implemented in pytorch or you want your backward pass not to represent the “true” derivative of your forward pass (to get wrong but smoother gradients for example), then you should work with autograd.Function.
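Here is a small sketch of both situations. NoisyLinear and RoundSTE are hypothetical names, and noise_std is an assumed parameter. The first class adds noise to a linear layer using only pytorch methods, so autograd handles the backward pass. The second uses a straight-through estimator for rounding, picked here only as one common example of a backward pass that is deliberately not the true derivative.

```python
import torch
import torch.nn as nn


class NoisyLinear(nn.Module):
    """Linear layer with additive Gaussian noise during training (illustrative)."""

    def __init__(self, in_features, out_features, noise_std=0.1):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.noise_std = noise_std  # assumed hyperparameter

    def forward(self, x):
        out = self.linear(x)
        if self.training:
            # Pure pytorch ops: autograd differentiates this automatically.
            out = out + self.noise_std * torch.randn_like(out)
        return out


class RoundSTE(torch.autograd.Function):
    """Rounding whose backward pretends the forward was the identity."""

    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The true derivative of round is zero almost everywhere;
        # pass the incoming gradient through unchanged instead.
        return grad_output


layer = NoisyLinear(3, 2)
x = torch.randn(4, 3)
layer(x).sum().backward()
print(layer.linear.weight.grad.shape)  # torch.Size([2, 3])

z = torch.tensor([0.2, 1.7], requires_grad=True)
RoundSTE.apply(z).sum().backward()
print(z.grad)                          # tensor([1., 1.])
```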