Hi,

I am trying to separate out the forward and backward passes of an `nn.Module` into plain tensor operations (I am working on a peculiar requirement here).

The interfaces look like this:

```
from typing import Any, Tuple

import torch
from torch import Tensor


class LinearForward:
    def __init__(self, in_features: int, out_features: int, bias=True, device=torch.device("cpu")):
        ...

    def __call__(self, input: Tensor) -> Tensor:
        # Save the input; the backward pass needs it for grad_weight
        self.input = input
        output = input.matmul(self.weight.t())
        if self.bias is not None:
            output += self.bias
        return output


class LinearBackward:
    def __init__(self, forward: LinearForward, device=torch.device("cpu")):
        ...

    def __call__(self, grad_output: Tensor) -> Tuple[Tensor, Tensor, Any]:
        grad_input = grad_output.matmul(self.weight)
        grad_weight = grad_output.t().matmul(self.forward.get_input())
        grad_bias = None
        if self.bias is not None:
            grad_bias = grad_output.sum(0)
        return grad_input, grad_weight, grad_bias
```

I have tested the implementation against `nn.Linear` and observed the following on CUDA devices:
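For reference, here is the kind of quick check I mean: the manual gradient formulas from `LinearBackward` can be compared against autograd on small random tensors (a minimal self-contained sketch; `x` stands in for the saved `self.forward.get_input()`):

```python
import torch

torch.manual_seed(0)
batch, in_features, out_features = 5, 4, 3
weight = torch.randn(out_features, in_features, requires_grad=True)
bias = torch.randn(out_features, requires_grad=True)
x = torch.randn(batch, in_features, requires_grad=True)

# Forward: same formula as LinearForward.__call__
out = x.matmul(weight.t()) + bias

# Let autograd produce reference gradients for an arbitrary upstream grad_output
grad_output = torch.randn_like(out)
out.backward(grad_output)

# Manual gradients: same formulas as LinearBackward.__call__
grad_input = grad_output.matmul(weight)
grad_weight = grad_output.t().matmul(x)
grad_bias = grad_output.sum(0)

assert torch.allclose(grad_input, x.grad, atol=1e-6)
assert torch.allclose(grad_weight, weight.grad, atol=1e-6)
assert torch.allclose(grad_bias, bias.grad, atol=1e-6)
```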

- my tensor implementation is about 10-15% slower than `nn.Linear`

- in both cases, almost all the time is spent in the backward pass, while the forward pass takes <1% of the training time (at least for the Linear layer)

What would be the reason for this? Is it because my forward and backward classes are written in Python, so each `__call__` is serialized by the Python GIL (like the forward and backward hook implementations in `nn.Module`)?
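One thing worth ruling out first: CUDA kernels launch asynchronously, so a timer without explicit synchronization can attribute almost all of the wall-clock time to whichever call happens to block (often the backward), which may explain the "<1% forward" observation. A minimal timing helper (the name `avg_time` is my own, not from any library):

```python
import time

import torch


def avg_time(fn, iters=100, warmup=10):
    """Average wall-clock seconds per call, with CUDA synchronization."""
    for _ in range(warmup):
        fn()
    # CUDA kernels run asynchronously: without a synchronize, the timer
    # measures only Python-side launch overhead, not actual kernel time.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters
```

Used as e.g. `avg_time(lambda: forward(x))` versus `avg_time(lambda: backward(grad_out))`, this gives comparable per-pass numbers on both CPU and GPU.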

Looking forward to hearing from the community.

PS: I have followed these threads previously.