I am trying to separate out the forward and backward passes of an nn.Module as plain tensor operations (I am working on a peculiar requirement here).
The interfaces look like this:
```python
class LinearForward:
    def __init__(self, in_features: int, out_features: int, bias=True, device=torch.device("cpu")):
        ...

    def __call__(self, input: Tensor) -> Tensor:
        self.input = input
        output = input.matmul(self.weight.t())
        if self.bias is not None:
            output += self.bias
        return output


class LinearBackward:
    def __init__(self, forward: LinearForward, device=torch.device("cpu")):
        ...

    def __call__(self, grad_output: Tensor) -> Tuple[Tensor, Tensor, Any]:
        grad_input = grad_output.matmul(self.weight)
        grad_weight = grad_output.t().matmul(self.forward.get_input())
        grad_bias = None
        if self.bias is not None:
            grad_bias = grad_output.sum(0)
        return grad_input, grad_weight, grad_bias
```
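For anyone who wants to sanity-check the backward formulas above (`grad_input = grad_output @ W`, `grad_weight = grad_output.T @ input`, `grad_bias = grad_output.sum(0)`), here is a minimal pure-Python check against finite differences. The shapes and values are made up for illustration; it is a sketch of the math, not the torch implementation:

```python
# Verify the linear-layer backward formulas numerically, with no dependencies.
# output = X @ W^T + b, loss = sum(output), so grad_output is all ones.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def linear_forward(X, W, b):
    out = matmul(X, transpose(W))
    return [[v + b[j] for j, v in enumerate(row)] for row in out]

def loss(X, W, b):
    return sum(v for row in linear_forward(X, W, b) for v in row)

X = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]                 # batch=2, in_features=3
W = [[0.1, 0.2, 0.3], [-0.4, 0.5, 0.6],
     [0.7, -0.8, 0.9], [1.0, 1.1, -1.2]]                   # out_features=4
b = [0.05, -0.1, 0.15, 0.2]

G = [[1.0] * 4 for _ in range(2)]                          # grad_output (all ones)

grad_input = matmul(G, W)                                  # grad_output @ W
grad_weight = matmul(transpose(G), X)                      # grad_output^T @ input
grad_bias = [sum(col) for col in zip(*G)]                  # grad_output.sum(0)

# Finite-difference check on one input entry: dL/dX[0][0]
eps = 1e-6
X_plus = [row[:] for row in X]; X_plus[0][0] += eps
X_minus = [row[:] for row in X]; X_minus[0][0] -= eps
fd = (loss(X_plus, W, b) - loss(X_minus, W, b)) / (2 * eps)
assert abs(fd - grad_input[0][0]) < 1e-4
print("backward formulas match finite differences")
```

So the math itself is standard; the question is purely about the performance gap.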
I have tested this implementation against nn.Linear and observed the following on CUDA devices:
- my tensor implementation is about 10-15% slower than the nn.Linear version
- In both cases, almost all of the time is spent in the backward pass, while the forward pass takes <1% of the training time (at least for the Linear layer)
What could be the reason for this? Is it because my forward and backward classes are written in Python, so each `__call__` gets serialized under the Python GIL (like the forward and backward hook implementations in nn.Module)?
Looking forward to hearing from the community.
I have followed these threads previously.