Hi folks,

I am reimplementing this paper https://arxiv.org/abs/1909.01311, a method that allows backward and forward unlocking (i.e. you can compute the "gradients" of a layer as soon as you have executed the forward pass of that layer). Here is a dummy implementation:
```python
from collections import OrderedDict
from math import sqrt

import torch
import torch.nn as nn


class Linear(nn.Linear):
    """Linear layer that passes the target y along with the activations."""

    def forward(self, x, y):
        return super().forward(x), y


class ErrorFeedback(nn.Module):
    def __init__(self):
        super().__init__()
        self.rm = None  # fixed random feedback matrix, created lazily

    def forward(self, x, y):
        if self.rm is None:
            # Random projection from the target space to this layer's width
            self.rm = torch.randn(y.shape[1], x.shape[1], device=x.device) / sqrt(x.shape[1])
        # Local backward: the target projected through the random matrix
        # acts as the "gradient" of this layer's output
        x.backward(torch.mm(y, self.rm))
        # Detach so the following layers build a fresh graph
        return x.detach(), y


class Sequentialy(nn.Sequential):
    def forward(self, x, y):
        for module in self:
            x, y = module(x, y)
        return x


model = Sequentialy(OrderedDict([
    ("lin1", Linear(input_features, 2000)), ("ef1", ErrorFeedback()),
    ("lin2", Linear(2000, 2000)), ("ef2", ErrorFeedback()),
    ("lin3", Linear(2000, 1000)), ("ef3", ErrorFeedback()),
    ("lin4", Linear(1000, n_classes)),
])).to(device)
```
My question is: how does this interact with GPU calls being asynchronous? Is the backward call inside each ErrorFeedback layer executed concurrently with the forward of the following layer? If not, is there a way to make that happen?
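For instance, I was wondering whether CUDA streams are the right tool here. Below is a hypothetical sketch of what I mean (`ErrorFeedbackStreamed` and its synchronization are my own guess, not tested):

```python
import torch
import torch.nn as nn
from math import sqrt


class ErrorFeedbackStreamed(nn.Module):
    def __init__(self):
        super().__init__()
        self.rm = None
        self.stream = torch.cuda.Stream()  # side stream for the local backward

    def forward(self, x, y):
        if self.rm is None:
            self.rm = torch.randn(y.shape[1], x.shape[1], device=x.device) / sqrt(x.shape[1])
        # The side stream must wait until x has actually been produced
        self.stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(self.stream):
            # Kernels queued here might overlap with the next layer's
            # forward, which keeps running on the default stream
            x.backward(torch.mm(y, self.rm))
        # Tell the caching allocator that x is also used on the side stream
        x.record_stream(self.stream)
        return x.detach(), y
```

If this is the right direction, I assume the default stream would also need to wait on each side stream before `optimizer.step()`, e.g. with `torch.cuda.current_stream().wait_stream(ef.stream)`, otherwise the step could read half-written `.grad` buffers. Is that correct, or am I missing something about how autograd handles streams?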