Hi folks,

I am reimplementing this paper https://arxiv.org/abs/1909.01311, a method that allows backward and forward unlocking (i.e. you can compute the "gradients" of a layer as soon as you have executed the forward pass of that layer). Here is a dummy implementation:
```python
from collections import OrderedDict
from math import sqrt

import torch
import torch.nn as nn


class Linear(nn.Linear):
    """Linear layer that passes the target y along with the activations."""

    def forward(self, x, y):
        return super().forward(x), y


class ErrorFeedback(nn.Module):
    def __init__(self):
        super().__init__()
        self.rm = None  # fixed random feedback matrix, created lazily

    def forward(self, x, y):
        if self.rm is None:
            # Random projection from the target space to this layer's width
            self.rm = torch.randn(y.shape[1], x.shape[1], device=x.device) / sqrt(x.shape[1])
        # Local backward: the target projected through the random matrix
        # acts as the "gradient" of this layer's output
        x.backward(torch.mm(y, self.rm))
        # Detach so the following layers build a fresh graph
        return x.detach(), y


class Sequentialy(nn.Sequential):
    def forward(self, x, y):
        for module in self:
            x, y = module(x, y)
        return x


model = Sequentialy(OrderedDict([
    ("lin1", Linear(input_features, 2000)), ("ef1", ErrorFeedback()),
    ("lin2", Linear(2000, 2000)), ("ef2", ErrorFeedback()),
    ("lin3", Linear(2000, 1000)), ("ef3", ErrorFeedback()),
    ("lin4", Linear(1000, n_classes)),
])).to(device)
```
My question is: how does this interact with GPU calls being asynchronous? Is the backward call inside each ErrorFeedback layer executed concurrently with the forward of the following layer? If not, is there a way to make that happen?
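For instance, I was wondering whether CUDA streams are the right tool here. Below is a hypothetical sketch of what I mean (`ErrorFeedbackStreamed` and its synchronization are my own guess, not tested):

```python
import torch
import torch.nn as nn
from math import sqrt


class ErrorFeedbackStreamed(nn.Module):
    def __init__(self):
        super().__init__()
        self.rm = None
        self.stream = torch.cuda.Stream()  # side stream for the local backward

    def forward(self, x, y):
        if self.rm is None:
            self.rm = torch.randn(y.shape[1], x.shape[1], device=x.device) / sqrt(x.shape[1])
        # The side stream must wait until x has actually been produced
        self.stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(self.stream):
            # Kernels queued here might overlap with the next layer's
            # forward, which keeps running on the default stream
            x.backward(torch.mm(y, self.rm))
        # Tell the caching allocator that x is also used on the side stream
        x.record_stream(self.stream)
        return x.detach(), y
```

If this is the right direction, I assume the default stream would also need to wait on each side stream before `optimizer.step()`, e.g. with `torch.cuda.current_stream().wait_stream(ef.stream)`, otherwise the step could read half-written `.grad` buffers. Is that correct, or am I missing something about how autograd handles streams?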