Unfold + Linear + Fold != Conv2d

Hello,

This has come up both on the forums and in PyTorch issues before (this one is still open). Nevertheless, I find that when training a “simulated” Conv2d built from Unfold → Transpose → Linear → Transpose → Fold, the gradients differ from those of the “equivalent” Conv2d.

For a reproducible example, I provide a “simulated” VGGxx implementation and loss curves from the corresponding runs (all hyperparameters being equal).

To get the simulated VGG, copy the contents of the torchvision implementation and modify make_layers as follows:

from typing import List, Union, cast

import torch.nn as nn


def make_layers(cfg: List[Union[str, int]], batch_norm: bool = False) -> nn.Sequential:
    # batch_norm handling omitted for brevity
    layers: List[nn.Module] = []
    in_channels = 3
    pooling_depth: int = 0  # number of 2x2 max-pools applied so far
    for v in cfg:
        if v == "M":
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
            pooling_depth += 1
        else:
            v = cast(int, v)
            # spatial size at this depth, assuming 224x224 inputs
            w, h = 224 // (2**pooling_depth), 224 // (2**pooling_depth)
            # (N, C, H, W) -> (N, C*3*3, H*W): extract 3x3 patches
            layers += [nn.Unfold(kernel_size=3, padding=1, stride=1)]
            # (N, C*3*3, H*W) -> (N, H*W, C*3*3)
            layers += [Transpose(-1, -2)]
            # per-patch linear map, equivalent to a 3x3 convolution
            layers += [nn.Linear(in_channels * 3 * 3, v, bias=True)]
            layers += [nn.ReLU(inplace=True)]
            # (N, H*W, v) -> (N, v, H*W)
            layers += [Transpose(-1, -2)]
            # layers += [Reshape(-1, v, w, h)]
            # with kernel_size=1, Fold is just a reshape back to (N, v, H, W)
            layers += [nn.Fold(output_size=(w, h), kernel_size=1, stride=1)]

            in_channels = v
    return nn.Sequential(*layers)
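
Transpose (and the commented-out Reshape) are not torchvision modules; here is a minimal sketch of what they are assumed to look like, simply wrapping the corresponding tensor operations:

import torch
import torch.nn as nn

class Transpose(nn.Module):
    """Swaps two dimensions of the input tensor (assumed helper, not from torchvision)."""
    def __init__(self, dim0: int, dim1: int):
        super().__init__()
        self.dim0, self.dim1 = dim0, dim1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x.transpose(self.dim0, self.dim1)

class Reshape(nn.Module):
    """Reshapes the input tensor to the given shape (assumed helper, not from torchvision)."""
    def __init__(self, *shape: int):
        super().__init__()
        self.shape = shape

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x.reshape(self.shape)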

The training harness is the one from torchvision’s references/classification, and I follow the recipe given in its README.

Here are the losses for the first few epochs of the reference VGG11 (blue) and the simulated one (yellow):

Note that I also tried training both in double precision and it didn’t make a difference.

Am I missing something, or is there genuinely a bug in the Unfold/Fold backward pass?

This post, as well as the docs, shows a code snippet demonstrating that a manual unfold approach and the corresponding conv layer produce the same results. Did you check each layer separately, making sure the outputs are equal up to floating-point precision?
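
For reference, the check in the docs is along these lines (a paraphrase of the nn.Unfold example, not a verbatim copy):

import torch
import torch.nn.functional as F

inp = torch.randn(1, 3, 10, 12)
w = torch.randn(2, 3, 4, 5)

# manual convolution: unfold patches, matmul with the flattened kernel, fold back
inp_unf = F.unfold(inp, (4, 5))                    # (1, 3*4*5, 7*8)
out_unf = inp_unf.transpose(1, 2).matmul(w.view(w.size(0), -1).t()).transpose(1, 2)
out = F.fold(out_unf, (7, 8), (1, 1))              # (1, 2, 7, 8)

# forward outputs match up to floating-point error
print((F.conv2d(inp, w) - out).abs().max())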

Yes, on the forward pass the outputs of the two approaches agree. It is in the backward pass that they seem to diverge, and the snippet in the issue I referenced shows that whether the gradients agree is somewhat random, though it improves with increased float precision.
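
A backward-pass check along those lines might look like this (a sketch, not the snippet from the issue): start both paths from the same weights, backprop the same loss, and compare the weight gradients.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
in_channels, out_channels, hw = 3, 8, 16

x = torch.randn(2, in_channels, hw, hw, dtype=torch.double)
weight = torch.randn(out_channels, in_channels, 3, 3, dtype=torch.double)

# path A: plain convolution
w_conv = weight.clone().requires_grad_(True)
out_conv = F.conv2d(x, w_conv, padding=1)

# path B: unfold + linear (as a matmul) + fold, with the same weights
w_lin = weight.view(out_channels, -1).clone().requires_grad_(True)
x_unf = F.unfold(x, kernel_size=3, padding=1)      # (N, C*9, H*W)
out_unf = w_lin @ x_unf                            # (N, out_channels, H*W)
out_fold = F.fold(out_unf, (hw, hw), kernel_size=1)

# forward: should be ~0
print((out_conv - out_fold).abs().max())

# backward: same loss through both paths, then compare weight gradients
out_conv.sum().backward()
out_fold.sum().backward()
print((w_conv.grad.view(out_channels, -1) - w_lin.grad).abs().max())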

I seem to be experiencing the same issue when modifying the reference implementations of ResNet and AlexNet in the same way.

Moreover, if I use Conv2d for a few layers and Unfold+Linear for the others, the model begins to converge, but much more slowly than the all-Conv2d version.

Could you post a standalone layer showing the differences in the backward pass, please?