How is a Conv1d with groups=1 different from a Linear layer?

If I have:

self.layer1 = torch.nn.Conv1d(in_channels=512, out_channels=512, kernel_size=1)

isn’t that equivalent to

self.layer1 = torch.nn.Linear(512, 512)

?

Yes, that should be the case:

import torch
import torch.nn as nn

# Setup
conv = nn.Conv1d(in_channels=512, out_channels=512, kernel_size=1).double()
lin = nn.Linear(512, 512).double()

# use the same parameter values
# (the conv weight has shape [512, 512, 1], the linear weight [512, 512])
with torch.no_grad():
    lin.weight = nn.Parameter(conv.weight.squeeze(2))
    lin.bias = nn.Parameter(conv.bias)

# forward
x = torch.randn(2, 512, 20).double()
out_conv = conv(x)

# permute to (batch, length, channels) for the linear layer
x_lin = x.permute(0, 2, 1)
out_lin = lin(x_lin)

# check forward output
print(torch.allclose(out_lin.permute(0, 2, 1), out_conv))
> True

print((out_lin.permute(0, 2, 1) - out_conv).abs().max())
> tensor(1.2212e-15, dtype=torch.float64, grad_fn=<MaxBackward1>)

# check backward
out_conv.mean().backward()
out_lin.mean().backward()
print(torch.allclose(conv.weight.grad.squeeze(2), lin.weight.grad))
> True
print(torch.allclose(conv.bias.grad, lin.bias.grad))
> True
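
As a complementary sketch (not part of the original snippet), the functional API makes the relationship explicit: the Conv1d weight is simply the Linear weight with a trailing kernel dimension of size 1, and both layers apply the same matrix multiply at every position along the length dimension. The tensor names below are only for illustration:

import torch
import torch.nn.functional as F

w = torch.randn(512, 512, dtype=torch.float64)    # (out_features, in_features)
b = torch.randn(512, dtype=torch.float64)
x = torch.randn(2, 512, 20, dtype=torch.float64)  # (batch, channels, length)

# conv weight: (out_channels, in_channels, kernel_size=1)
out_conv = F.conv1d(x, w.unsqueeze(2), b)
# linear expects (..., in_features), so move channels last and back again
out_lin = F.linear(x.transpose(1, 2), w, b).transpose(1, 2)

print(torch.allclose(out_conv, out_lin))
> True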

Thanks so much. So there’s literally no difference, not even in terms of computation?

There is most likely a difference in the actual computation, in particular if you are using CUDA operations. E.g. on an NVIDIA GPU, convolutions would be dispatched to cuDNN, which could internally call into cuBLAS (as the linear layer does), but that isn't guaranteed.
I don't know exactly which methods are called on the CPU.

For my code snippet the convolution would use cudnn::cnn::implicit_convolve_dgemm, while the linear layer would call into volta_dgemm_128x64_tn.
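
If you want to check this on your own setup, one option (a sketch, assuming a recent PyTorch version and an NVIDIA GPU) is to profile both calls and look at which CUDA kernels are launched; the exact kernel names will vary with the GPU, CUDA, and cuDNN versions:

import torch

conv = torch.nn.Conv1d(512, 512, kernel_size=1).cuda()
lin = torch.nn.Linear(512, 512).cuda()
x = torch.randn(2, 512, 20, device='cuda')

with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CUDA]) as prof:
    conv(x)
    lin(x.permute(0, 2, 1))

# the table lists the CUDA kernels that were actually launched
print(prof.key_averages().table(sort_by="cuda_time_total"))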
