If I have:
self.layer1 = torch.nn.Conv1d(in_channels=512, out_channels=512, kernel_size=1)
isn’t that equivalent to
self.layer1 = torch.nn.Linear(512, 512)
?
Yes, that should be the case, as long as the linear layer is applied along the channel dimension:
import torch
from torch import nn

# Setup
conv = torch.nn.Conv1d(in_channels=512, out_channels=512, kernel_size=1).double()
lin = torch.nn.Linear(512, 512).double()
# use same param values: conv.weight is (512, 512, 1), lin.weight is (512, 512)
with torch.no_grad():
    lin.weight = nn.Parameter(conv.weight.squeeze(2))
    lin.bias = nn.Parameter(conv.bias)
# forward
x = torch.randn(2, 512, 20).double()
out_conv = conv(x)
# permute for linear
x_lin = x.permute(0, 2, 1)
out_lin = lin(x_lin)
# check forward output
print(torch.allclose(out_lin.permute(0, 2, 1), out_conv))
> True
print((out_lin.permute(0, 2, 1) - out_conv).abs().max())
> tensor(1.2212e-15, dtype=torch.float64, grad_fn=<MaxBackward1>)
# check backward
out_conv.mean().backward()
out_lin.mean().backward()
print(torch.allclose(conv.weight.grad.squeeze(2), lin.weight.grad))
> True
print(torch.allclose(conv.bias.grad, lin.bias.grad))
> True
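To make the reason for the equivalence explicit, here is a minimal sketch (not part of the snippet above) that writes the kernel-size-1 convolution out as a plain matrix multiply over the channel dimension. The small layer sizes and the names `w`, `manual`, and `matches` are assumptions chosen for illustration only.

```python
import torch

torch.manual_seed(0)

# Small sizes picked for illustration; any in/out channel counts work.
conv = torch.nn.Conv1d(in_channels=8, out_channels=4, kernel_size=1).double()
x = torch.randn(2, 8, 5).double()  # (batch, channels, length)

# With kernel_size=1 the weight has shape (out, in, 1), so it is just a matrix.
w = conv.weight.squeeze(2)  # (out_channels, in_channels)

# out[b, o, l] = sum_i w[o, i] * x[b, i, l] + bias[o]
manual = torch.einsum("oi,bil->bol", w, x) + conv.bias[None, :, None]

matches = torch.allclose(manual, conv(x))
print(matches)
```

Each position along the length dimension is transformed independently by the same matrix, which is what a linear layer does to the last dimension after the permute.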
Thanks so much. So there’s literally no difference, not even in terms of computation?
There is most likely a difference in the computation, in particular if you are using CUDA operations. E.g., on an NVIDIA GPU convolutions are dispatched to cuDNN, which could internally call into cuBLAS (the same library the linear layer uses), but this isn’t guaranteed.
I don’t know exactly which methods are called on the CPU.
For my code snippet the convolution would use cudnn::cnn::implicit_convolve_dgemm, while the linear layer would call into volta_dgemm_128x64_tn.
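If you want to check which operators and kernels are dispatched on your own setup, one option is the PyTorch profiler. This is a rough sketch, assuming `torch.profiler` is available in your install; the names `device`, `activities`, and `op_names` are mine, and the kernel names you see will depend on your hardware and library versions.

```python
import torch
from torch.profiler import ProfilerActivity, profile

device = "cuda" if torch.cuda.is_available() else "cpu"

# Always record CPU-side operators; add CUDA kernels only if a GPU is present.
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

conv = torch.nn.Conv1d(in_channels=512, out_channels=512, kernel_size=1).double().to(device)
lin = torch.nn.Linear(512, 512).double().to(device)
x = torch.randn(2, 512, 20, dtype=torch.float64, device=device)

with torch.no_grad(), profile(activities=activities) as prof:
    conv(x)
    lin(x.permute(0, 2, 1))

# Collect the recorded operator/kernel names and print a summary table.
op_names = {evt.key for evt in prof.key_averages()}
print(prof.key_averages().table(row_limit=20))
```

On a GPU the table should also list the underlying CUDA kernels, so you can compare them with the two names mentioned above.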