Hi,
I've been studying the code of the PyTorch transformer module a bit.
There, the class TransformerEncoder is a stack of N TransformerEncoderLayer modules. So shouldn't TransformerEncoder(encoder_layer, num_layers=1) with a single encoder layer give the same output as calling encoder_layer directly?
I would therefore expect both of the following to produce the same output:
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=6, nhead=2, batch_first=True)
src = torch.rand(2, 5, 6)
out = encoder_layer(src)
print(out, out.shape)
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)
out = transformer_encoder(src)
print(out, out.shape)
The two outputs, however, always differ noticeably:
tensor([[[ 1.0687, -0.0628, -0.3718, -0.1207, 1.2548, -1.7682],
[ 0.0136, 0.2716, -0.6997, -0.2500, 1.9374, -1.2729],
[-0.6875, -0.3608, 0.7908, -0.0981, 1.7106, -1.3550],
[-1.6746, -0.1693, 0.2136, 0.8932, 1.3815, -0.6443],
[ 0.3343, -1.9192, 0.0965, 1.1378, 0.8232, -0.4726]],
[[-1.0233, 0.4987, 0.9853, 0.2254, 0.9698, -1.6559],
[-0.9034, -0.5267, 1.3985, 0.5957, 0.8218, -1.3858],
[ 0.9843, -1.5308, 0.2510, -0.5477, 1.4139, -0.5707],
[-0.1907, -0.0046, -1.3699, 1.2456, 1.2743, -0.9548],
[-0.6701, -0.6789, 0.4082, 0.9170, 1.4406, -1.4168]]],
grad_fn=<NativeLayerNormBackward0>) torch.Size([2, 5, 6])
tensor([[[-0.0860, -1.5762, -0.2599, 0.2899, 1.8222, -0.1900],
[-0.1013, -0.0221, -0.6439, -0.0912, 2.0414, -1.1828],
[-0.4802, -0.2750, 0.6492, -0.1569, 1.7459, -1.4830],
[-1.6553, -0.2178, 0.4153, 0.7626, 1.4004, -0.7052],
[ 0.2467, -2.0493, 0.3164, 0.9811, 0.7766, -0.2716]],
[[-1.3599, 0.1510, 0.9750, 0.0135, 1.3658, -1.1453],
[-1.0742, -0.2285, 1.3229, 0.5419, 0.8574, -1.4195],
[ 0.7563, -1.5332, 0.4498, -0.0397, 1.3679, -1.0011],
[-0.2690, -0.0986, -1.1021, 0.9688, 1.5993, -1.0985],
[-0.3656, -0.8214, 0.2097, 0.7191, 1.6452, -1.3870]]],
grad_fn=<NativeLayerNormBackward0>) torch.Size([2, 5, 6])
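A likely explanation (an assumption, not something verified above) is the default dropout=0.1 in TransformerEncoderLayer: a freshly constructed module is in training mode, so every forward pass samples a new random dropout mask, and even two direct calls to encoder_layer(src) would disagree. A minimal sketch of this hypothesis, switching both modules to eval mode to disable dropout:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# dropout=0.1 is the default; in training mode each forward pass
# samples a fresh dropout mask, so repeated calls give different outputs
encoder_layer = nn.TransformerEncoderLayer(d_model=6, nhead=2, batch_first=True)
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)

# eval() disables dropout, making both forward passes deterministic
encoder_layer.eval()
transformer_encoder.eval()

src = torch.rand(2, 5, 6)
with torch.no_grad():
    out_layer = encoder_layer(src)
    out_stack = transformer_encoder(src)

# TransformerEncoder deep-copies the layer it is given, so the weights
# are identical and the two outputs should now agree up to numerical noise
print(torch.allclose(out_layer, out_stack, atol=1e-5))
```

If this prints True, the mismatch in the snippet above comes from dropout randomness rather than from TransformerEncoder doing anything different with a single layer.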