Say we have two ways of slicing when computing the output of a convolutional layer:
import torch
import torch.nn as nn

# N  - batch size
# CI - number of in channels
# CO - number of out channels
# T  - sequence length (think of e.g. a time series 0, ..., t, ..., T)
N, CI, CO, T = 4, 3, 2, 100  # example sizes
conv = nn.Conv1d(CI, CO, kernel_size=1)
x = torch.rand(N, CI, T)

# method 1: slice first, then convolve
out1 = conv(x[:, :, -5:])
# method 2: convolve everything, then slice
out2 = conv(x)[:, :, -5:]
Theoretically, out1 and out2 should result in the same gradient updates for the conv kernel, but method 1 requires less computation, since it only convolves the last 5 time steps instead of the whole tensor x. In both cases we discard all output elements except the last 5, so the gradients should not be influenced by any earlier points. Is that right? Or do the earlier points along the T-dimension of x also influence the gradient computation in method 2?
*) To be precise, the gradient computations are mathematically equivalent: with kernel_size=1, each output position depends only on the input at that same position, so slicing before or after the convolution involves exactly the same inputs. The two methods do, however, perform the (mathematically equivalent) operations in different orders, which leads to differing floating-point round-off error. That is why equal() tests on the gradients return False, while allclose() tests return True.
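A minimal sketch to verify this claim (module names and sizes here are illustrative, not from the original post): run both methods on two copies of the same layer, backpropagate, and compare the accumulated kernel gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, CI, CO, T = 4, 3, 2, 100  # example sizes

# two Conv1d layers with identical weights, one per method
conv1 = nn.Conv1d(CI, CO, kernel_size=1)
conv2 = nn.Conv1d(CI, CO, kernel_size=1)
conv2.load_state_dict(conv1.state_dict())

x = torch.rand(N, CI, T)

# method 1: slice first, then convolve
conv1(x[:, :, -5:]).sum().backward()
# method 2: convolve everything, then slice
conv2(x)[:, :, -5:].sum().backward()

g1 = conv1.weight.grad
g2 = conv2.weight.grad
print(torch.equal(g1, g2))     # may be False: different op order, different round-off
print(torch.allclose(g1, g2))  # the gradients agree up to round-off
```

Only the last 5 positions of the method-2 output receive a nonzero upstream gradient, so both backward passes accumulate contributions from the same 5 input positions.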