Decomposing a conv into two smaller convs doesn't give the same grad

I am trying to break a conv2d down into two smaller convs along the out_channels dimension.
The outputs of the smaller convs are concatenated along the out_channels dimension to reconstruct the original output.
I copy the weights from the original conv and compare the outputs and gradients. I observe that while the outputs are identical, the gradients differ slightly (by around 2e-7). Although this error seems small, over a long training run (e.g. ResNet18 on CIFAR-10) it causes the network to diverge.

Here’s some sample code for you to try. The printed difference between the gradients is not zero.

import torch
import torch.nn as nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'

c_out = 513
c_temp = 1

# Original conv and the two smaller convs that split it along out_channels.
conv  = nn.Conv2d(2, c_out, kernel_size=2, bias=False).to(device)
conv1 = nn.Conv2d(2, c_temp, kernel_size=2, bias=False).to(device)
conv2 = nn.Conv2d(2, c_out - c_temp, kernel_size=2, bias=False).to(device)

# Copy the original weights into the two smaller convs.
with torch.no_grad():
    conv1.weight[:] = conv.weight[:c_temp]
    conv2.weight[:] = conv.weight[c_temp:]

inp = torch.rand(1, 2, 3, 3).to(device)
out  = conv(inp)
out1 = conv1(inp)
out2 = conv2(inp)

# Concatenate the two partial outputs to rebuild the full output.
cat = torch.cat((out1, out2), dim=1)

# Backpropagate the same upstream gradient through both versions.
grad = torch.ones(1, c_out, 2, 2).to(device)
cat.backward(grad)
out.backward(grad)

# The difference is not exactly zero (it is around 2e-7).
print(conv.weight.grad[:c_temp] - conv1.weight.grad)

Hi Mohammedreza!

I haven’t looked at your code, but I have a couple of comments.

An error of around 2e-7 is the size of the round-off error for single-precision numbers (torch.float32),
so it is to be expected.
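
For reference, torch.finfo reports the spacing between representable floating-point values, which sets the scale of this error:

import torch

# Machine epsilon is about 1.19e-7 for float32 and about 2.22e-16 for
# float64, so a difference of roughly 2e-7 is right at single-precision
# round-off level.
print(torch.finfo(torch.float32).eps)   # ~1.1921e-07
print(torch.finfo(torch.float64).eps)   # ~2.2204e-16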

By “diverge” do you mean that, when training, your original and composed
versions drift away from one another, with the difference between them
growing significantly larger, but each version training about as well as the
other (even though they differ)?

Or do you mean that your composed version fails to train stably, with weights
and/or gradients diverging off to inf?

If the former, it is to be expected, as the floating-point round-off error will
continue to accumulate as you train, so the two versions will differ from one
another. It’s not that one is right and the other is wrong – they just differ due
to accumulating round-off error and are arguably equally valid.

If the latter, you either have a bug somewhere, or your training is inherently
unstable, so that one version trains well, but an equivalent version that only
differs due to round-off error ends up running off to infinity.

You can check this by rerunning your test in double precision (torch.float64). If the deviation is
due to round-off error, it should drop to something on the order of 1e-15.
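
Here is a minimal sketch of that double-precision check, reusing the shapes and variable names from your snippet:

import torch
import torch.nn as nn

# Same setup as the original snippet, but in double precision.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
dtype = torch.float64

c_out, c_temp = 513, 1
conv  = nn.Conv2d(2, c_out, kernel_size=2, bias=False).to(device, dtype)
conv1 = nn.Conv2d(2, c_temp, kernel_size=2, bias=False).to(device, dtype)
conv2 = nn.Conv2d(2, c_out - c_temp, kernel_size=2, bias=False).to(device, dtype)

with torch.no_grad():
    conv1.weight[:] = conv.weight[:c_temp]
    conv2.weight[:] = conv.weight[c_temp:]

inp  = torch.rand(1, 2, 3, 3, device=device, dtype=dtype)
grad = torch.ones(1, c_out, 2, 2, device=device, dtype=dtype)

out = conv(inp)
cat = torch.cat((conv1(inp), conv2(inp)), dim=1)
cat.backward(grad)
out.backward(grad)

# If the deviation is pure round-off, this should be on the order of 1e-15.
print((conv.weight.grad[:c_temp] - conv1.weight.grad).abs().max())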

To check for stability issues, rerun your test multiple times with different
random initializations. If your training is on the edge of instability, sometimes
both versions will train successfully, sometimes they will both run off to infinity,
and sometimes one will succeed while the other fails. If, contrariwise, your
original version always trains successfully, while your composed version
always fails, you most likely have a bug in your decomposition.
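
Here is a hypothetical sketch of that kind of multi-seed check (build_and_train is an assumed helper that trains one of the two versions and reports whether its weights stayed finite):

import torch

# Rerun both versions from the same seed several times and record whether
# each one blows up.
def check_stability(build_and_train, n_trials=5):
    results = []
    for seed in range(n_trials):
        torch.manual_seed(seed)
        ok_original = build_and_train(decomposed=False)    # True if weights stayed finite
        torch.manual_seed(seed)
        ok_decomposed = build_and_train(decomposed=True)
        results.append((seed, ok_original, ok_decomposed))
    return results

# If the original always succeeds while the decomposed version always fails,
# suspect a bug in the decomposition; mixed results suggest borderline
# training stability.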

Best.

K. Frank