Improving precision error in model parallel setting

Code:

import torch
import torch.nn as nn

model0 = nn.Linear(10, 10)
model1 = nn.Linear(10, 10)
model2 = nn.Linear(10, 10)
model3 = nn.Linear(10, 10)

input = torch.rand(128, 10)

# reference output computed entirely on the CPU
output = model3(model2(model1(model0(input))))

# move each layer to its own GPU
model0.to('cuda:0')
model1.to('cuda:1')
model2.to('cuda:2')
model3.to('cuda:3')

# model parallel forward pass, moving the activations between devices
pred = model3(model2(model1(model0(input.to('cuda:0')).to('cuda:1')).to('cuda:2')).to('cuda:3')).cpu()

assert torch.allclose(output, pred)  # fails: allclose returns False

How can I fix this test? The maximum difference between output and pred is just 6.3330e-08, so I think the implementation is not wrong but rather some system-level issue.

Is there any option to make this work exactly?

I don’t want to lose precision, so adding atol=1e-7 to torch.allclose is not what I want.

Thanks,

This small difference is most likely a result of the limited floating point precision, which can be seen e.g. by changing the order of operations:

x = torch.randn(10, 10, 10)

# the same reduction, computed in a different order
sum1 = x.sum()
sum2 = x.sum(0).sum(0).sum(0)

print(sum1 - sum2)
> tensor(-3.8147e-06)

Avoiding these differences is especially hard (or impossible) across different hardware.
What is your current setup, i.e. which GPUs are you using?
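
As a rough illustration of the hardware point (a minimal sketch, assuming a single CUDA device is available), even one matmul can already differ slightly between the CPU and the GPU, since the kernels accumulate the sums in a different order:

import torch

a = torch.randn(256, 256)
b = torch.randn(256, 256)

cpu_result = a @ b
gpu_result = (a.to('cuda:0') @ b.to('cuda:0')).cpu()

# usually a small, nonzero value on the order of 1e-6 in float32
print((cpu_result - gpu_result).abs().max())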

Wow… your test case gives me the creeps…
Thanks, though. I’m using 4 Titan Xps, and I just increased atol to 1e-7 to make my test case pass.
Is there any technical paper showing that cascaded precision errors cause performance degradation in machine learning?

Not that I’m aware of, and usually you should account for the limited floating point precision in your calculations. I.e., if your current use case needs more precision, you would have to use float64, which increases the precision (but is still limited).
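
For instance (a minimal sketch, assuming a single CUDA device and a single linear layer standing in for your pipeline), casting the module and the input to float64 typically shrinks the CPU/GPU mismatch by several orders of magnitude, at the cost of speed and memory:

import copy

import torch
import torch.nn as nn

x = torch.rand(128, 10)
model32 = nn.Linear(10, 10)
model64 = copy.deepcopy(model32).double()

# float32: CPU reference vs. the same layer evaluated on the GPU
ref32 = model32(x)
gpu32 = copy.deepcopy(model32).to('cuda:0')(x.to('cuda:0')).cpu()

# float64: the same comparison in double precision
ref64 = model64(x.double())
gpu64 = copy.deepcopy(model64).to('cuda:0')(x.double().to('cuda:0')).cpu()

print((ref32 - gpu32).abs().max())  # typically around 1e-7 in float32
print((ref64 - gpu64).abs().max())  # usually several orders of magnitude smaller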
