Improving precision error in model parallel setting

Code:

import torch
import torch.nn as nn

model0 = nn.Linear(10, 10)
model1 = nn.Linear(10, 10)
model2 = nn.Linear(10, 10)
model3 = nn.Linear(10, 10)

input = torch.rand(128, 10)

# reference output computed entirely on the CPU
output = model3(model2(model1(model0(input))))

# move each layer to its own GPU
model0.to('cuda:0')
model1.to('cuda:1')
model2.to('cuda:2')
model3.to('cuda:3')

# model parallel forward pass, moving the activations between devices
pred = model3(model2(model1(model0(input.to('cuda:0')).to('cuda:1')).to('cuda:2')).to('cuda:3')).cpu()

assert torch.allclose(output, pred)  # fails: allclose returns False

How can I fix this test? The maximum difference between output and pred is just 6.3330e-08, so I think the implementation is not wrong but rather some system-level issue.

Is there any option to make this work exactly?

I don’t want to lose precision, so adding atol=1e-7 to torch.allclose is not what I want.

Thanks,

This small difference is most likely a result of the limited floating point precision, which can be seen e.g. by changing the order of operations:

x = torch.randn(10, 10, 10)

# the same reduction, computed in a different order
sum1 = x.sum()
sum2 = x.sum(0).sum(0).sum(0)

print(sum1 - sum2)
> tensor(-3.8147e-06)

Avoiding these differences is especially hard (or impossible) across different hardware.
What is your current setup, i.e. which GPUs are you using?
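
As a rough illustration of the hardware point (a minimal sketch, assuming a single CUDA device is available), even one matmul can already differ slightly between the CPU and the GPU, since the kernels accumulate the sums in a different order:

import torch

a = torch.randn(256, 256)
b = torch.randn(256, 256)

cpu_result = a @ b
gpu_result = (a.to('cuda:0') @ b.to('cuda:0')).cpu()

# usually a small, nonzero value on the order of 1e-6 in float32
print((cpu_result - gpu_result).abs().max())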

Wow… your test case gives me the creeps…
Thanks, though. I’m using 4 Titan Xps, and I just increased atol to 1e-7 to make my test case pass.
Is there any technical paper showing that cascaded precision errors cause performance degradation in machine learning?

Not that I’m aware of, and usually you should account for the limited floating point precision in your calculations. I.e., if your current use case needs more precision, you would have to use float64, which increases the precision (but is still limited).
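
For instance (a minimal sketch, assuming a single CUDA device and a single linear layer standing in for your pipeline), casting the module and the input to float64 typically shrinks the CPU/GPU mismatch by several orders of magnitude, at the cost of speed and memory:

import copy

import torch
import torch.nn as nn

x = torch.rand(128, 10)
model32 = nn.Linear(10, 10)
model64 = copy.deepcopy(model32).double()

# float32: CPU reference vs. the same layer evaluated on the GPU
ref32 = model32(x)
gpu32 = copy.deepcopy(model32).to('cuda:0')(x.to('cuda:0')).cpu()

# float64: the same comparison in double precision
ref64 = model64(x.double())
gpu64 = copy.deepcopy(model64).to('cuda:0')(x.double().to('cuda:0')).cpu()

print((ref32 - gpu32).abs().max())  # typically around 1e-7 in float32
print((ref64 - gpu64).abs().max())  # usually several orders of magnitude smaller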
