I am observing different results when using `nn.Linear` on different GPU configurations with the same input data

Environment

  • PyTorch Version: 1.8.1+cu111
  • GPU: NVIDIA A30 (MIG 12g and MIG 6g instances)
  • Python Version: 3.8.8

Steps to Reproduce

  1. Initialize a model with an `nn.Linear` layer.
  2. Perform a forward pass with the same input data on different GPU configurations.
  3. Observe different output features.
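The steps above can be sketched as follows (a guess at the setup — the layer sizes, batch size, and seed are placeholders, not the original code; run it once per MIG configuration and compare the printed value):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Step 1: a model with a single nn.Linear layer (sizes are illustrative).
model = nn.Linear(128, 64).to(device)

# Step 2: the same input data on each GPU configuration (fixed seed).
x = torch.randn(32, 128, device=device)

with torch.no_grad():
    out = model(x)

# Step 3: print a summary of the output so runs on different
# MIG instances can be compared.
print(out.abs().sum().item())
```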

What is the reason for this?

Hi Wanxin!

You can’t expect exact equality across different architectures (or versions,
GPU vs. CPU, batch sizes, etc.). Your results differ by amounts consistent
with floating-point round-off error, as is to be expected.
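You can see this effect without any GPU at all: summing the same float32 numbers in different orders accumulates different round-off error. This is a toy illustration of the mechanism, not PyTorch's actual kernel logic:

```python
import torch

torch.manual_seed(0)
x = torch.randn(10000, dtype=torch.float32)

# Three mathematically identical sums, computed in different orders.
s_reduce = x.sum()                         # PyTorch's reduction
s_sequential = x.cumsum(0)[-1]             # strictly left-to-right
s_chunked = torch.stack([c.sum() for c in x.chunk(7)]).sum()

# The results typically agree only to within float32 round-off,
# not bit-for-bit.
print(s_reduce.item(), s_sequential.item(), s_chunked.item())

# They are close, but exact equality is not guaranteed.
assert (s_reduce - s_sequential).abs() < 1.0
assert (s_reduce - s_chunked).abs() < 1.0
```

Different GPU configurations can likewise lead the library to pick different reduction orders, producing the small discrepancies you observed.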

Best.

K. Frank


Thank you for your response.
You are correct, but I still have a question: when I use only convolutional layers, this discrepancy does not occur. Is it because the underlying implementations of linear layers and convolutional layers are different?

Hi Wanxin!

Well, yes, strictly speaking, linear and convolutional layers are different.
But I don’t think that’s really the point.

I fully believe that on some sets of architectures, you will get round-off-error
discrepancies with convolutional layers, just as you saw with linear layers.
I would say that it happened to be the case that, for your particular
convolutional layers and your particular GPU configurations, PyTorch chose
the same orderings of floating-point operations, and therefore no
discrepancy arose.

That is, what you saw was happenstance, rather than convolutional layers
having some special behavior.
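One way to see that the two layer types compute the same arithmetic (so any cross-device discrepancy comes from the kernel and operation ordering chosen, not the mathematics) is to express a linear layer as a 1x1 convolution. This is a toy check on the CPU, not a description of PyTorch internals:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lin = nn.Linear(16, 8, bias=True)
conv = nn.Conv2d(16, 8, kernel_size=1, bias=True)

# Copy the linear weights into the 1x1 conv so both layers
# represent the same affine map.
with torch.no_grad():
    conv.weight.copy_(lin.weight.view(8, 16, 1, 1))
    conv.bias.copy_(lin.bias)

x = torch.randn(4, 16)
y_lin = lin(x)                                 # linear kernel
y_conv = conv(x.view(4, 16, 1, 1)).view(4, 8)  # conv kernel, same math

# Same mathematics, but potentially different kernels and
# floating-point orderings, so compare with a tolerance rather
# than expecting bit-for-bit equality.
assert torch.allclose(y_lin, y_conv, atol=1e-5)
```

Which kernel (and hence which ordering of floating-point operations) gets picked can depend on the layer shapes and the hardware, which is why one layer type may happen to match across configurations while another does not.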

Best.

K. Frank