What is the specific reason to get different results on cuda and without cuda?

Can I get a straight answer as to why this is happening with a simple MLPClassifier?
Is it even possible? I am using the same seed, the same data, and the same method.


I don’t know how you are comparing the models and outputs, as you haven’t shared any code, but note that the random number generators are not the same on the CPU and the GPU, so seeding won’t create the same values.
If you want to compare the outputs of different devices, store and load a state_dict so that all parameters and buffers of the model will be equal.
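A minimal sketch of that workflow (the small MLP architecture here is hypothetical, chosen only for illustration):

```python
import torch
import torch.nn as nn

# Build and initialize the model on the CPU.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Store all parameters and buffers.
torch.save(model.state_dict(), "model.pt")

# Recreate the model (e.g. in another run or for another device)
# and load the identical parameters instead of re-initializing.
model2 = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model2.load_state_dict(torch.load("model.pt"))
if torch.cuda.is_available():
    model2.cuda()

# Both models now start from exactly the same values, so any remaining
# difference in the outputs comes from the computation, not the init.
x = torch.randn(3, 4)
out_cpu = model(x)
out_dev = model2(x.to(next(model2.parameters()).device))
print(torch.allclose(out_cpu, out_dev.cpu(), atol=1e-5))
```

With equal parameters, any remaining mismatch can be attributed to the numerics of the forward pass itself.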


@ptrblck Would it be possible that, since Linear on GPU (Ampere) makes use of TensorFloat32, there are some discrepancies due to the smaller mantissa?

  • GPU: 10 bits of mantissa with TensorFloat32 + FMA in float32
  • CPU: 23 bits of mantissa + FMA in float64 before rounding to float32
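To rule TF32 out as the source of a discrepancy, it can be disabled explicitly (these are the standard `torch.backends` flags; they only take effect on Ampere+ GPUs and are harmless no-ops elsewhere):

```python
import torch

# Disable TF32 so float32 matmuls and cuDNN convolutions keep the
# full 23-bit float32 mantissa, at the cost of some speed on Ampere+.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

print(torch.backends.cuda.matmul.allow_tf32)  # False
```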

The computation is independent of the initialization, but yes, if TF32 is enabled (note that it’s disabled for linear layers by default starting in 1.12.0) the outputs would differ. However, even in float32 (or any other dtype) you would expect to see numerical mismatches within the limited floating point precision, as most likely different algorithms are used.

I am not sure if that’s always the case, but in most scenarios the data is generated on the CPU and then copied to the GPU via .to("cuda:0"). In that case, when using the same seed, you should have the same data. Therefore, the discrepancy would come either from the quantization error due to the smaller mantissa and/or from a different algorithm being used in the computation (I believe non-deterministic algorithms can be used by default, see the cuBLAS section of the CUDA Toolkit documentation; also, the CPU and GPU might not both use FMA).
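The data-copying part of this can be checked directly: copying preserves values bit-for-bit, while sampling natively on each device with the same seed does not, since the generators differ (a small sketch):

```python
import torch

torch.manual_seed(0)
x_cpu = torch.randn(4, 4)          # drawn from the CPU generator

torch.manual_seed(0)
x_same = torch.randn(4, 4)         # same seed, same generator -> identical
print(torch.equal(x_cpu, x_same))  # True

if torch.cuda.is_available():
    # Copying to the GPU preserves the values bit-for-bit ...
    x_gpu = x_cpu.to("cuda:0")
    print(torch.equal(x_gpu.cpu(), x_cpu))      # True
    # ... but sampling directly on the GPU uses a different generator
    # implementation, so the same seed yields different numbers:
    torch.manual_seed(0)
    x_native = torch.randn(4, 4, device="cuda:0")
    print(torch.equal(x_native.cpu(), x_cpu))   # almost surely False
```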

Please correct me if I am wrong, @ptrblck; I have learned a lot from reading your answers over the past years.

Yes, your description looks correct. If you sample all data (and parameters) on the CPU and push them to the GPU, you would expect to reuse the same values. However, even then the pseudorandom number generators use different implementations, and e.g. dropout layers would behave differently (the CUDA implementation has a vectorized code path, for example).
Also you are correct that different algorithms could create different small errors due to the limited floating point precision.
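The "different algorithms, small errors" point can be illustrated without a GPU. Here the same float32 values are accumulated in float32 versus float64 as a stand-in for two different reduction algorithms; the results disagree slightly, and both are within float32 precision:

```python
import torch

torch.manual_seed(0)
x = torch.randn(100_000)

# Same input values, two accumulation strategies: summing in float32
# vs. first casting to float64. Neither result is "wrong"; they simply
# round differently, just as CPU and GPU kernels may.
s32 = x.sum()            # float32 accumulation
s64 = x.double().sum()   # float64 accumulation
print((s32.double() - s64).abs().item())  # tiny, typically nonzero
```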