What is the specific reason for getting different results with and without CUDA?

Can I get a straight answer as to why this is happening with a simple MLPClassifier?
Is it even possible? I am using the same seed, the same data, and the same method.

Regards

I don’t know how you are comparing the models and outputs as you haven’t shared any code, but note that the random number generators are not the same on the CPU and the GPU, so seeding alone won’t create the same values.
If you want to compare the outputs on different devices, store and load a state_dict so that all parameters and buffers of the model are equal.
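
To make this concrete, here is a minimal sketch (the small `nn.Sequential` model, the file name, and the shapes are just placeholders for whatever MLP classifier you are using):

```python
import torch
import torch.nn as nn

# Toy stand-in for the actual classifier; any nn.Module works the same way.
def make_model():
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

model_cpu = make_model()
torch.save(model_cpu.state_dict(), "mlp_state.pt")

# Load the identical parameters into a second instance and move it to the GPU.
model_gpu = make_model()
model_gpu.load_state_dict(torch.load("mlp_state.pt"))
model_gpu.to("cuda:0")

# Same input on both devices; the outputs should now only differ by small
# floating point errors, not because of different random initializations.
x = torch.randn(8, 16)
out_cpu = model_cpu(x)
out_gpu = model_gpu(x.to("cuda:0")).cpu()
print((out_cpu - out_gpu).abs().max())
```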


@ptrblck Would it be possible that, since Linear on GPU (Ampere) makes use of TensorFloat32, there are some discrepancies due to the smaller mantissa?

  • GPU: 10 bits of mantissa with TensorFloat32 + FMA in float32
  • CPU: 23 bits of mantissa + FMA in float64 before rounding to float32
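
A crude way to see the effect of the shorter mantissa is to truncate the low 13 mantissa bits of a float32 value (this is only an approximation, since TF32 actually rounds rather than truncates):

```python
import torch

x = torch.randn(4, dtype=torch.float32)

# Emulate a 10-bit mantissa by clearing the lowest 13 of the 23 float32
# mantissa bits (truncation, not proper rounding, so only an approximation).
x_tf32_like = (x.view(torch.int32) & ~((1 << 13) - 1)).view(torch.float32)

print(x - x_tf32_like)  # differences on the order of the lost mantissa bits
```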

The computation is independent of the initialization, but yes, if TF32 is enabled (note that it has been disabled by default for linear layers since 1.12.0) the output would be different. However, even in float32 (or any other dtype) you would expect to see numerical errors due to the limited floating point precision, as most likely different algorithms are used.
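
You can check the TF32 effect directly via the matmul flag; a quick sketch (the shapes are arbitrary, and a CUDA device is assumed):

```python
import torch

a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)
ref = a.double() @ b.double()  # float64 reference computed on the CPU

for allow_tf32 in (True, False):
    torch.backends.cuda.matmul.allow_tf32 = allow_tf32
    err = ((a.cuda() @ b.cuda()).cpu().double() - ref).abs().max().item()
    print(f"GPU, allow_tf32={allow_tf32}: max abs error {err:.3e}")

# The CPU float32 result is also not exact; it just carries a different error.
err_cpu = ((a @ b).double() - ref).abs().max().item()
print(f"CPU float32: max abs error {err_cpu:.3e}")
```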

I am not sure if that’s always the case, but in most scenarios the data is generated on the CPU and then copied to the GPU using .to("cuda:0"). In that case, when using the same seed, you should have the same data. The discrepancy would therefore come from the rounding error due to the smaller mantissa and/or from a different algorithm used in the computation (I think cuBLAS may pick non-deterministic algorithms by default, see the cuBLAS :: CUDA Toolkit Documentation, and the CPU and GPU might not both use FMA).
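
For example (assuming the data really is sampled on the CPU first), something like this shows that the copy itself does not change the values, so any difference has to come from the computation:

```python
import torch

torch.manual_seed(0)
x = torch.randn(64, 16)             # sampled on the CPU
x_gpu = x.to("cuda:0")              # copied, not re-sampled

print(torch.equal(x, x_gpu.cpu()))  # True: the transfer is bitwise exact

w = torch.randn(16, 16)
# The inputs match, but the matmul may use different algorithms / FMA usage,
# so the outputs can differ by small floating point errors.
print((x @ w - (x_gpu @ w.cuda()).cpu()).abs().max())
```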

Please correct me if I am wrong, @ptrblck; I have learned a lot from reading your answers over the past years.

Yes, your description looks correct. If you sample all of the data (and parameters) on the CPU and push them to the GPU, you would expect to reuse the same values. However, even then the pseudorandom number generators use different implementations, so e.g. dropout layers would behave differently (the CUDA implementation has a vectorized code path, for example).
You are also correct that different algorithms could create small differences due to the limited floating point precision.
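
A quick way to see the dropout difference (the shape, seed, and drop probability are arbitrary): with the same seed the CPU and CUDA masks will generally not line up.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

torch.manual_seed(0)
out_cpu = drop(x)

torch.manual_seed(0)                  # same seed, but a different generator/kernel
out_gpu = drop(x.to("cuda:0")).cpu()

# The dropped positions typically differ between the two devices.
print(out_cpu)
print(out_gpu)
```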