Is VBIOS version significant to report for reproducibility?

Amos_Haviv_Hason · August 22, 2024, 12:07am

I’m trying to make the reproducibility report for a model I recently developed as specific as possible.

Understandably, training results differ across machines with different GPU models. However, while the results are consistent on all our RTX 4090 machines, our RTX 3090 machines produce one of two distinct results.

After digging deeper, I found that our RTX 4090 machines run one of two VBIOS versions - 95.02.3C.40.95 or 95.02.18.C0.75. However, the results are the same on both.

Our RTX 3090 machines run several VBIOS versions, which fall into two groups - 94.02.42.??.?? and 94.02.26.??.??. Machines in the 94.02.42.??.?? group all produce one result, and machines in the 94.02.26.??.?? group all produce another, different result.

I didn’t find any differences in the frequencies of the RTX 3090 cards, though.

So my question is: can different VBIOS versions of the same GPU model lead to different training results? This is not the case on our RTX 4090 machines, but it might explain the inconsistency on our RTX 3090 machines.
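
For reference, the per-machine details discussed above can be collected with something like the following. This is only a sketch: it assumes `nvidia-smi` is on the `PATH`, and the helper names `parse_gpu_csv` and `query_gpus` are made up for illustration.

```python
import csv
import io
import shutil
import subprocess

# Fields requested from nvidia-smi, in the order they appear in each CSV row.
FIELDS = ["name", "vbios_version", "driver_version"]


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV output (without header) into a list of dicts."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        dict(zip(FIELDS, (cell.strip() for cell in row)))
        for row in rows
        if row  # skip blank lines
    ]


def query_gpus():
    """Return one dict per GPU, or an empty list if nvidia-smi is not found."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

It is probably worth recording `torch.__version__`, `torch.version.cuda`, and `torch.backends.cudnn.version()` alongside this, since library versions can also differ between machines.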

ptrblck · August 22, 2024, 12:50pm

I doubt it, but it would be interesting to see which kernel is causing the different results.

Amos_Haviv_Hason · August 22, 2024, 2:26pm

What do you mean by kernel?

ptrblck · August 22, 2024, 5:35pm

The PyTorch operation causing the non-determinism.
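
One possible way to narrow this down is to fingerprint every module's output during a single forward pass on a machine from each group, then look for the first module whose fingerprints disagree. The sketch below assumes both machines load the same checkpoint and feed the same input batch. The helper names `record_fingerprints` and `first_divergence` are made up for illustration. It covers only the forward pass and only modules that return a single tensor.

```python
import hashlib


def record_fingerprints(model, *inputs):
    """Run one forward pass and return {module_name: sha256 of output bytes}.

    Entries are stored in the order the modules finish their forward call.
    A module that is called more than once keeps only its last output.
    """
    import torch  # imported lazily so first_divergence() works without PyTorch

    fingerprints = {}
    handles = []

    def make_hook(name):
        def hook(module, args, output):
            if torch.is_tensor(output):  # tuple/dict outputs are skipped
                # Reinterpret the raw values as bytes so any dtype can be hashed.
                raw = output.detach().cpu().contiguous().reshape(-1).view(torch.uint8)
                fingerprints[name] = hashlib.sha256(raw.numpy().tobytes()).hexdigest()
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name or "<root>")))
    try:
        with torch.no_grad():
            model(*inputs)
    finally:
        for handle in handles:
            handle.remove()
    return fingerprints


def first_divergence(a, b):
    """Return the first module name (in a's order) whose fingerprint differs."""
    for name, digest in a.items():
        if b.get(name) != digest:
            return name
    return None
```

Each machine can dump its dictionary with `json.dump`, and `first_divergence` can then be run on the two loaded files anywhere, even without a GPU. If the results also vary between runs on the same machine, `torch.use_deterministic_algorithms(True)` makes PyTorch raise an error for operations that have no deterministic implementation, which is another way to locate the operation.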