Model trains on some machines but not others

Hi there,

I have a model that trains on some machines but not others. I have tried running it on 5 different machines:
Machine 1: trained before, but not anymore
Machine 2: still trains
Machine 3: has never trained properly
Machine 4: still trains
Machine 5: still trains

If I could I would just use machines 1, 2, or 5, but machines 1 and 3 don’t have enough RAM and machine 5 does not support CUDA.

Only machine 4 is Windows; all the others are Linux. I am running in a venv and have made sure the code is identical across all the machines, except where I need to remove multiprocessing on Windows.
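
For reference, the only platform-specific difference is roughly this (a simplified sketch, assuming PyTorch's DataLoader; `train_dataset` and the worker count are placeholders):

```python
import platform
from torch.utils.data import DataLoader

# DataLoader worker processes rely on multiprocessing, which I had to drop on
# Windows; everything else is identical across the machines.
num_workers = 0 if platform.system() == "Windows" else 4

# train_dataset is a placeholder for my actual Dataset object
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True,
                          num_workers=num_workers)
```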

Edit: I have now found out that changing the device to CPU on machines 1 and 3 allows the model to train properly, but I still don’t know why it won’t work with CUDA.
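
The change itself is just the device selection, roughly (sketch; `model` stands in for my actual network):

```python
import torch

# Forcing CPU makes training work on machines 1 and 3; with "cuda" the
# accuracy stays stuck at chance on those two machines.
device = torch.device("cpu")   # was: torch.device("cuda")
model = model.to(device)
```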

It might be an unsupported GPU (e.g., one with an older compute capability). Could you provide some more details about the setup on the machine that has a GPU but won’t work with CUDA?
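
Assuming this is PyTorch, printing something like the following on that machine would show whether the installed build was actually compiled for the card's architecture (rough sketch; the index 0 assumes a single GPU):

```python
import torch

print(torch.__version__, torch.version.cuda)    # framework build and its CUDA version
print(torch.cuda.is_available())                # does the framework see a usable GPU at all?
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))        # e.g. "NVIDIA GeForce GTX 1660 SUPER"
    print(torch.cuda.get_device_capability(0))  # compute capability, e.g. (7, 5)
    print(torch.cuda.get_arch_list())           # architectures this build supports
```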

The machine has a 1660 Super.

OK, can you provide some more details about the problem? Is it running out of CUDA memory (e.g., even with the minimum possible batch size)?
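
Even if no exception is raised, the allocator statistics after one training step would tell you (rough sketch, again assuming PyTorch):

```python
import torch

# Run one forward/backward pass on the GPU first, then:
print(torch.cuda.memory_allocated() / 1e6, "MB currently allocated")
print(torch.cuda.max_memory_allocated() / 1e6, "MB peak allocation")
print(torch.cuda.memory_summary())  # detailed report from the caching allocator
```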

That’s the issue I’m having: as far as I can tell, the model doesn’t run out of memory, and no errors are shown. Since I have 12 possible outputs (a one-hot matrix of 12), my baseline accuracy should be ~8.33%, which is what I’m getting even after 100 epochs when I say the model won’t train. But it trains properly on the CPU of all machines and on the GPU of the other machines.
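
(Just to spell out where the ~8.33% comes from, assuming the 12 classes are roughly balanced:)

```python
num_classes = 12
chance_accuracy = 1 / num_classes
print(f"{chance_accuracy:.2%}")  # 8.33% -- what an untrained or random classifier gets
```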

After removing all the non-determinism caused by the GPU and printing the gradients after each loss on machines 2 and 3, on both CPU and GPU, I can see that the loss is identical everywhere, and that the gradients on machine 2 (CPU and GPU) and machine 3 (CPU) are identical. For some reason, machine 3’s GPU does not produce identical gradients.
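
Roughly what the comparison looked like (simplified sketch; `model`, `criterion`, `inputs`, and `targets` are placeholders for my actual network, loss, and a fixed batch):

```python
import copy
import torch

torch.manual_seed(0)
torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True)  # may require CUBLAS_WORKSPACE_CONFIG to be set

def grads_on(device):
    # One forward/backward pass on the same fixed batch, starting from the same weights.
    m = copy.deepcopy(model).to(device)
    loss = criterion(m(inputs.to(device)), targets.to(device))
    print(device, loss.item())
    loss.backward()
    return [p.grad.detach().cpu() for p in m.parameters()]

for g_cpu, g_gpu in zip(grads_on("cpu"), grads_on("cuda")):
    print(torch.allclose(g_cpu, g_gpu), (g_cpu - g_gpu).abs().max().item())
```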

Unless the hardware is identical across the machines, I wouldn’t expect the gradients to be bitwise identical. For example, there can be subtle rounding differences because different architectures may use kernels that perform arithmetic operations (such as reductions) in a different order.
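
As a concrete illustration (a hypothetical example, not specific to any of your machines): summing the same float32 values with a different association typically does not give bit-identical results, and a backward pass is full of such reductions.

```python
import torch

torch.manual_seed(0)
x = torch.randn(1_000_000)  # float32

direct = x.sum()                                               # one global reduction
chunked = torch.stack([c.sum() for c in x.chunk(1000)]).sum()  # partial sums, then a sum of sums

print(direct.item(), chunked.item())
print((direct - chunked).abs().item())  # usually small but non-zero
```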