Heterogeneous GPUs but same computation time

Hello,

I’m using an RTX 2080 Ti and a GTX 1050 Ti in a two-node cluster with PyTorch. The problem comes when I run the training distributed: both of them take the same time to solve MNIST, even though there are no sync points. Can anyone help me?

My CUDA version is 10.0 on the RTX and 9.2 on the GTX. I’m using PyTorch 1.2 with MPI 3.1.

Are you using DDP?
If so, the slower card will hold back the faster one, since DDP all-reduces the gradients during the backward pass and every rank waits for that reduction to finish.
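Roughly, a step looks like this (a minimal sketch with a placeholder model, random stand-in data, and CPU tensors; swap in your own backend, devices, and MNIST loader):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Backend is an assumption; use whatever you initialize in your own script.
dist.init_process_group(backend="mpi")

model = torch.nn.Linear(784, 10)            # placeholder for your MNIST model
model = DDP(model)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for _ in range(100):
    data = torch.randn(64, 784)             # stand-in for a real MNIST batch
    target = torch.randint(0, 10, (64,))
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(data), target)
    loss.backward()                          # DDP all-reduces gradients here, so every rank waits
    opt.step()
```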

Or are you timing the cards separately? If so, your code might have other bottlenecks (e.g. data loading) that dominate the runtime. Have you profiled it?
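Something like this would give you a rough per-step time on each card in isolation (a sketch with a placeholder model and a fixed random batch, so data loading is deliberately excluded):

```python
import torch

# Rough per-step GPU timing; run this on each card separately.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

model = torch.nn.Linear(784, 10).cuda()      # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.01)
data = torch.randn(64, 784, device="cuda")   # fixed batch; a real loader adds I/O time
target = torch.randint(0, 10, (64,), device="cuda")

start.record()
for _ in range(100):
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(data), target)
    loss.backward()
    opt.step()
end.record()
torch.cuda.synchronize()
print("ms per step:", start.elapsed_time(end) / 100)
```

If both cards show similar numbers here, the bottleneck is most likely outside the GPU compute.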

We don’t guarantee compatibility between PyTorch builds compiled against different CUDA versions. You say you have one build compiled against CUDA 10 and another against CUDA 9.2. This might work, but YMMV.
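To rule that out, you could check what each node actually reports:

```python
import torch

# Run on both machines and compare the output.
print(torch.__version__)                 # PyTorch version
print(torch.version.cuda)                # CUDA version the binary was built against
print(torch.cuda.get_device_name(0))     # which GPU this process sees
```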

Yes, I’m using DDP, but I’m averaging the gradients with an asynchronous all_reduce, so there’s no synchronization if I don’t explicitly call a.wait() (which I’m not doing, just to test), right? Even in that case the training time is still the same, which makes no sense to me. Am I missing anything?
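To make sure we’re talking about the same thing, this is roughly the pattern I mean (a minimal sketch; the helper name and arguments are mine, and I call it after loss.backward() on each rank):

```python
import torch
import torch.distributed as dist

def average_gradients(model, world_size, wait=False):
    handles = []
    for p in model.parameters():
        if p.grad is not None:
            # async_op=True returns a work handle immediately instead of blocking
            handles.append(dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, async_op=True))
    if wait:
        # the "normal" path: block until every rank's contribution has arrived
        for h in handles:
            h.wait()
    # NOTE: with wait=False (my test) the division and the optimizer step can
    # race with the reduction, and each all_reduce still needs the matching
    # call from every other rank before it can complete.
    for p in model.parameters():
        if p.grad is not None:
            p.grad /= world_size
```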