I’m implementing distributed evaluation using DistributedDataParallel, and so far, with the help of this forum, it works quite well.
However, I noticed that when running it on two GPUs (Titan V), the second GPU is quite a bit slower than the first one.
The first GPU does around 4 it/s, whereas the second GPU only manages 3 it/s. I was wondering if there is a problem with my implementation, so here is the code:
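(Reduced to a minimal, self-contained sketch; the linear model and random dataset below are placeholders for my real ones.)

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler


def evaluate(rank, world_size):
    # One process per GPU; NCCL backend for GPU collectives.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Placeholder model and data -- the real script uses its own.
    model = torch.nn.Linear(128, 10).cuda(rank)
    model = DDP(model, device_ids=[rank])
    model.eval()

    dataset = TensorDataset(torch.randn(10_000, 128),
                            torch.randint(0, 10, (10_000,)))
    # DistributedSampler splits the dataset across ranks without overlap.
    sampler = DistributedSampler(dataset, shuffle=False)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler,
                        num_workers=2, pin_memory=True)

    correct = torch.zeros(1, device=rank)
    with torch.no_grad():
        for inputs, targets in loader:
            inputs = inputs.cuda(rank, non_blocking=True)
            targets = targets.cuda(rank, non_blocking=True)
            preds = model(inputs).argmax(dim=1)
            correct += (preds == targets).sum()

    # Aggregate the per-rank counts and report on rank 0.
    dist.all_reduce(correct, op=dist.ReduceOp.SUM)
    if rank == 0:
        print(f"accuracy: {correct.item() / len(dataset):.4f}")
    dist.destroy_process_group()


if __name__ == "__main__":
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    world_size = torch.cuda.device_count()
    torch.multiprocessing.spawn(evaluate, args=(world_size,), nprocs=world_size)
```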
Do you see any obvious flaws?
Both GPUs are using roughly the same amount of memory.
I also noticed that GPU 1 draws less power than GPU 0 most of the time.
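(For reference, this is how I log memory and power per device from Python, assuming the nvidia-ml-py package is installed; it reports the same numbers as nvidia-smi.)

```python
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)    # bytes
    power = pynvml.nvmlDeviceGetPowerUsage(handle)  # milliwatts
    print(f"GPU {i}: {mem.used / 2**20:.0f} MiB used, {power / 1000:.1f} W")
pynvml.nvmlShutdown()
```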
Is there any explanation for the performance loss on the second GPU?
Best,
Thorsten
PS:
I’m using PyTorch 2.1.0.dev20230719 because of the problems described here.
I don’t know what kind of setup you are using, but did you profile the devices in other applications (i.e., outside of PyTorch)? E.g., is the PCIe bandwidth the same for all devices, or is GPU1 using fewer lanes, etc.?
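One quick way to check the link configuration from Python (assuming the nvidia-ml-py package is installed) would be something like:

```python
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    # Current vs. maximum PCIe link width (lanes) and generation.
    cur_w = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)
    max_w = pynvml.nvmlDeviceGetMaxPcieLinkWidth(handle)
    cur_g = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)
    max_g = pynvml.nvmlDeviceGetMaxPcieLinkGeneration(handle)
    print(f"GPU {i}: PCIe gen {cur_g}/{max_g}, x{cur_w}/x{max_w}")
pynvml.nvmlShutdown()
```

Note that the current link width and generation can drop while a GPU is idle to save power, so run this check while the device is under load.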