Hello, I’ve just installed a new machine with two 2080 Tis.
Compared with my other machine with two 1080 Tis (all other components, such as the CPU and mainboard, are identical), my code runs about 50% slower on the 2080 Tis. The configurations of the machines are:
BTW, it seems to be due to overheating of my device 0, which is located too close to device 1.
DataParallel makes device 1 wait for device 0, whose power is throttled.
When I test device 1 alone, it is faster than a single 1080 Ti.
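A quick way to confirm this kind of per-card disparity is to time each device in isolation. A minimal sketch (the matrix sizes, iteration counts, and helper name are my own choices, not the original poster's benchmark):

```python
import time

import torch


def time_matmul(device, size=2048, iters=10):
    """Rough per-device throughput check: time repeated matmuls."""
    x = torch.randn(size, size, device=device)
    if device.type == "cuda":
        torch.cuda.synchronize(device)   # make sure setup has finished
    start = time.perf_counter()
    for _ in range(iters):
        x = x @ x
        x = x / x.norm()                 # keep values from blowing up
    if device.type == "cuda":
        torch.cuda.synchronize(device)   # wait for all queued kernels
    return time.perf_counter() - start


devices = ([torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
           or [torch.device("cpu")])     # fall back to CPU if no GPU
for d in devices:
    n = 2048 if d.type == "cuda" else 256
    print(d, f"{time_matmul(d, size=n):.3f}s")
```

If device 0 is throttling, its time will drift upward run after run while device 1 stays flat.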
Ah, I had to switch to a mainboard that supports four PCIe x16 slots.
I then placed the cards to maximize the space between them: [device0] [empty] [device1] [empty].
That seems to fix it.
When they were installed as [device0][device1], device 0 got too hot (within 5 minutes) and throttled down.
It was fine when I used 1080s. A similar slowdown happened with the 1080 Tis as well, although I hadn't noticed it because it was not as pronounced or frequent as with the 2080 Tis.
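For anyone debugging the same symptom: thermal throttling can be watched directly via `nvidia-smi`. A small sketch that polls it from Python (the query keys are real `nvidia-smi` fields; the helper names are mine):

```python
import csv
import io
import subprocess

QUERY = "index,temperature.gpu,clocks_throttle_reasons.hw_thermal_slowdown"


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output rows."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text)):
        idx, temp, throttled = (f.strip() for f in fields)
        rows.append({"index": int(idx),
                     "temp_c": int(temp),
                     "thermal_throttle": throttled == "Active"})
    return rows


def read_gpu_stats():
    """Query the driver; raises FileNotFoundError if nvidia-smi is absent."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        text=True)
    return parse_gpu_stats(out)


if __name__ == "__main__":
    try:
        print(read_gpu_stats())
    except FileNotFoundError:
        print("nvidia-smi not found on this machine")
```

If `thermal_throttle` flips to `True` on device 0 a few minutes into training, the spacing fix above is the right diagnosis.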
However, I am seeing similar slowdowns with 2x 2080 Ti using DataParallel on CUDA 9.2. Single-card 2080 Ti performance is faster than a single 1080 Ti. In my case it does not appear to be a heat issue, as I have two blower-style 2080 Tis with temperatures always below 68 °C.
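For context, this is roughly the wrapping involved (toy model and made-up sizes, not the poster's actual code). Note that `DataParallel` scatters each batch and gathers the outputs every forward pass, so the slowest card gates the whole step:

```python
import torch
import torch.nn as nn

# Hypothetical toy model just to illustrate the wrapping.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)   # splits each batch across all GPUs
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

x = torch.randn(64, 128, device=device)
out = model(x)                       # scatter -> parallel forward -> gather
print(out.shape)                     # torch.Size([64, 10])
```

Because of that per-step gather, a slowdown on either card (thermal or otherwise) shows up as a slowdown of the whole pair.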
I found that when using a less common architecture, it is really important to set torch.backends.cudnn.benchmark = True.
This made training run 5x faster on my older Maxwell Titan, and it could also apply to the Turing cards.
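For completeness, the flag is set once at the top of the training script, before the first forward pass:

```python
import torch

# Ask cuDNN to benchmark its available convolution algorithms for the
# exact input shapes it first sees and cache the fastest choice.
# This helps when shapes are fixed across iterations, and can hurt
# when they change every batch (each new shape triggers a re-benchmark).
torch.backends.cudnn.benchmark = True
```

The first few iterations will be slower while cuDNN autotunes; steady-state throughput is what improves.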