With one GPU and a batch size of 14, an epoch on my data set takes about 24 minutes. With 2 GPUs and a batch size of 28, it still takes 24 minutes per epoch. Any suggestions on what might be going wrong? Does the batch-normalization layer try to normalize across both GPUs, and thereby add a large amount of extra memory traffic? Please say it doesn't.
Thanks.
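For context, this is roughly what the setup looks like. A minimal sketch with a toy model and made-up shapes, not the actual vnet code:

import torch
import torch.nn as nn

# Toy stand-in for the real network.
model = nn.Sequential(nn.Conv2d(1, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
if torch.cuda.device_count() > 1:
    # dim-0 scatter: each batch of 28 is split into two chunks of 14,
    # one replica per GPU. As far as I can tell, each replica's BatchNorm
    # sees only its own 14-sample chunk; stock DataParallel does not
    # synchronize batch-norm statistics across GPUs.
    model = nn.DataParallel(model, device_ids=[0, 1])
model = model.cuda()

x = torch.randn(28, 1, 64, 64).cuda()  # batch of 28, scattered as 14 + 14
out = model(x)                         # one replica runs per GPU

If that's what DataParallel is doing, each GPU should be seeing the same 14-sample workload as the single-GPU run.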
Top shows 2 CPUs saturated:
Tasks: 255 total,   1 running, 254 sleeping,   0 stopped,   0 zombie
%Cpu(s): 16.3 us,  2.5 sy,  0.1 ni, 81.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 65885928 total, 40001592 free, 11878640 used, 14005696 buff/cache
KiB Swap: 67017724 total, 67017724 free,        0 used. 52840116 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 4622 mmacy     20   0 49.225g 5.809g 977604 S 200.0  9.2 111:32.30 work/vnet.base.
The memory allocation on the two GPUs is also uneven. If they're both doing the same operations with the same per-GPU batch size, why is GPU 1 using a third more memory than GPU 0? (nvidia-smi output below, followed by a quick sketch for checking per-device allocation.)
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 0000:01:00.0      On |                  N/A |
| 51%   82C    P2    74W / 250W |   7906MiB / 12186MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    Off  | 0000:02:00.0     Off |                  N/A |
| 47%   78C    P2   107W / 250W |  10326MiB / 12189MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1086    G   /usr/lib/xorg/Xorg                             105MiB |
|    0      8469    C   work/vnet.base.20170316_0434                  7797MiB |
|    1      8469    C   work/vnet.base.20170316_0434                 10323MiB |
+-----------------------------------------------------------------------------+
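For comparison with the nvidia-smi numbers, here's a quick way to see what the framework itself has allocated per device (assuming a PyTorch build that exposes torch.cuda.memory_allocated / memory_reserved; nvidia-smi additionally counts the per-process CUDA context and the caching allocator's reserved-but-unused pool, so the totals won't match exactly):

import torch

for dev in range(torch.cuda.device_count()):
    alloc = torch.cuda.memory_allocated(dev) / 2**20
    reserved = torch.cuda.memory_reserved(dev) / 2**20
    print(f"GPU {dev}: {alloc:.0f} MiB allocated, {reserved:.0f} MiB reserved")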
I also see that parallel_apply in data_parallel relies on Python threading, which doesn't buy much given how much of the code has to run under the GIL. The only way to get any reasonable parallelism out of regular, GIL-protected Python code is to run separate Python processes.
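To illustrate, here is a stripped-down sketch of that one-thread-per-replica pattern (toy model and shapes are made up): every line of Python in forward() contends for the single GIL, and only the asynchronous CUDA kernels those lines launch can actually overlap.

import copy
import threading

import torch
import torch.nn as nn

# Two replicas of a toy model, one per GPU, mimicking what
# replicate() + parallel_apply() do inside DataParallel.
base = nn.Linear(512, 512)
replicas = [copy.deepcopy(base).to(f"cuda:{i}") for i in range(2)]
inputs = [torch.randn(14, 512, device=f"cuda:{i}") for i in range(2)]
results = [None, None]

def worker(i):
    # All Python bytecode here holds the GIL, so the two threads take
    # turns; only the CUDA kernels they launch run concurrently.
    results[i] = replicas[i](inputs[i])

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()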
Are other people actually seeing a speedup from DataParallel? I suspect only one thread is making progress at a time.