Hi,
I am using DataParallel across two GPUs.
How should I set my batch size and learning rate? My loss is not decreasing, while on a single GPU it decreased.
Are there any rules of thumb for this, specifically for the learning rate and batch size?
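For reference, this is roughly my current setup. The dataset, model, and the exact values below are placeholders standing in for my real ones:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for my real dataset (placeholder shapes).
train_dataset = TensorDataset(torch.randn(1024, 128),
                              torch.randint(0, 10, (1024,)))

batch_size = 64        # total batch size; DataParallel should split it, 32 per GPU
learning_rate = 1e-3   # same value that converged for me on a single GPU

loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
model = nn.Linear(128, 10).cuda()   # placeholder for my actual network
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9)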
Also, it looks like my second GPU is not being used:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.88       Driver Version: 418.88       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:26:00.0 Off |                  N/A |
| 27%   62C    P2    76W / 280W |   9523MiB / 11175MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  On   | 00000000:27:00.0 Off |                  N/A |
|  0%   31C    P8    11W / 280W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0        24      C   python                                      9513MiB |
+-----------------------------------------------------------------------------+
How can I make sure the second GPU is also used? I am wrapping my model in DataParallel, roughly as shown below.
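This is a minimal sketch of what I am doing (the linear layer and input shapes are stand-ins for my real model and data):

import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()                  # stand-in for my actual network
model = nn.DataParallel(model, device_ids=[0, 1])  # should use both GPUs

x = torch.randn(64, 128).cuda()
out = model(x)    # expected: batch split into two chunks of 32, one per GPU
print(out.shape)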