How to Set Learning Rate and Batch Size in Data Parallel

Hi,

I am using DataParallel across two GPUs.

How should I set my batch size and learning rate? My loss is not decreasing, while on a single GPU it did decrease.

Are there any rules of thumb for this, specifically for the learning rate and batch size?

Also, my second GPU is probably not being used:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.88       Driver Version: 418.88       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:26:00.0 Off |                  N/A |
| 27%   62C    P2    76W / 280W |   9523MiB / 11175MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  On   | 00000000:27:00.0 Off |                  N/A |
|  0%   31C    P8    11W / 280W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     24     C   python                                      9513MiB |
+-----------------------------------------------------------------------------+

How can I make sure the second GPU is also used? I am wrapping my model in DataParallel.
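For reference, this is roughly how I am wrapping the model (the model, batch size, and learning rate here are placeholder values, not my real ones):

```python
import torch
import torch.nn as nn

# Small stand-in model just for illustration; my real model is larger.
model = nn.Linear(10, 2)

# Wrap in DataParallel so the batch is split across both GPUs.
# device_ids=[0, 1] explicitly targets both cards.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model, device_ids=[0, 1]).cuda()

# Placeholder hyperparameters. With DataParallel, a batch of 64
# would be split so each of the two replicas sees 32 samples per step.
batch_size = 64
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```

Is this the right way to do it, or do I also need to change something about how the DataLoader or optimizer is set up?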