Hey,
I found that batch_size = 32 during training is a good value for updating the model, but it leaves most of the GPU memory unused, as seen in this nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 511.79       Driver Version: 511.79       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  WDDM | 00000000:01:00.0 Off |                  N/A |
| N/A   70C    P8     3W /  N/A |   1284MiB /  4096MiB |      18%     Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
Only about 30% of the memory is used. Also, if I understand correctly, only 18% of the available GPU compute is being used? So it seems that batch_size = 32 is quite inefficient in terms of training time. At the same time, increasing the batch_size much further would also not be good for the model updates. Are there typical things one can do in this situation to increase efficiency, such as some form of parallelization?
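(To be precise, the memory figure works out like this, a quick sanity check on the numbers from the nvidia-smi output above:)

```python
# Memory utilisation from the nvidia-smi output: 1284MiB / 4096MiB
used_mib, total_mib = 1284, 4096
utilisation = 100 * used_mib / total_mib
print(f"{utilisation:.1f}% of GPU memory in use")  # -> 31.3%
```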
Thanks!
Best, JZ