Keeping a small batch_size while using GPU memory more efficiently?

Hey,

I found that a batch_size of 32 during training is a good value for updating the model, but it largely underuses the available memory on my GPU, as seen in this nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 511.79       Driver Version: 511.79       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ... WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   70C    P8     3W /  N/A |   1284MiB /  4096MiB |     18%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Only about 30% of the memory is used. Also, if I understand correctly, it seems only 18% of the available GPU compute is used? This makes it look like batch_size = 32 is very inefficient in terms of training time. Nevertheless, increasing the batch_size much further would not be good for the model updates either. Are there typical things one can do in this situation to increase efficiency, such as some form of parallelization?
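For context, my training loop follows the standard pattern sketched below (the model and data here are just placeholders, not my actual code, and I'm assuming a plain PyTorch setup); the only relevant detail is batch_size = 32 on the DataLoader:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder dataset and model, just to make the example self-contained.
dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)  # the batch size in question

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(3):
    for inputs, targets in loader:
        # Move each mini-batch of 32 samples to the GPU and do one update step.
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()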

Thanks!

Best, JZ