Hi everyone, I am training my segmentation model on a set of 8146 training images. Training was taking a long time, so I converted the data to LMDB first and then trained, but I did not see any reduction in training time. I am also using nn.DataParallel. Any idea how I can improve the training speed? Each epoch takes more than 20 minutes … I have experience training models in TensorFlow using tf.records, and that was quite fast.
Can you check how much your GPUs are utilized using nvidia-smi? If they are not fully utilized, it means the pre-processing (possibly reading time from the hard drive) is taking too long to keep the GPU busy.
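Another rough way to confirm this from inside the training loop (a minimal sketch, not your actual code; `profile_loader` and the toy dataset/model below are my own placeholders) is to time how long each iteration spends waiting on the data loader versus on the forward pass:

```python
import time
import torch

def profile_loader(loader, model, device, num_batches=50):
    """Rough split of time spent fetching batches vs. computing.

    If data_time dominates compute_time, the input pipeline
    (disk reads / pre-processing) is the bottleneck, not the GPU.
    """
    data_time, compute_time = 0.0, 0.0
    it = iter(loader)
    for _ in range(num_batches):
        t0 = time.perf_counter()
        images, targets = next(it)        # time spent waiting on the loader
        t1 = time.perf_counter()
        images = images.to(device)
        out = model(images)               # forward pass only, as a rough proxy
        if device.type == "cuda":
            torch.cuda.synchronize()      # wait for the GPU kernels to finish
        t2 = time.perf_counter()
        data_time += t1 - t0
        compute_time += t2 - t1
    return data_time, compute_time
```

If `data_time` is much larger than `compute_time`, the GPUs are starving and faster storage or more loader workers would help more than any model-side change.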
I don't think nvidia-smi gives a precise picture of GPU utilization. It only tells you the utilization at the very moment you run it.
Hmmm… you can use the watch command
watch -n 0.1 nvidia-smi
or use the dmon option
nvidia-smi dmon
Either will refresh the utilization every fraction of a second or so.
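If you would rather log utilization over time than watch it interactively, nvidia-smi also has a `--query-gpu` option that emits machine-readable CSV. A minimal sketch (the helper names here are mine, not a standard API):

```python
import subprocess

# One CSV row per GPU, e.g. "87, 10241" (utilization %, memory MiB)
QUERY = ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"]

def parse_util(line):
    """Parse one CSV row like '87, 10241' into (util_percent, mem_mib)."""
    util, mem = (int(x.strip()) for x in line.split(","))
    return util, mem

def sample_gpus():
    """Return [(util_percent, mem_mib), ...], one tuple per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True,
                         check=True).stdout
    return [parse_util(row) for row in out.strip().splitlines()]
```

Calling `sample_gpus()` in a loop during training lets you average utilization across an epoch instead of eyeballing a single instant.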