Speeding up the training process and utilising the hardware optimally

Hi,
I am trying to speed up the training process and use my resources optimally. My GPU utilization is low but my GPU memory is full, and I cannot work out where the bottleneck is or how to approach it.
Below are my system results (screenshot):

I am training on 10k images, which are resized from 2048×2048 to 512×512.
I want to train at a larger resolution too, but it maxes out at 1024×1024, and even then my GPU utilisation is really low.
Batch size is 20.
Is there any way to allocate less memory and maximise GPU utilisation?
Also, any other suggestions to optimise it are hugely appreciated.

If GPU memory is getting full, one reason might be that you are accumulating history somewhere, e.g. by doing total_loss += loss rather than total_loss += float(loss).
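A minimal sketch of the difference, using a hypothetical toy model: adding a loss tensor into an accumulator keeps the whole autograd graph alive, while converting it to a Python float first lets the graph be freed.

```python
import torch

# Hypothetical toy model and batch, just to produce a loss with a graph attached.
model = torch.nn.Linear(4, 1)
criterion = torch.nn.MSELoss()
x, y = torch.randn(8, 4), torch.randn(8, 1)

loss = criterion(model(x), y)
bad_total = 0.0 + loss          # a tensor that still references the autograd graph
good_total = 0.0 + float(loss)  # a plain float; the graph can be garbage-collected

assert bad_total.requires_grad       # history is being accumulated here
assert isinstance(good_total, float)  # no history retained here
```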


Oh, I never knew about that. Yes, I am doing it:
train_loss = train_loss + ((1 / (batch_idx + 1)) * (loss.data - train_loss))
and for the validation loss:
valid_loss = valid_loss + ((1 / (batch_idx + 1)) * (loss.data - valid_loss))
I'll change it. Are there any more such issues that result in accumulating history?
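A sketch of the corrected running-average update (hypothetical loop over toy batches), using loss.item() instead of loss.data: .item() returns a plain Python float, so no tensor or graph reference is retained across iterations.

```python
import torch

# Hypothetical toy model; only the update pattern matters here.
model = torch.nn.Linear(4, 1)
criterion = torch.nn.MSELoss()

train_loss = 0.0
for batch_idx in range(3):  # stand-in for the real DataLoader loop
    x, y = torch.randn(8, 4), torch.randn(8, 1)
    loss = criterion(model(x), y)
    # loss.item() yields a plain float, so no autograd history accumulates
    train_loss = train_loss + ((1 / (batch_idx + 1)) * (loss.item() - train_loss))

assert isinstance(train_loss, float)
```

The same pattern applies to valid_loss; note the original valid_loss line combined += with an explicit valid_loss +, which double-counts the previous value.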

Mainly this. And for any computation for which you do not want gradients, wrap it in with torch.no_grad():.
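A minimal sketch of that, with hypothetical model and data names: inside the no_grad context, no autograd graph is recorded, so the forward pass (e.g. validation) uses noticeably less memory.

```python
import torch

# Hypothetical toy model and validation batch.
model = torch.nn.Linear(4, 1)
criterion = torch.nn.MSELoss()
x, y = torch.randn(8, 4), torch.randn(8, 1)

with torch.no_grad():          # disables graph construction for this block
    out = model(x)
    loss = criterion(out, y)

assert not loss.requires_grad  # no history was recorded
```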
