Volatile GPU Utilization in Multi-GPU Training

Hi,

I am trying to train my model on 8 GPUs. Although all the GPUs are being used for training, I can see that the volatile GPU utilization is not ideal; it keeps fluctuating between all GPUs at 0% and some percentage of usage.

I tried increasing the number of workers in my dataloader. I also set pin_memory=True and used async=True with volatile=False for the training inputs.

But the problem still persists.


For models with lots of parameters in the classifier (the nn.Linear layers at the end), the synchronization has a large overhead. That’s why it’s often beneficial to use DataParallel only on the feature part (the conv layers) and to compute the linear layers on a single GPU.

Have a look at the ImageNet example, where this is done for AlexNet and VGG.
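
For reference, a minimal sketch of that pattern, using a torchvision VGG purely for illustration:

```python
import torch.nn as nn
import torchvision.models as models

# Replicate only the conv feature extractor across the GPUs; the large
# linear classifier stays on a single GPU, so its parameters are never
# scattered/gathered by DataParallel.
model = models.vgg16()
model.features = nn.DataParallel(model.features)
model.cuda()
```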

Could this be your issue?


Thanks for the reply. I will try this.

I am using three FC layers:

  1. nn.Linear(442048,512)
  2. nn.Linear(512,256)
  3. nn.Linear(256,128)

Also, if I remove the FC layers and use average pooling instead, would you expect the volatile GPU utilization to stay consistently high? Something like the sketch below is what I had in mind.
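
A rough sketch of the idea (the channel counts here are placeholders, not my real model):

```python
import torch.nn as nn

# Hypothetical head: global average pooling instead of the huge first
# nn.Linear(442048, 512), which removes most of the classifier parameters
# that DataParallel would otherwise have to synchronize.
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # (N, C, H, W) -> (N, C, 1, 1)
    nn.Flatten(),              # (N, C, 1, 1) -> (N, C)
    nn.Linear(512, 128),       # 512 channels is a placeholder
)
```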

Did you try out the approach of splitting the model on GPU and CPU?
I’m not sure about replacing the linear layers with average pooling.
Does the model performance stay the same more or less?
What did you observe?

No, I haven’t tried it yet. So in that case, do you mean to perform the linear layer computations on the CPU or on a single GPU?

Also, are there any suggestions on the number of workers that should be used? I am using 16 workers and have 10 GPUs. There is also the possibility that the GPUs don’t have a continuous stream of data available all the time, if the multiprocessing queue is filled more slowly than the batches are consumed.

I would try it on the CPU and see if it yields any speedup.
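
As a rough sketch of that split (the conv stack, shapes, and names below are placeholders, not your model):

```python
import torch
import torch.nn as nn

# Placeholder conv stack and input shape, only to illustrate the split.
feature_extractor = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
classifier = nn.Sequential(nn.Linear(8 * 32 * 32, 512),
                           nn.Linear(512, 256),
                           nn.Linear(256, 128))

features = nn.DataParallel(feature_extractor).cuda()  # conv part on all visible GPUs
inputs = torch.randn(16, 3, 32, 32)

out = features(inputs.cuda())      # forward pass on the GPUs, gathered on GPU 0
out = out.flatten(1).cpu()         # move the activations to the CPU
out = classifier(out)              # linear layers computed on the CPU
```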

Sure, that’s a valid assumption. You could try to find the bottlenecks in your code using the new torch.utils.bottleneck functionality.
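
It is run from the command line as a wrapper around your training script (the script name below is just a placeholder):

```
python -m torch.utils.bottleneck train.py [your usual args]
```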

It seems that if I don’t use my dataloader, then even without parallelizing the linear layers, the GPU utilization is full. In my dataloader I load the data from a pickle file in the constructor and then do random sampling with torch.multinomial to generate triplet samples. Do you have any suggestions on that?
I have also enabled pin_memory.

I guess the problem is in my data loading. I tried to reduce the random sampling by doing it once up front and saving the result to a pickle file, so now I have around 3 million samples in the pickle file. To utilize the 8 GPUs properly, I tried a batch size of 200 with 16 workers, but the problem persisted. So I decreased the batch size to 32 for the 8 GPUs; now the GPUs drop to 0% utilization less often, but at the same time the utilization is only around 10%, because essentially I am giving 4 samples per GPU.

Basically, the multiprocessing workers are not keeping the batches ready fast enough. Can you help me by sharing a dataloader that is implemented for multi-GPU processing?

Could you post an example of your Dataset as well as the shape information of your data?

I guess the problem is in the image decoding; the CPU is always busy with that…
I saw this crop-and-resize library, which I included (https://github.com/longcw/RoIAlign.pytorch).
But again, my images are 1280x1920x3, so that’s the bottleneck.
In my pickle file, I store the image_loc_info and crop_info for a set of images.

Then I use PIL to crop and resize
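
Roughly, the Dataset looks like this (a simplified sketch, not my exact code; the triplet sampling and collate_fn are omitted):

```python
import pickle

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset


class TripletCropDataset(Dataset):
    # Simplified sketch: the metadata comes from a pickle file, and the
    # expensive work (decoding a 1280x1920 image, crop, resize) happens
    # per sample inside __getitem__.
    def __init__(self, pickle_path, crop_size=(224, 224)):
        with open(pickle_path, 'rb') as f:
            # list of dicts with 'image_loc_info' (path) and
            # 'crop_info' (left, upper, right, lower)
            self.samples = pickle.load(f)
        self.crop_size = crop_size

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        entry = self.samples[idx]
        img = Image.open(entry['image_loc_info']).convert('RGB')
        img = img.crop(entry['crop_info']).resize(self.crop_size)
        arr = np.asarray(img, dtype=np.float32) / 255.0
        return torch.from_numpy(arr).permute(2, 0, 1)
```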

self.trainloader = torch.utils.data.DataLoader(self.trainset, batch_size=self.batch_size, shuffle=True, num_workers=20, pin_memory=True, collate_fn=self.trainset.collate_fn)

Then I iterate over the trainloader:
inputs_1 = Variable(inputs.cuda(device=self.gpus[0], async=True), volatile=False)
…