Super slow training of ResNet18 on a V100 GPU

Hello,

I have been training a ResNet18 network on a synthetic dataset of about 350,000 images. Each image has a label, and all the labels are stored in a single text file. I have created a simple custom Dataset class that loads one image at a time inside the __getitem__ method. The label file is read only once in __init__ and converted to a tensor (since it is small).
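
For reference, here is a rough sketch of what the Dataset looks like (the file layout, label format, and transforms below are simplified placeholders, not my exact code):

```python
import os
import torch
from torch.utils.data import Dataset
from PIL import Image
import torchvision.transforms as T

class SyntheticImageDataset(Dataset):
    def __init__(self, image_dir, label_file):
        # The label file is read once and kept in memory as a tensor (it is small).
        with open(label_file) as f:
            self.labels = torch.tensor([float(line.strip()) for line in f])
        self.image_dir = image_dir
        self.image_files = sorted(os.listdir(image_dir))
        self.transform = T.Compose([T.Resize((224, 224)), T.ToTensor()])

    def __len__(self):
        return len(self.image_files)

    def __getitem__(self, idx):
        # One image is loaded from disk per call.
        path = os.path.join(self.image_dir, self.image_files[idx])
        img = Image.open(path).convert("RGB")
        return self.transform(img), self.labels[idx]
```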

The trouble is that one epoch takes about 45+ minutes on a system with a V100 GPU. Isn't that a bit too much? Could it be due to slow data loading? Please share your opinions on this.

Thanks

Your speed works out to about 129 images/sec, which is lower than I would expect. See https://developer.nvidia.com/deep-learning-performance-training-inference — the PyTorch ResNet50 model runs inference at 905 images/sec on a single V100. I would try increasing the batch size, since ResNet18 is a fairly small model for a V100, and also the number of workers for the DataLoader.
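
As a sketch of what I mean (the dataset variable is a placeholder, and the exact numbers will depend on your CPU count, memory, and storage):

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,                  # your custom Dataset instance
    batch_size=256,           # ResNet18 easily fits larger batches on a V100
    shuffle=True,
    num_workers=8,            # tune to the number of CPU cores available
    pin_memory=True,          # faster host-to-GPU transfers
    persistent_workers=True,  # avoid re-spawning workers every epoch (PyTorch >= 1.7)
    prefetch_factor=2,        # batches pre-loaded per worker
)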

Thanks, Dhananjay, for taking the time to reply! Those were my intuitions as well. I have already played with the num_workers parameter, but changing it does not seem to matter much; I have tried keeping it at 8, 16, and 32.

My batch size is already 64, and I do not want to make it larger in case that makes it harder for the network to generalize (as far as I understand). Is there anything else that could solve this issue for me? Please note that the images are loaded inside the __getitem__ method of the Dataset class.
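
For what it's worth, here is a rough sketch of how I could time the DataLoader by itself, to check whether data loading (rather than the GPU) is the bottleneck ("loader" is the DataLoader built from my Dataset):

```python
import time

start = time.time()
n_images = 0
for images, labels in loader:
    # No model forward/backward here: this measures the data pipeline only.
    n_images += images.size(0)
    if n_images >= 10_000:
        break
elapsed = time.time() - start
print(f"data loading only: {n_images / elapsed:.1f} images/sec")
```

If this number is close to the ~129 images/sec seen during training, the data pipeline is the limiting factor rather than the GPU.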