Num_workers dead threads

Hello, I have run into a problem with num_workers while training my models. I set up a simple transfer learning task with densenet121 and CIFAR10 resized to the ImageNet resolution of 224x224:

import torch.nn as nn
from torchvision.models import densenet121

class DenseNet(nn.Module):
    def __init__(self):
        super(DenseNet, self).__init__()

        # densenet backbone (not pretrained)
        self.densenet = densenet121(pretrained=False)

        # replace the classifier with a 10-class head for CIFAR10
        self.densenet.classifier = nn.Linear(in_features=self.densenet.classifier.in_features, out_features=10)

    def forward(self, x):
        return self.densenet(x)

The transform and the data loader are defined as follows:

from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import CIFAR10

# transform: upscale CIFAR10 (32x32) to the ImageNet resolution
transform = transforms.Compose([transforms.Resize((224, 224)),
                                transforms.ToTensor()])
cifar_dataset = CIFAR10(root="data", train=True, download=True, transform=transform)  # dataset root is arbitrary

# define the training data loader
training_loader = DataLoader(dataset=cifar_dataset, batch_size=64, shuffle=True, pin_memory=True, num_workers=8)

So I use 8 workers on the host to feed the data to the device. The model trains just fine; however, I have noticed that the workers don't seem to be doing anything (in fact, none of the 8 appear active), while the main process is the one loading the data (and it only reaches about 10% CPU).

So it looks like this creates a bottleneck, and the GPU isn't utilized at more than 15%:
[screenshot: GPU utilization stuck below ~15%]

If anyone knows what is going on, please advise.

Specs:

  • CPU: AMD 2700X
  • GPU: GTX 1080Ti
  • The dataset is on SSD
  • CUDA: 10.0
  • cuDNN: 7.4.2

Hi,

Given that your disk usage is very low, your workers seem to have very little to do anyway.
It's weird that both CPU and GPU usage are very low (and steady). Are you working with a very small net/inputs?

Hi, I built a simple transfer learning task, and the network I am using is DenseNet121, which is not pretrained. Also, I use CIFAR10, resized to the ImageNet resolution (224, 224) in the data loader. I assumed that was enough complexity to test out num_workers in the DataLoader.

I think that looks fine. The task is so easy (CIFAR is so incredibly small) that it doesn't even really utilize one CPU, so I don't think this creates a bigger bottleneck than spreading the work across workers would.

Also, the resizing op is so cheap that it is maybe too fast to draw any conclusions from. Could you run this on a larger dataset? You could use UTKFace or CACD or something that is simple to download. Or you could simply resize the CIFAR images and save them to disk at a higher resolution (maybe 500x500), as in the sketch below. I think loading the image from disk and reading it into the tensor is the main task where I would suspect bottlenecks.
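Something like this is what I mean by dumping an upscaled copy of CIFAR to disk (just a rough sketch; the output folder and the 500x500 size are arbitrary, and I haven't tested it end-to-end):

# Rough sketch (untested): save CIFAR10 as upscaled PNGs so the DataLoader
# actually has to read and decode real image files from disk.
# "cifar10_500px" and the 500x500 size are arbitrary choices.
import os
from torchvision.datasets import CIFAR10

out_dir = "cifar10_500px"
dataset = CIFAR10(root="data", train=True, download=True)  # no transform -> returns PIL images

for idx, (img, label) in enumerate(dataset):
    class_dir = os.path.join(out_dir, str(label))
    os.makedirs(class_dir, exist_ok=True)
    img.resize((500, 500)).save(os.path.join(class_dir, f"{idx}.png"))

You could then read the folder back with torchvision.datasets.ImageFolder, so every worker actually has to hit the disk and decode a real image file.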

In any case, another thing to try (which probably doesn't make a difference) is not using pinned memory (by the way, for me it has always made things substantially slower than not using it; I don't necessarily expect a correlation with your issue, but it's a simple thing to try).

Isn’t pin_memory supposed to speed up CUDA training since the tensors are loaded into page-locked memory on the host?

Yes, it is; I just couldn't reproduce this in practice yet :(. In my experience, I got up to 2x slower training. E.g., see https://github.com/rasbt/deep-learning-book/blob/master/code/model_zoo/pytorch_ipynb/convnet-resnet34-cifar10-pinmem.ipynb. Maybe I am doing something wrong (I also tried this as separate script.py files, and I tried a different machine and different models; it's always approximately the same behavior).
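For what it's worth, my understanding is that pin_memory is only really expected to pay off when you also copy to the GPU with non_blocking=True, roughly like this (just a sketch; the random data and the worker count are stand-ins for your real setup):

# Minimal sketch: pinned host memory is usually only expected to pay off together
# with non_blocking=True copies, which let the host-to-device transfer overlap
# with GPU work. The random tensors stand in for the real CIFAR10 data.
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda")

dataset = TensorDataset(torch.randn(512, 3, 224, 224), torch.randint(0, 10, (512,)))
loader = DataLoader(dataset, batch_size=64, pin_memory=True, num_workers=2)

for images, targets in loader:
    # asynchronous copies out of page-locked host memory
    images = images.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    # ... the forward/backward pass would run here while the next copy is queued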

That is why I am trying to figure out this whole multi-worker data loader question once and for all. I found that a lot of other practitioners have the same issue, but I couldn't find a reasonable solution yet.

So your observation only applies when the memory is pinned? What do you get when you set it to False?

Exactly the same performance with pin_memory=False :confused: In fact, I have the same problem with the model we are building at work, which is a massive variational autoencoder with an image resolution of 1024x1024. The setup on that machine is a Threadripper 2950X and 2x RTX 2080 Ti.

So it looks like this creates a bottleneck, and the GPU isn't utilized at more than 15%:

I think I misinterpreted your figure before:

The second-to-last row is the GPU process? I think this may look as expected then. You can see that 5 workers are doing something, but the task is not very demanding, so the remaining 3 of the 8 workers may not really be used. They all use ~200 MB of memory, so the 3 workers may just have finished at the time you took that screenshot and be waiting for the remaining 5 to finish. The bottleneck is likely IO, since you have only ~0.4% CPU usage.

The 12.4% GPU usage maybe also looks normal, because you have a very small model (2.660 MB), and there's probably not much for the GPU to do in terms of computation compared to the other tasks that need to be performed during the forward & backward passes, like updating the weights and so forth?

Another possibility is that you have too many workers and there is a communication-based bottleneck between the processes. Maybe reduce the number of workers temporarily to 1 and 2 and see if you get better GPU utilization; a quick timing sketch for this is below. I usually use 2 or 4 processes max and always get something like 99% utilization for relatively standard models like ResNet-34, etc.
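If it helps, a quick way to compare worker counts in isolation is to time how fast batches come out of the DataLoader alone, without any GPU work at all (rough sketch; the random stand-in dataset is just for illustration):

# Rough benchmark: how fast does the DataLoader alone deliver batches for
# different worker counts? No GPU involved, just loading and collating.
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# random stand-in data; substitute the resized CIFAR10 dataset from above
dataset = TensorDataset(torch.randn(512, 3, 224, 224), torch.randint(0, 10, (512,)))

for num_workers in (0, 1, 2, 4, 8):
    loader = DataLoader(dataset, batch_size=64, shuffle=True,
                        pin_memory=True, num_workers=num_workers)
    start = time.time()
    for _ in range(3):                      # a few passes to amortize worker startup
        for _batch in loader:
            pass
    print(f"num_workers={num_workers}: {time.time() - start:.2f}s")

(On Windows you'd need to put this inside an if __name__ == "__main__": guard, since the workers are spawned as separate processes.)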

Thank you, I will try that; meanwhile, could you give me an example of a network that gets 99% utilization? I would like to reproduce the results on my machine.

Same performance with 2 workers.

Hm, do you maybe have a very slow disk, e.g., a regular spinning hard drive?

Maybe try one of my toy example notebooks. I just ran those and get ~82% utilization with ResNet-50 on MNIST and 99% with VGG-16 on CIFAR-10.

(you wouldn’t need anything extra to run them).
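If you'd rather not use the notebooks, something along these lines should be enough to reproduce it (a rough sketch, not my exact notebook code; torchvision's ResNet-50 and arbitrary hyperparameters):

# Rough sketch of the kind of job that should keep the GPU busy: torchvision's
# ResNet-50 on CIFAR10 upscaled to 224x224. Hyperparameters are arbitrary.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import CIFAR10
from torchvision.models import resnet50

device = torch.device("cuda")

transform = transforms.Compose([transforms.Resize((224, 224)),
                                transforms.ToTensor()])
train_set = CIFAR10(root="data", train=True, download=True, transform=transform)
loader = DataLoader(train_set, batch_size=64, shuffle=True,
                    num_workers=4, pin_memory=True)

model = resnet50(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

model.train()
for images, targets in loader:
    images = images.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()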

Thanks, going to run them now. My data is on an SSD, so I don't think the bottleneck is there.

Just realized that I ran these without specifying the number of workers. However, I just set it to 4 in each case and get about the same utilization.

OK, it seems to work, judging by nvidia-smi.

The default process monitor must be lagging then, I suppose, given what it shows.

Do you happen to know what the breaks in the GPU processing are? The GPU doesn't do anything for a little while and then starts working again (it works in cycles).

OK, great. Maybe check nvidia-smi with the original model you were using. I think nvidia-smi is more or less accurate, because the utilization almost perfectly correlates with the fan speed, which correlates with perceived noise :stuck_out_tongue:

Yep, it works, using 2 workers. I don't know what the deal is with the original process monitor :confused:

Do you happen to know what the breaks in the GPU processing are? The GPU doesn't do anything for a little while and then starts working again (it works in cycles).

I guess updating the weights probably doesn't require much compute. Or other tasks that you may have in your network where you are copying values, etc. Essentially, everything that is not a large dot product or matrix multiplication.
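If you want to see exactly where those pauses come from, you could time how long each iteration waits for the next batch versus how long the actual GPU step takes. A rough sketch (the dataset and model here are stand-ins for your real ones):

# Rough instrumentation sketch: separate "waiting for the next batch" from
# "GPU compute" per iteration. torch.cuda.synchronize() is needed so the
# timing isn't hidden by asynchronous kernel launches.
import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda")

# stand-in data and model; substitute the real loader and the DenseNet from above
dataset = TensorDataset(torch.randn(512, 3, 224, 224), torch.randint(0, 10, (512,)))
loader = DataLoader(dataset, batch_size=64, num_workers=2, pin_memory=True)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

end = time.time()
for images, targets in loader:
    data_time = time.time() - end            # time spent waiting on the DataLoader
    images = images.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()
    step_time = time.time() - end - data_time  # time spent on the actual training step
    print(f"data: {data_time:.3f}s  compute: {step_time:.3f}s")
    end = time.time()

If the data time dominates, the gaps are the loader; otherwise they are the cheaper ops in the training step itself.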