DataLoader workers killed often?

I’m wondering why this happens. If I have, say, 12 workers collecting image data, then every 12 iterations I see the training process pause for a good 30 seconds before continuing, while each iteration otherwise takes only about 2 seconds. So it seems like the workers are getting killed and respawned often? Can someone explain what is happening?

I think the workers are not fast enough to provide the next batches, so you have to wait for the data loading. I don’t think this issue is due to respawning the workers; rather, it seems you have an IO bottleneck.
You could try to play around with the number of workers or to speed up the data loading (e.g. move the data to an SSD if it’s stored on an HDD).
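One quick way to confirm an IO bottleneck is to time a full pass over the DataLoader for different worker counts. The sketch below is a minimal, self-contained illustration; `DummyDataset` and the `time.sleep` call are stand-ins for the real dataset and its disk IO, not the actual code from this thread:

```python
import time

import torch
from torch.utils.data import DataLoader, Dataset


class DummyDataset(Dataset):
    """Hypothetical stand-in for the real image dataset."""

    def __len__(self):
        return 64

    def __getitem__(self, idx):
        time.sleep(0.01)  # simulate slow per-sample disk IO
        return torch.randn(3, 64, 64)


def time_loader(num_workers, batch_size=8):
    # Iterate one full epoch and measure the wall-clock time,
    # so different num_workers settings can be compared directly.
    loader = DataLoader(DummyDataset(), batch_size=batch_size,
                        num_workers=num_workers)
    start = time.perf_counter()
    for _ in loader:
        pass
    return time.perf_counter() - start


if __name__ == "__main__":
    for w in (0, 2, 4):
        print(f"num_workers={w}: {time_loader(w):.2f}s")
```

If the epoch time keeps shrinking as workers are added, the loading is CPU/IO-bound; if it plateaus while the pauses remain, the disk itself is likely the limit.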

I see… Does it make a difference if I perform IO in the collate function instead of in the __getitem__ function?

I don’t think so. Could you post the __getitem__ function so that we could have a look at possible bottlenecks in the processing?
PIL SIMD might be a good drop-in replacement for PIL, if you are applying some (heavy) image processing.
Also, you could have a look at NVIDIA’s apex fast_collate implementation, which could speed up the loading.
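The idea behind fast_collate is to keep the images as uint8 while collating and defer the float conversion and normalization to the GPU. The following is a sketch of that idea rather than apex’s actual code; `fast_collate_sketch` is a hypothetical name, and the batch is assumed to be a list of (HWC uint8 image, label) pairs:

```python
import numpy as np
import torch


def fast_collate_sketch(batch):
    # Stack uint8 images into one tensor with no float math on the CPU;
    # normalization/division by 255 is deferred to the GPU.
    imgs = [item[0] for item in batch]
    targets = torch.tensor([item[1] for item in batch], dtype=torch.int64)
    h, w = imgs[0].shape[:2]
    out = torch.zeros((len(imgs), 3, h, w), dtype=torch.uint8)
    for i, img in enumerate(imgs):
        # HWC -> CHW; .copy() makes the transposed array contiguous
        out[i] = torch.from_numpy(img.transpose(2, 0, 1).copy())
    return out, targets
```

You would pass this as `collate_fn` to the DataLoader and run the float conversion (e.g. `batch.float().div_(255)`) on the device.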

So __getitem__ only does some list processing; it gets a random subset of the list.

def __getitem__(self, idx):
    img_paths = [idx]
    img_paths = self.get_reqd_paths(img_paths)

The collate function does all the heavy lifting:

def collate(<arguments>):
    buffer = [cv2.imread(path) for path in img_paths]
    buffer = torch_transform(buffer)

torch_transform is a composition (normalization, image cropping and augmentation, conversion to a CPU tensor). One of the augmentations is a cv2 rotate function, which I think might be the issue.
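One way to check whether the cv2 rotate is really the slow step is to time each transform in the composition separately. This is a minimal sketch; `profile_transforms` is a hypothetical helper that assumes the pipeline is a list of callables applied in order:

```python
import time


def profile_transforms(transforms, sample):
    # Run each transform of a composed pipeline in sequence and record
    # how long each step takes, to locate the bottleneck (e.g. the
    # cv2 rotate augmentation).
    timings = {}
    for t in transforms:
        start = time.perf_counter()
        sample = t(sample)
        name = getattr(t, "__name__", type(t).__name__)
        timings[name] = time.perf_counter() - start
    return timings
```

Running this on one sample and printing the dict should make the expensive stage obvious.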

All of this is done on a NumPy array. Maybe pin_memory=True might be a good idea?

Could you move the loading to __getitem__ and compare the time to load the data?
pin_memory might speed up the transfer from host to device, if you are using a GPU for your training.
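Concretely, pin_memory=True makes the DataLoader return batches in page-locked host memory, which allows an asynchronous host-to-device copy via non_blocking=True. A minimal sketch with a toy dataset (the shapes and dataset are placeholders, not the code from this thread):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the real image data.
dataset = TensorDataset(torch.randn(32, 3, 8, 8),
                        torch.zeros(32, dtype=torch.int64))

# pin_memory only helps when a GPU is present; the flag is a no-op on CPU.
use_cuda = torch.cuda.is_available()
loader = DataLoader(dataset, batch_size=8, pin_memory=use_cuda)

device = "cuda" if use_cuda else "cpu"
for images, targets in loader:
    # With pinned memory, this copy can overlap with GPU compute.
    images = images.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    # ... forward / backward pass would go here ...
```

This only speeds up the host-to-device transfer; it won’t fix a disk or CPU bottleneck in the loading itself.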

Yeah, I’m using a GPU. I’ll see if I can do this in the next couple of days. Tad busy at the moment. Thanks anyhow! :slight_smile: