RuntimeError: DataLoader worker (pid 27351) is killed by signal: Killed

I’m running the data loader below, which applies a filter to a microscopy image prior to training so that I can count the red and green cells. This code filters for the red cells. Since applying this filter I keep getting the error message above. I have tried increasing the memory allocation to the maximum allowed, but that didn’t help. Is there a way I could modify the filter so it isn’t causing this issue, please? Many thanks in advance.

import os

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms, utils
#from torchvision.transforms import Grayscale
import pandas as pd
import pdb
import cv2

class CellsDataset(Dataset):
    # a very simple dataset

    def __init__(self, root_dir, transform=None, return_filenames=False):
        self.root = root_dir
        self.transform = transform
        self.return_filenames = return_filenames
        self.files = [os.path.join(self.root,filename) for filename in os.listdir(self.root)]
        self.files = [path for path in self.files
                      if os.path.isfile(path) and os.path.splitext(path)[1]=='.png']

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        path = self.files[idx]        
        image = cv2.imread(path)
        sample = image.copy()
        # cv2 loads images in BGR order, so zeroing channels 0 and 1
        # removes the blue and green channels and keeps only the red
        # channel.
        sample[:, :, 0] = 0
        sample[:, :, 1] = 0

        if self.transform:
            sample = self.transform(sample)

        if self.return_filenames:
            return sample, path
        else:
            return sample

If you set num_workers=0 (this turns off multiprocessing), do you get any error?
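That is just a change to how the DataLoader is constructed, for example (a minimal sketch; dataset stands in for your CellsDataset instance and the other arguments are illustrative):

from torch.utils.data import DataLoader

# num_workers=0 loads batches in the main process, so the real traceback surfaces
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=0)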


@JuanFMontesinos I haven’t tried doing this. I will give this a try and let you know. Thanks for your reply.

@JuanFMontesinos I still get the same error message after setting num_workers=0. Anything else to try?

Hi,
In theory, if you set num_workers=0, the DataLoader runs on the main thread, so it’s a bit strange to get a worker error. Typical failures are lack of memory (which raises a MemoryError). Is there any extra info (the line on which the error is produced, etc.)? Are you running the code on a server which may limit your resources? A Killed signal looks as if the process were terminated by the user/superuser from outside the Python process itself.


Yes. The code is running on a server. This is the error message I am getting now:

slurmstepd: error: Detected 1 oom-kill event(s) in step 9708662.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

Is there an alternative way I could change this code so it doesn’t copy the image each time? That copy is likely what is causing the out-of-memory issue.

    def __getitem__(self, idx):
        path = self.files[idx]        
        image = cv2.imread(path)
        sample = image.copy()
        # set blue and green channels to 0
        sample[:, :, 0] = 0
        sample[:, :, 1] = 0

@alameer Just curious: why do you need to copy the image? Can the filter be applied directly to the image?


Very good question. The reason I am copying the image is that the example I am working from copies it, and I thought I needed to copy it before filtering. If the code works without copying the image, I think that would help. I will give this a go. Many thanks for pointing this out.
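For reference, a minimal sketch of __getitem__ with the copy removed might look like this (the rest of the class is assumed unchanged; cv2.imread already returns a fresh array on every call, so modifying it in place is safe):

    def __getitem__(self, idx):
        path = self.files[idx]
        sample = cv2.imread(path)
        # zero blue and green in place (cv2 uses BGR order), keeping only red
        sample[:, :, 0] = 0
        sample[:, :, 1] = 0

        if self.transform:
            sample = self.transform(sample)

        if self.return_filenames:
            return sample, path
        return sample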

So, Slurm is the queue manager. I think the error happens because you are using more memory than you requested. Try to request more memory (indeed, not using .copy() will help) or reduce the batch size.
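Reducing the batch size is just a change to the DataLoader arguments, roughly like this (illustrative values only; CellsDataset and the path are placeholders for your own setup):

from torch.utils.data import DataLoader

dataset = CellsDataset(root_dir="data/train")  # hypothetical path
loader = DataLoader(dataset,
                    batch_size=8,   # e.g. halved from 16 to lower peak memory per step
                    shuffle=True)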


Thanks for your reply! I have tried a smaller batch size and removed the copy() from the code, but I keep getting the same error message. Is there anything else you might suggest? When I run this code with a different CNN model it works okay, but when I use the VGG model I get this error message.

So, it seems you are running out of memory on the GPU (are you using GPUs?). Can you try a GPU with more memory, or reduce the batch size even further?

I’m not using GPUs. I think I need to resize my image when using the VGG model, as below. If I do so, I get the following error message. Are you able to advise what I need to change, please?

RuntimeError: Calculated padded input size per channel: (1 x 1). Kernel size: (3 x 3). Kernel size can't be greater than actual input size

The code for my convnet file:

import torch
import torch.nn as nn

class Convnet(nn.Module):
    """
    A custom convnet for convolving 1080x1080 fluorescence micrograph images.
    """

    def __init__(self):
        super(Convnet,self).__init__()
        self.main = nn.Sequential(

            nn.Conv2d(3,16,10,stride=2,padding=0),
            nn.LeakyReLU(0.2,inplace=False),
            nn.Conv2d(16,32,7,stride=2,padding=0),
            nn.LeakyReLU(0.2,inplace=False),
            nn.Conv2d(32,64,3,stride=3,padding=0),
            nn.LeakyReLU(0.2,inplace=False),
            nn.Conv2d(64,96,3,stride=2,padding=0),
            nn.LeakyReLU(0.2,inplace=False),
            nn.Conv2d(96,128,3,stride=2,padding=0),
            nn.LeakyReLU(0.2,inplace=False),
            nn.Conv2d(128,192,3,stride=2,padding=0),
            nn.LeakyReLU(0.2,inplace=False),
            nn.Conv2d(192,256,3,stride=2,padding=0),
            nn.LeakyReLU(0.2,inplace=False),
            nn.Conv2d(256,256,3,stride=2,padding=0),
            nn.LeakyReLU(0.2,inplace=False),
            nn.Conv2d(256,1,1,stride=1,padding=0)
        )

    def forward(self, x):
        # x: B x 3 x 1024 x 1024
        return self.main(x) # B x 1 x 9 x 9

Thanks a lot

It seems that you have reduced the size too much. In short, from one of the convolutions onwards your feature map is already smaller than 3x3 (the kernel size); in fact it is 1x1, so it is not even 2D any more but effectively a vector.

Use a bigger input image or delete some convolutions

By default VGG images are 112x112 if I’m not wrong (so that is the optimal size for the pretrained weights).
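To see exactly where the feature map collapses, you can trace the spatial size through the layers defined above using the standard Conv2d output-size formula (a quick diagnostic sketch; the starting size of 256 is just an example):

def conv_out(size, kernel, stride, padding=0):
    # standard Conv2d output-size formula (square input, no dilation)
    return (size + 2 * padding - kernel) // stride + 1

size = 256  # try whatever side length you plan to feed in
for kernel, stride in [(10, 2), (7, 2), (3, 3), (3, 2), (3, 2), (3, 2), (3, 2), (3, 2), (1, 1)]:
    size = conv_out(size, kernel, stride)
    # once size drops below the next kernel, that convolution raises the error above
    print(f"kernel {kernel}x{kernel}, stride {stride} -> {size}x{size}")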


Many thanks for your help. Increasing the image size to 1024x1024 has resolved the issue. If the optimal size is 112x112 and I am using 1024x1024, how would you anticipate this affecting my training model? Thanks

So, the point is that the original VGG was trained at that size, so the network is “used to seeing” objects whose sizes are contained in a 112x112 image.

There is something called the receptive field (rather than boring you with a rough explanation, I will link to a blog post: https://towardsdatascience.com/understand-local-receptive-fields-in-convolutional-neural-networks-f26d700be16c). So if the input image is too large and the network is not deep enough, this may harm the results.

So, imagine that you are a neuron and you have to describe what you see. If you are too close to an object you will be able to provide fine details but no “panoramic” context. However, if you are too far away, you will provide a good overall description but only coarse details.

In that way, it is said that the first layers learn basic features (gradients) while deeper layers learn more abstract features.

So, to address this problem, several techniques were developed, like pooling or dilated convolutions.
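For illustration, both can be written in a couple of lines (a sketch only, not part of the model above):

import torch.nn as nn

# two common ways to enlarge the receptive field without stacking many extra layers
pooled = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.MaxPool2d(2),  # halves the spatial size, so later kernels cover a wider area of the input
)
dilated = nn.Conv2d(3, 16, kernel_size=3, padding=2, dilation=2)  # a 3x3 kernel that spans a 5x5 area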


@JuanFMontesinos I hope you could help me again, please? I’m still having an issue with running the model: it keeps running out of memory, even after reducing the size of the image, and each step during training takes a long time to complete. I think the issue is with the snippet of code below, where I read the image and then split it by colour. If I run the model without the image split it works okay, but I need to split the image by channel to do the counting. I have tried changing the batch size from 16 to 8. The only other difference in this model is the change to grayscale, which I don’t seem to be able to do because the image is in PIL format. Can you suggest anything, please?

    def __getitem__(self, idx):
        path = self.files[idx]

        img = cv2.imread(path)
        sample = cv2.resize(src=img, dsize=(1024, 1024))
        #sample = functional.to_grayscale(sample, num_output_channels=3)

        # zero the blue and green channels (BGR order), keeping only red
        sample[:, :, 0] = 0
        sample[:, :, 1] = 0

Hi,
I don’t know exactly how that assignment replaces the values internally. You can try:

        sample[:, :, 0].zero_()
        sample[:, :, 1].zero_()

(Assuming that sample is a torch tensor.) That’s an in-place replacement, which should consume no extra memory.

It’s normal that a single iteration takes a long time, since you are running on the CPU. You could try using Google Colab.


sample is now a numpy array, as I had to change it to split the image. Is there a quick method to change it back, please?

    sample[:, :, 0].zero_()
AttributeError: 'numpy.ndarray' object has no attribute 'zero_'

Thanks

I really have no idea, sorry. I don’t really know the details of numpy’s operators.
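For what it’s worth, with a NumPy array the plain slice assignment is already an in-place operation, so a sketch like the one below should not allocate a second copy of the image (zero_() only exists on torch tensors):

import numpy as np

sample = np.zeros((1024, 1024, 3), dtype=np.uint8)  # stand-in for the loaded image
# slice assignment writes into the existing buffer; no extra array is created
sample[:, :, 0] = 0
sample[:, :, 1] = 0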


Many thanks for all of your help. Changing the image size to 256x256 and the batch size to 16 helped resolve the memory issue. Thanks
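For anyone reading later, the working combination was roughly: resize each image to 256x256 inside __getitem__, zero the blue and green channels in place, and train with a batch size of 16. A condensed sketch under the same assumptions as above:

    def __getitem__(self, idx):
        path = self.files[idx]
        img = cv2.imread(path)
        sample = cv2.resize(src=img, dsize=(256, 256))  # smaller images keep host memory in check
        sample[:, :, 0] = 0  # in-place: drop blue (BGR order)
        sample[:, :, 1] = 0  # in-place: drop green, keeping only the red channel

        if self.transform:
            sample = self.transform(sample)
        return (sample, path) if self.return_filenames else sample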