How to solve a memory error in multi-label classification?

I have 5529 labels, and each image has a variable number of them; in my final result an image can have up to 100 labels.
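So each target should end up as a 5529-dimensional multi-hot vector. Here is a toy sketch of the encoding I am aiming for (made-up label names):

from sklearn.preprocessing import MultiLabelBinarizer

# toy example: two images, three distinct labels overall
mlb = MultiLabelBinarizer()
y = mlb.fit_transform([['C01', 'C03'], ['C02']])
print(mlb.classes_)  # ['C01' 'C02' 'C03']
print(y)             # [[1 0 1]
                     #  [0 1 0]]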
I made my custom dataset following this code: https://www.kaggle.com/mratsim/starting-kit-for-pytorch-deep-learning
import os

import numpy as np
import pandas as pd
import torch
from PIL import Image
from sklearn.preprocessing import MultiLabelBinarizer
from torch.utils.data import Dataset


class myCustomDataset(Dataset):
    """My dataset."""

    def __init__(self, csv_file, root_dir, img_ext, transform=None):
        """
        Args:
            csv_file (string): Path to the csv file with annotations.
            root_dir (string): Directory with all the images.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        tmp_df = pd.read_csv(csv_file, sep=';', header=None)
        # every image referenced in the first CSV column must exist on disk
        assert tmp_df.iloc[:, 0].apply(
            lambda x: os.path.isfile(root_dir + x + img_ext)
        ).all(), "Some images referenced in the CSV file were not found"

        self.mlb = MultiLabelBinarizer()
        self.root_dir = root_dir
        self.img_ext = img_ext
        self.transform = transform

        self.X_train = tmp_df.iloc[:, 0]
        # binarize the per-image label lists into a dense multi-hot matrix
        self.y_train = self.mlb.fit_transform(
            tmp_df.iloc[:, 0].str.split()).astype(np.float32)

    def __len__(self):
        return len(self.X_train.index)

    def __getitem__(self, index):
        img = Image.open(self.root_dir + self.X_train[index] + self.img_ext)
        img = img.convert('RGB')
        if self.transform is not None:
            img = self.transform(img)

        label = torch.from_numpy(self.y_train[index])
        return img, label

Now I pass my validation set to the dataset and create a DataLoader.

transformedvalid_dataset = myCustomDataset(
    csv_file='/home/nis/Downloads/trialdata/Validation-Concepts.csv',
    root_dir='/home/nis/Downloads/trialdata/validation-set/',
    img_ext='.jpg',
    transform=transforms.Compose([
        transforms.RandomSizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ]))
I am able to create my custom dataset and make the validation loader.
print(len(transformedvalid_dataset))
14157

trainloader = torch.utils.data.DataLoader(transformedvalid_dataset, batch_size=32, shuffle=True)
dataiter = iter(trainloader)
images, labels = next(dataiter)
print(type(images))
print(images.shape)
print(labels.shape)

<class 'torch.Tensor'>
torch.Size([32, 3, 224, 224])
torch.Size([32, 14157])

But when I do the same for the training images (56,638 of them), I get a memory error as soon as I hand the images to the dataset. I am still far from building the DataLoader for the training set, which would be the next step and is where the batch size is set. I could not figure out the error or find any links that address it.

How can I solve it? Also, my label dimension equals the number of images in the validation set; is that correct? And how can I inspect what the data looks like at

self.y_train = self.mlb.fit_transform(tmp_df.iloc[:,0].str.split()).astype(np.float32)

and

label = torch.from_numpy(self.y_train[index])
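For example, is something like this the right way to peek at them? (A rough sketch using the objects the dataset already stores; the attribute names come from the class above.)

ds = transformedvalid_dataset
print(ds.y_train.shape)          # (num_images, num_classes)
print(ds.mlb.classes_[:10])      # first few label names the binarizer learned
print(ds.y_train[0].nonzero())   # indices of the active labels for image 0
img, label = ds[0]
print(label.shape, label.sum())  # tensor size and number of active labels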

What kind of memory error do you get?
Could you post the stack trace so that we can have a look?

I solved the problem. I had to put a comma inside the split function. Without it, str.split() splits on whitespace, so each image's whole comma-separated label string stayed a single token and became its own unique class; that is why the label dimension equalled the number of images (14,157 for the validation set). For the 56,638 training images, the dense binarized matrix would be roughly 56,638 × 56,638 float32 values, about 12 GB, hence the memory error.
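For anyone hitting the same thing, the relevant line becomes something like this. (Sketch only: I am assuming the comma-separated labels sit in the second CSV column; adjust the column index to your file.)

# before the fix, str.split() split on whitespace, so a string like
# 'C0004,C0172,C1234' stayed one token and became a single unique class
self.y_train = self.mlb.fit_transform(
    tmp_df.iloc[:, 1].str.split(',')).astype(np.float32)
# after the fix, len(self.mlb.classes_) should be at most 5529,
# not the number of images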

Memory errors happen a lot with Python on 32-bit Windows, because a 32-bit process only gets 2 GB of address space to play with by default.
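You can check which interpreter you are running with a quick one-liner:

import struct

# prints 32 on a 32-bit interpreter, 64 on a 64-bit one
print(struct.calcsize('P') * 8)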

One way to reduce memory use: the pandas.read_csv() function takes an option called dtype, which lets pandas know what types exist inside your CSV data.

For example, specifying dtype={'age': int} as an option to .read_csv() tells pandas that age should be interpreted as a number, which can save you a lot of memory:

pd.read_csv('data.csv', dtype={'age': int})

Or try the solution below:

pd.read_csv('data.csv', sep='\t', low_memory=False)