Building simple, pre-tested NN but running into memory problems

Thank you very much in advance for your help.

I want to train an image-classifying NN but am running into the following memory error:

RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 123681636352 bytes. Error code 12 (Cannot allocate memory)

I found this error strange because I am running on an ml.t2.xlarge SageMaker instance (lots of memory). Here’s a bit of context on the code:

I wrote a python script to:
i) load a file list (csv file) into the dataloader
ii) convert input images from TIFF to NumPy to Torch
iii) permute the channels (H, W, C -> C, H, W)
iv) crop each picture into a smaller one to avoid memory problems (the dataset holds 380GB of images; each .TIFF image is ~25MB)

Here is the python script:

import os

import pandas as pd
import tifffile as tiff
import torch
from torch.utils.data import Dataset
from torchvision import transforms


class load_csv(Dataset):
    def __init__(self, csv_file, root_dir, transform=None):
        self.annotations = pd.read_csv(csv_file)  # file list with image names and labels
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, index):
        img_path = os.path.join(self.root_dir, self.annotations.iloc[index, 0])
        # TIFF -> NumPy -> Torch, then permute (H, W, C) -> (C, H, W)
        image = torch.from_numpy(tiff.imread(img_path)).permute(2, 0, 1).float()
        #Image.MAX_IMAGE_PIXELS = None

        image.transform = transforms.RandomResizedCrop(224)

        y_label = torch.tensor(int(self.annotations.iloc[index, 1]))

        #if self.transform:
        #    image = self.transform(image)

        return (image, y_label)

Now, I’m running into a memory-related problem that I don’t know how to resolve. I am running an ml.t2.xlarge SageMaker notebook with plenty of memory, and the images being loaded into the dataloader are supposed to have undergone a transforms.RandomResizedCrop(224) transformation, so they should be smaller.
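
For reference, applying that crop inside __getitem__ would look roughly like this (a sketch only; it assumes a torchvision version whose RandomResizedCrop accepts (C, H, W) tensors, while older releases expect a PIL image):

    # inside load_csv
    def __getitem__(self, index):
        img_path = os.path.join(self.root_dir, self.annotations.iloc[index, 0])
        image = torch.from_numpy(tiff.imread(img_path)).permute(2, 0, 1).float()
        image = transforms.RandomResizedCrop(224)(image)  # apply the crop -> 3 x 224 x 224
        y_label = torch.tensor(int(self.annotations.iloc[index, 1]))
        return (image, y_label)

Note that the full image is still decoded into memory before the crop, so the crop by itself does not reduce the peak allocation.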

Why am I running into this problem?

Please find the project repo with CNN.ipynb and the python script csv_loader.py, where the __getitem__ method used by the dataloader is defined (here).

Thank you very much again,
Edwin

FOLLOW UP:

I added num_workers=3 to the dataloaders hoping it would help.
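
The loaders are set up roughly like this (a sketch; the csv path, image directory, and batch size are placeholders for what is actually in CNN.ipynb):

    from torch.utils.data import DataLoader

    # sketch of the loader setup; paths and batch size are placeholders for CNN.ipynb
    train_set = load_csv(csv_file='train.csv', root_dir='images/')
    train_loader = DataLoader(train_set, batch_size=4, shuffle=True, num_workers=3)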

I still get a memory-related error. I ran this on a fresh kernel with no other operations running. Here is the error I get:

RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/utils/data/dataset.py", line 257, in __getitem__
    return self.dataset[self.indices[idx]]
  File "/home/ec2-user/SageMaker/csv_loader.py", line 21, in __getitem__
    image = torch.from_numpy(tiff.imread(img_path)).permute(2,0,1).float()
RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 10856909952 bytes. Error code 12 (Cannot allocate memory)

[SOLUTION]

The image was a multi-level TIFF file. If the level is not specified, the first level is read by default. In this case, each first-level image was between 25 and 50MB, sometimes more. The DataLoader batch size had to be reduced to 1 to avoid the error. In the end, I chose the 2nd-level images to train and test the model.
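
For reference, the failed allocation is consistent with a full-resolution level: 10,856,909,952 bytes of float32 across 3 channels works out to roughly a 30,000 x 30,000 pixel image, far more than the 25-50MB compressed file size suggests. Reading a smaller level with tifffile looks roughly like this (a sketch; the levels attribute needs a recent tifffile version, and the right level index depends on how the file stores its pyramid):

    import tifffile as tiff

    def read_tiff_level(img_path, level=2):
        # Read one pyramid level of a multi-level TIFF instead of the
        # full-resolution first level. Sketch only: the levels attribute needs a
        # recent tifffile, and the level index depends on the file's pyramid.
        with tiff.TiffFile(img_path) as tf:
            series = tf.series[0]
            if hasattr(series, 'levels') and len(series.levels) > level:
                return series.levels[level].asarray()
            # fall back to reading page `level` of a plain multi-page TIFF
            return tiff.imread(img_path, key=level)

__getitem__ can then call read_tiff_level(img_path) in place of tiff.imread(img_path) before converting to a tensor.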

Blessings