Working with very large .tif files

Hi,

I have a dataset which comprises over a thousand high-resolution whole-slide digital pathology images, and my goal is to create a classifier.

The problem I’m facing is that I can’t train an image classifier because I run out of memory [tried to allocate more memory than is available. Session has restarted.]. Each .tif file has dimensions (60797, 34007, 3), and I want to scale them down without losing critical information.

Can anyone help me with how to work with these huge .tif files? Thanks.

A tensor of shape (60797, 34007, 3) is a serious memory problem. If you can’t scale the images down, then all I can suggest is to crop them into meaningful small patches.
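For example, something along these lines (just a rough sketch: the 512-pixel patch size is arbitrary, and it still assumes a whole slide fits in RAM while you crop it):

import numpy as np
from tifffile import imread

def extract_patches(image, patch_size=512):
    # Crop an (H, W, 3) array into non-overlapping patch_size x patch_size tiles.
    h, w, _ = image.shape
    patches = []
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            patches.append(image[y:y + patch_size, x:x + patch_size])
    return patches

image = imread('slide.tif')        # 'slide.tif' stands in for one of your files
patches = extract_patches(image)   # list of (512, 512, 3) views into the array

In practice you would save the informative patches to disk (or filter out the mostly-background ones) and train on those instead of the full slides.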

Thanks,

but I’m just a beginner, and below is my code. Could you please show me how to convert the images into small patches in this setup?

import os

import numpy as np
import pandas as pd
import torch
from tifffile import imread
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms


class ImageData(Dataset):
    def __init__(self, df, data_dir, transform):
        super().__init__()
        self.df = df
        self.data_dir = data_dir
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        img_name = str(self.df.image_id[index]) + '.tif'
        label = self.df.label[index]
        img_path = os.path.join(self.data_dir, img_name)

        # Reads the full-resolution slide into memory as an (H, W, 3) array.
        image = imread(img_path)

        image = self.transform(image)
        return image, torch.tensor(label)


data_transf = transforms.Compose([transforms.ToPILImage('RGB'),
                                  transforms.Resize((224, 224)),
                                  transforms.RandomRotation(137.5),
                                  transforms.ToTensor(),
                                  transforms.Normalize(mean=[0.610, 0.377, 0.233],
                                                       std=[0.377, 0.233, 0.144])])

train_data = ImageData(df=df, data_dir=image_path, transform=data_transf)
trainloader = DataLoader(dataset=train_data, batch_size=4)

And I forgot to mention: my memory issue [tried to allocate more memory than is available. Session has restarted.] occurs when the training loop starts.

From the code that you shared, it seems like you are reducing the image from (60797, 34007, 3) to (224, 224, 3) and then applying a random rotation and a few other transformations. The question now is where you are getting the memory error: on the CPU or on the GPU?

The DataLoader returns tensors of size (3, 224, 224), which should not cause memory problems on the GPU unless you use a big batch size.

True, it’s not happening on the GPU, but when my program executes this line of code in the training loop:

for images, labels in trainloader:

my CPU RAM reaches its limit and I get the above-mentioned error. My machine has 13 GB of RAM.

Excellent! So this means that when you load these big images, they take up all the space in your RAM. The solution would be to do the preprocessing ahead of training, so that you don’t run into this issue inside the training loop.
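Something like this, run once before training, is roughly what I mean (a sketch only: the output folder name and PNG format are my assumptions, and it still loads one full slide at a time, so it may be tight on 13 GB of RAM):

import os
from PIL import Image
from tifffile import imread

OUT_DIR = 'resized_224'                        # hypothetical output folder
os.makedirs(OUT_DIR, exist_ok=True)

# df and image_path are the same objects as in your training script.
for image_id in df.image_id:
    src = os.path.join(image_path, str(image_id) + '.tif')
    img = imread(src)                          # loads the full slide (roughly 6 GB as uint8)
    small = Image.fromarray(img).resize((224, 224))
    small.save(os.path.join(OUT_DIR, str(image_id) + '.png'))
    del img                                    # release the big array before the next slide

Your Dataset would then read the small PNGs instead of the .tif files, and you could drop ToPILImage and Resize from the transform.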

So, do you mean that manually converting the images from (60797, 34007, 3) to (224, 224, 3) ahead of time would solve the problem?

Let me give it a try, then.

Note that each image of shape (60797, 34007, 3) will take ~23 GB in float32 and ~5.8 GB in uint8, which might already fill up your host RAM.
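For reference, that is just the number of elements times the bytes per element:

h, w, c = 60797, 34007, 3
print(h * w * c / 2**30)        # ~5.8 GiB as uint8 (1 byte per value)
print(h * w * c * 4 / 2**30)    # ~23.1 GiB as float32 (4 bytes per value)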

Yes, I’m unable to load even a single image. Any ideas on how to deal with this kind of problem? Each image is at least 1.2 GB in size.
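If your .tif files are saved as tiled (or pyramidal) TIFFs, you may not have to read the whole array at once: tifffile can expose the file as a zarr array and decode only the tiles you actually slice. A rough sketch (whether your files are tiled, and the 512-pixel crop, are assumptions):

import zarr
from tifffile import imread

# Open the TIFF lazily as a zarr store; nothing is decoded into RAM yet.
store = imread('slide.tif', aszarr=True)   # 'slide.tif' stands in for one of your files
z = zarr.open(store, mode='r')

# A plain tiled TIFF appears as one big (H, W, 3) array on disk; a pyramidal
# TIFF appears as a group whose members z['0'], z['1'], ... are the levels.
if isinstance(z, zarr.Group):
    z = z['0']                             # full resolution (pick a higher index for a smaller level)

# Slice out a 512 x 512 patch; only the tiles covering it are read and decoded.
patch = z[:512, :512]
store.close()

If the slides come from a scanner that writes pyramidal files, OpenSlide’s read_region is another common way to read regions and downsampled levels without touching the full image.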