I’m training a CNN that classifies small regions (7x7) using 40 large input layers, all of which contribute to the classification. So the CNN takes a (40, 7, 7) input to classify.
But the 40 input layers themselves are very large (over 10k x 10k pixels), so I can’t read all 40 files in the dataset constructor; that would exceed the amount of RAM I have.
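(For scale, here is the back-of-envelope arithmetic, assuming float32 pixels, which is an assumption on my part:)

```python
# Rough RAM needed to hold all 40 layers in memory at once,
# assuming float32 (4 bytes) per pixel.
n_layers, height, width, bytes_per_pixel = 40, 10_000, 10_000, 4
total_gb = n_layers * height * width * bytes_per_pixel / 1e9
print(f"{total_gb:.0f} GB")  # 16 GB
```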
So instead, I’m forced to open all 40 files during every __getitem__ call of the dataset and read just the desired 7x7 location from each layer for that iteration.
But this has made my training very slow; I think most of the time is spent opening all 40 files and reading a window from each on every iteration.
I am already using multiple workers:
torch.utils.data.DataLoader(dset, batch_size=32, num_workers=4)
import numpy as np
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self):
        self.all_input_layers = ["file1", "file2", ..., "file40"]

    def __getitem__(self, idx):
        all_layer_data = []
        for file in self.all_input_layers:
            curr_window = read_file_window(file)  # returns 7x7 np array
            all_layer_data.append(curr_window)
        data = np.stack(all_layer_data, axis=0)  # (40, 7, 7) data that will be inputted into the CNN
        return data
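(For reference, one approach I’ve seen suggested, though I’m not sure it applies to my file format, is memory-mapping each layer once so that a window read doesn’t reopen the file. A minimal sketch, assuming the layers could be stored as raw float32 arrays on disk; the filename and small demo size here are placeholders:)

```python
import numpy as np

H, W = 100, 100  # demo size; would be 10_000 x 10_000 in practice

# Create a fake raw float32 layer on disk for the demo.
demo = np.memmap("layer0.dat", dtype=np.float32, mode="w+", shape=(H, W))
demo[:] = np.arange(H * W, dtype=np.float32).reshape(H, W)
demo.flush()

# Memory-map the layer once (e.g. in the dataset constructor or per worker);
# slicing then reads only the touched pages from disk, not the whole file.
layer = np.memmap("layer0.dat", dtype=np.float32, mode="r", shape=(H, W))
r, c = 10, 20
window = np.array(layer[r:r + 7, c:c + 7])  # copy just the 7x7 window into RAM
print(window.shape)  # (7, 7)
```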
What are the other things I should try to speed this up?
- Should I use torch.multiprocessing.set_sharing_strategy('file_system')? I don’t have any issues with too many open file descriptors, which is the case that the ‘file_system’ strategy seems to be recommended for, so I’m not sure it would help me.
- What else?