I’m attempting to load about 20 .npz files, each about 4 GB in size, and concatenate them into one large tensor for my training set (my test set is about half that size). I have about 10 GB of RAM on my machine.
My general procedure thus far has been:
import os
from pathlib import Path

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class CustomTensorDataset(Dataset):
    def __init__(self, data_tensor):
        self.data_tensor = data_tensor

    def __getitem__(self, index):
        return self.data_tensor[index]

    def __len__(self):
        return self.data_tensor.size(0)


def return_data(args):
    # train_path, batch_size and num_workers come from args in my real code.
    chunks = []
    for train_npz in os.listdir(train_path):
        # Load each .npz, take its 'arr_0' array, add a channel dim, scale to [0, 1].
        data = np.load(Path(train_path) / train_npz)
        data = torch.from_numpy(data['arr_0']).unsqueeze(1).float()
        data /= 255
        chunks.append(data)
    train_data = torch.cat(chunks)

    train_kwargs = {'data_tensor': train_data}
    dset = CustomTensorDataset
    train_dataset = dset(**train_kwargs)

    train_loader = DataLoader(dataset=train_dataset,
                              batch_size=batch_size,
                              shuffle=True,
                              num_workers=num_workers,
                              pin_memory=True,
                              drop_last=True)
This obviously isn’t working because of memory: the concatenated float tensor is far larger than my 10 GB of RAM. I’ve looked into a few options, like loading the data lazily (as in the thread Loading huge data functionality), which I tried but couldn’t get to work well (a rough sketch of my attempt is below), and caching (as in the thread Request help on data loader), though I’m not sure either of these fits my data format.
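For reference, my lazy-loading attempt looked roughly like the sketch below (simplified; the class name LazyNpzDataset is just illustrative, and it assumes each file stores its images under 'arr_0' as above). It keeps only one decompressed file in memory at a time, but with shuffle=True it has to re-open and decompress a different 4 GB file for almost every sample, which I suspect is why it seemed unusably slow:

import numpy as np
import torch
from pathlib import Path
from torch.utils.data import Dataset


class LazyNpzDataset(Dataset):
    """Loads one .npz file at a time instead of concatenating everything up front."""

    def __init__(self, npz_dir):
        self.files = sorted(Path(npz_dir).glob('*.npz'))
        # Record how many samples each file holds so a global index can be mapped to a file.
        self.lengths = [np.load(f)['arr_0'].shape[0] for f in self.files]
        self.offsets = np.cumsum([0] + self.lengths)
        self._cache_idx = None   # index of the currently loaded file
        self._cache_data = None  # its decompressed 'arr_0' array

    def __len__(self):
        return int(self.offsets[-1])

    def __getitem__(self, index):
        # Find which file this global index falls into.
        file_idx = int(np.searchsorted(self.offsets, index, side='right') - 1)
        if file_idx != self._cache_idx:
            # Re-load (and fully decompress) the file if it isn't the cached one.
            self._cache_data = np.load(self.files[file_idx])['arr_0']
            self._cache_idx = file_idx
        sample = self._cache_data[index - self.offsets[file_idx]]
        # Add a channel dim and scale to [0, 1], matching the eager version above.
        return torch.from_numpy(sample).unsqueeze(0).float() / 255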
Any help on how to do this efficiently would be great. Thanks.