I created a class that inherits from torch.utils.data.Dataset and split an instance of it with torch.utils.data.random_split(dataset, [TRAIN_SIZE, TEST_SIZE]).
Afterwards I wrap the two splits in torch.utils.data.DataLoader instances and finally save them to the hard drive with torch.save:
dataset = DataSetClass(df=DATA, transform=transform, device=device, window=3)
DATASET_LENGTH = len(dataset)
TRAIN_SIZE = int(DATASET_LENGTH * 0.75)
TEST_SIZE = DATASET_LENGTH - TRAIN_SIZE
train_dataset, test_dataset = random_split(dataset, [TRAIN_SIZE, TEST_SIZE])
dataloader_train = DataLoader(dataset=train_dataset, batch_size=1, shuffle=False, num_workers=4)
dataloader_test = DataLoader(dataset=test_dataset, batch_size=1, shuffle=False, num_workers=4)
torch.save(dataloader_train, "/path/to/train.pt")
torch.save(dataloader_test, "/path/to/test.pt")
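For reference, here is a minimal standalone script that reproduces the observation. It uses a hypothetical TensorDataset of random numbers in place of my real DataSetClass and saves the split subsets directly (the DataLoader wrapping does not change the outcome), so the only assumption is the 75/25 split from above:

```python
import os
import tempfile

import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in for my DataSetClass: 1000 samples of 10 floats each.
data = torch.randn(1000, 10)
full_dataset = TensorDataset(data)

# Same 75/25 split as in my code.
train_dataset, test_dataset = random_split(full_dataset, [750, 250])

with tempfile.TemporaryDirectory() as tmp:
    train_path = os.path.join(tmp, "train.pt")
    test_path = os.path.join(tmp, "test.pt")
    torch.save(train_dataset, train_path)
    torch.save(test_dataset, test_path)
    sizes = (os.path.getsize(train_path), os.path.getsize(test_path))
    print(sizes)  # both files come out roughly the same size
```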
However, the two saved files are the same size on disk. Why is that?