Hello,
I want to make a small tool that pre-splits a dataset before training starts.
What I did:
- split the original test set into three subsets (returned as a list) using torch.utils.data.random_split, and wrap each subset in its own DataLoader,
- save all three testloaders to three *.pt files on disk using torch.save,
- reload the three *.pt files from disk into new testloaders using torch.load.
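Roughly, the three steps look like this (a sketch with a fake stand-in test set; the sizes, batch size, and file names are placeholders, not my real data):

```python
import os
import tempfile

import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Hypothetical stand-in for the real test set: 30 fake 1x8x8 "images".
full_test = TensorDataset(torch.randn(30, 1, 8, 8), torch.arange(30))

# Step 1: split into three fixed subsets and wrap each in a DataLoader.
parts = random_split(full_test, [10, 10, 10],
                     generator=torch.Generator().manual_seed(0))
loaders = [DataLoader(p, batch_size=5, shuffle=False) for p in parts]

# Step 2: save each loader to its own *.pt file.
out_dir = tempfile.mkdtemp()
for i, dl in enumerate(loaders):
    torch.save(dl, os.path.join(out_dir, f"testloader{i + 1}.pt"))

# Step 3: reload the three files into new loaders.
# (weights_only=False lets newer PyTorch unpickle a full DataLoader object.)
reloaded = [torch.load(os.path.join(out_dir, f"testloader{i + 1}.pt"),
                       weights_only=False)
            for i in range(3)]

# With shuffle=False the first batches match after the round trip.
print(torch.equal(next(iter(loaders[0]))[0], next(iter(reloaded[0]))[0]))  # -> True
```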
Basically, I am following the suggestion here for saving/loading the dataloader.
When I compare the post-step-1 testloader1 against the post-step-3 reload_testloader1 (I take the first batch from each and plot the images), I find they are not the same.
Initially I suspected some random/shuffle flag was involved, but since all three testloaders should already be fixed by the end of step 1, I couldn't figure out where the problem could be.
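Here is the toy check I used to convince myself what shuffle does to the first batch (hypothetical dataset, not my real one): a fresh iterator over a shuffle=True loader draws a new permutation each time, while shuffle=False is stable.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(12.0))

# shuffle=False: the first batch is identical on every fresh iteration.
plain = DataLoader(ds, batch_size=4, shuffle=False)
a = next(iter(plain))[0]
b = next(iter(plain))[0]
print(torch.equal(a, b))  # -> True

# shuffle=True: each fresh iterator draws a new permutation, so the
# first batch generally differs between iterations.
mixed = DataLoader(ds, batch_size=4, shuffle=True,
                   generator=torch.Generator().manual_seed(0))
c = next(iter(mixed))[0]
d = next(iter(mixed))[0]
print(c, d)  # typically different orderings
```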
Is there anything tricky about torch.save/torch.load for a DataLoader object? The save/load code is quite simple:
import torch
from torch.utils.data import DataLoader

def save_dl(dataloader_obj: DataLoader, file_name: str) -> None:
    """Save the dataloader to a .pt file."""
    torch.save(dataloader_obj, file_name)

def load_dl(file_name: str) -> DataLoader:
    """Load the dataloader from a .pt file."""
    return torch.load(file_name)