Reproducibility with DataLoader: shuffle=True and seeds

I am concerned about reproducibility.
Is there a way to use seeds together with shuffle=True and still keep my runs reproducible?

Let’s say I would use:

import torch

def set_seeds(seed: int = 42):
    """Sets random seeds for torch operations.

    Args:
        seed (int, optional): Random seed to set. Defaults to 42.
    """
    # Set the seed for general torch operations
    torch.manual_seed(seed)
    # Set the seed for CUDA torch operations (ones that happen on the GPU)
    torch.cuda.manual_seed(seed)

together with the DataLoader:

train_dataloader = DataLoader(dataset=train_data,
                              collate_fn=None,
                              batch_size=None,  # None disables automatic batching; set an int for batched samples
                              num_workers=1,  # number of subprocesses used for data loading (higher = more)
                              shuffle=True,
                              pin_memory=True)

I will probably also split the data (train, val).

How do I keep the order of the images the same across runs?

Because of my earlier problem with the DataLoader (wrong/different image shapes after the DataLoader; a bug in the DataLoader?), it seems that I have to use shuffling.

Seeding the code before iterating the DataLoader, or rather before the iterator is created, should work for a simple Dataset, as seen here:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(40).view(-1, 1))
loader = DataLoader(dataset, num_workers=2, shuffle=True, batch_size=10)

torch.manual_seed(2809)
for data in loader:
    print(data)
    
torch.manual_seed(2809)
for data in loader:
    print(data)
    
torch.manual_seed(2809)
iter_loader = iter(loader)
while True:
    try:
        data = next(iter_loader)
        print(data)
    except StopIteration:
        print("Done")
        break
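Alternatively, instead of re-seeding the global RNG before every epoch, you can pass a dedicated torch.Generator to the DataLoader via its generator argument. The shuffle order then depends only on that generator's state, so unrelated torch.rand* calls elsewhere in your code cannot shift it. A minimal sketch (the seed value 2809 is just an example):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(40).view(-1, 1))

# A dedicated generator isolates the shuffle order from the global RNG state.
g = torch.Generator()
g.manual_seed(2809)
loader = DataLoader(dataset, shuffle=True, batch_size=10, generator=g)
first_run = [batch[0].clone() for batch in loader]

# Re-seed the generator before the next epoch to repeat the same order.
g.manual_seed(2809)
second_run = [batch[0].clone() for batch in loader]

print(all(torch.equal(a, b) for a, b in zip(first_run, second_run)))  # True
```

The permutation is drawn from the generator each time an iterator is created, so re-seeding it right before iterating reproduces the same batch order.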

In case you are using 3rd-party libraries inside your Dataset.__getitem__, you might need to seed these in the worker_init_fn if you are using multiple workers.
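A sketch of that worker_init_fn pattern, assuming a hypothetical dataset whose __getitem__ draws randomness from numpy and the stdlib random module. Each worker derives its seed from torch.initial_seed(), which PyTorch already sets per worker from the base seed plus the worker id:

```python
import random
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class AugmentedDataset(Dataset):
    """Hypothetical dataset using 3rd-party randomness in __getitem__."""
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # Randomness from numpy and the stdlib, not from torch.
        return np.random.rand(1), random.random()

def seed_worker(worker_id):
    # torch.initial_seed() inside a worker is base_seed + worker_id,
    # so each worker gets a distinct but reproducible seed.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

torch.manual_seed(2809)  # determines the base seed for the workers
loader = DataLoader(AugmentedDataset(), num_workers=2, worker_init_fn=seed_worker)
run1 = list(loader)

torch.manual_seed(2809)  # same base seed -> same worker seeds
run2 = list(loader)

print(all(torch.equal(a[0], b[0]) for a, b in zip(run1, run2)))  # True
```

The base seed is drawn from the global torch RNG when the iterator is created, so re-seeding with torch.manual_seed before each epoch reproduces the worker seeds as well.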