Using DDP with WebDataset

This is my first time using WebDataset and I have multiple shards (about 60) with a large number of images. Everything worked as I would expect with the normal Dataset class when I was using a single GPU. However, once I set devices to 2, WebDataset raised ValueError: you need to add an explicit nodesplitter to your input pipeline for multi-node training.
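For reference, I believe the error is asking me to pass an explicit nodesplitter so each rank reads its own subset of shards. A minimal, untested sketch of what I understand it to mean (url, batch_size, and preprocess are placeholders from my own pipeline):

import webdataset as wds

# split_by_node hands each node/rank a disjoint subset of the shards
dataset = wds.WebDataset(url, nodesplitter=wds.split_by_node).shuffle(1000).decode("rgb").to_tuple("png", "json").map(preprocess)
dataloader = wds.WebLoader(dataset, batch_size=batch_size)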
I found two approaches that allow using multiple GPUs with WebDataset.

  1. Using .with_epoch

According to the WebDataset GitHub, I could simply use the with_epoch function on my dataset as follows:

dataset = wds.WebDataset(url, resampled=True).shuffle(1000).decode("rgb").to_tuple("png", "json").map(preprocess).with_epoch(10000) 
dataloader = wds.WebLoader(dataset, batch_size=batch_size)
  2. Using ddp_equalize
    According to the WebDataset MultiNode documentation:
dataset_size, batch_size = 1282000, 64 
dataset = wds.WebDataset(urls).decode("pil").shuffle(5000).batched(batch_size, partial=False) 
loader = wds.WebLoader(dataset, num_workers=4) 
loader = loader.ddp_equalize(dataset_size // batch_size)

Could someone please help me understand what is happening in these two pieces of code and how they differ? In the second case, is dataset_size just a nominal size? Which, if either, is better? I would also appreciate an example of the best way to use WebDataset with PyTorch Lightning in a multi-GPU and multi-node scenario.
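For context, this is roughly how I am wiring the loader into Lightning at the moment (a simplified sketch only; the DataModule name, urls, preprocess, and the hyperparameters are placeholders from my setup, not from the WebDataset docs):

import webdataset as wds
import pytorch_lightning as pl

class WebDatasetDataModule(pl.LightningDataModule):
    def __init__(self, urls, batch_size=64):
        super().__init__()
        self.urls = urls
        self.batch_size = batch_size

    def train_dataloader(self):
        # resampled=True draws shards with replacement on every rank;
        # with_epoch cuts the infinite stream into fixed-length "epochs"
        dataset = (
            wds.WebDataset(self.urls, resampled=True)
            .shuffle(1000)
            .decode("rgb")
            .to_tuple("png", "json")
            .map(preprocess)  # same map function as in the snippets above
            .with_epoch(10000)
        )
        return wds.WebLoader(dataset, batch_size=self.batch_size, num_workers=4)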

Hi, these are the PyTorch forums, so I don’t think many members will have experience with or understand WebDataset in detail. You might get more insight posting on that repo’s GitHub issues. However, if there are any issues within the torch.distributed package itself, then please post here.