This is my first time using WebDataset. I have multiple shards (about 60) containing a large number of images. Everything worked as I would expect from a normal `Dataset` class while I was using a single GPU. However, once I set `devices` to 2, I received this error:

```
ValueError: you need to add an explicit nodesplitter to your input pipeline for multi-node training
```
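From the error message, my understanding is that a nodesplitter is just a function that partitions the shard list across ranks, so that each GPU reads different shards. Conceptually (this is my own sketch of the idea, not the library's actual implementation) something like:

```python
# Conceptual sketch of what a nodesplitter does: each rank keeps
# every world_size-th shard, so no two GPUs read the same data.
def split_by_node(urls, rank, world_size):
    return urls[rank::world_size]

shards = [f"shard-{i:04d}.tar" for i in range(6)]
print(split_by_node(shards, rank=0, world_size=2))  # shards 0, 2, 4
print(split_by_node(shards, rank=1, world_size=2))  # shards 1, 3, 5
```

Is that roughly what the error is asking me to supply?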
I found two approaches for using multiple GPUs with WebDataset.

According to the WebDataset GitHub README, I could simply add `with_epoch` to my dataset, as follows:
```python
dataset = (
    wds.WebDataset(url, resampled=True)
    .shuffle(1000)
    .decode("rgb")
    .to_tuple("png", "json")
    .map(preprocess)
    .with_epoch(10000)
)
dataloader = wds.WebLoader(dataset, batch_size=batch_size)
```
According to the WebDataset multi-node documentation, I could instead do:
```python
dataset_size, batch_size = 1282000, 64
dataset = (
    wds.WebDataset(urls)
    .decode("pil")
    .shuffle(5000)
    .batched(batch_size, partial=False)
)
loader = wds.WebLoader(dataset, num_workers=4)
loader = loader.ddp_equalize(dataset_size // batch_size)
```
Could someone please help me understand what is happening in these two pieces of code and how they differ? In the second case, is `dataset_size` just a nominal size? Is one approach better than the other? I would also appreciate an example of the best way to use WebDataset with PyTorch Lightning in a multi-GPU, multi-node scenario.