Contribution: Dataloader with workers on remote nodes

Hi.

In my lab we have a situation where the GPU servers have too few CPUs to load and preprocess data. As a result, it is difficult to keep the GPUs busy and our experiments take longer than we would like.
To work around this, I have developed a DataLoader variant that dispatches item requests to other servers via remote procedure calls over the network. That way, one can offload the dataloader workers to one or several CPU nodes.
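
For intuition, here is a minimal conceptual sketch of the idea using Python's built-in xmlrpc module. This is not how RPCDataloader is implemented, and the hostname, port, and dataset path are made up, but it shows what "dispatching item requests over the network" means: a worker process started by hand on a CPU node builds the dataset and serves preprocessed items, while the training node fetches them by index.

    # --- worker side (started manually on a CPU node), conceptual sketch only ---
    import pickle
    from xmlrpc.server import SimpleXMLRPCServer
    from xmlrpc.client import Binary
    from torchvision import transforms
    from torchvision.datasets import ImageFolder

    # The dataset lives on the worker, so images are decoded and transformed here.
    dataset = ImageFolder("/datasets/imagenet/train", transform=transforms.ToTensor())

    def get_item(index):
        # Load + preprocess on this node, ship the result back as bytes.
        return Binary(pickle.dumps(dataset[index]))

    server = SimpleXMLRPCServer(("0.0.0.0", 5000), allow_none=True)
    server.register_function(get_item)
    server.serve_forever()

    # --- training side (GPU node) ---
    import pickle
    from xmlrpc.client import ServerProxy

    worker = ServerProxy("http://cpu-node-01:5000")
    image, label = pickle.loads(worker.get_item(0).data)

RPCDataloader builds on the same data flow but handles batching, samplers, and multiple workers for you, as the example further down shows.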

It’s over here: GitHub - CEA-LIST/RPCDataloader: A variant of the PyTorch Dataloader using remote workers.

The interface is similar to the regular DataLoader except for a few modifications:

  • Workers must be started manually on remote nodes (you can even put some of them locally if you have spare CPU cores).
  • The dataloader takes the dataset constructor and arguments instead of the dataset instance itself (the dataset object gets instantiated on the workers).
  • The dataloader needs the list of worker IPs and ports.

Note that you might need a fast network connection.
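
As a rough back-of-the-envelope illustration (all numbers below are assumptions, not measurements): if the workers return decoded, augmented 224×224×3 float32 tensors, each sample is roughly 0.6 MB, so sustaining 2,000 images/s already takes on the order of 1.2 GB/s.

    # Back-of-the-envelope bandwidth estimate (illustrative assumptions only).
    bytes_per_image = 224 * 224 * 3 * 4   # decoded float32 CHW tensor, ~0.6 MB
    images_per_second = 2000              # assumed aggregate training throughput
    print(f"{bytes_per_image * images_per_second / 1e9:.2f} GB/s")  # ~1.20 GB/s

Shipping uint8 tensors or still-compressed images instead would reduce this considerably, so the actual requirement depends on what your transform pipeline returns.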


I have updated the API to make it look even more like the traditional Dataset and DataLoader classes. This should make it easier to interface with third-party libraries.

Spot the differences!

    # argparser.add_argument("--num-workers", type=int)
    argparser.add_argument("--workers", type=str, nargs="+")

    ...

    # Datasets
    train_dataset = RPCDataset(
        args.workers,
        ImageFolder,
        root=args.data_path + "/train",
        transform=train_transform,
    )

    ...

    train_loader = RPCDataloader(
        train_dataset,
        batch_size=args.batch_size,
        sampler=train_sampler,
        pin_memory=True,
        # num_workers=args.num_workers
    )
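
For clarity, the --workers argument with nargs="+" simply collects a list of worker addresses from the command line, which is then passed to RPCDataset so it can reach the remote workers. The host:port format below is only an assumption for illustration; check the repository for the exact convention.

    import argparse

    # Minimal sketch of the argparse change: --workers collects one address per worker.
    argparser = argparse.ArgumentParser()
    argparser.add_argument("--workers", type=str, nargs="+")

    args = argparser.parse_args(["--workers", "cpu-node-01:5000", "cpu-node-02:5000"])
    print(args.workers)  # ['cpu-node-01:5000', 'cpu-node-02:5000']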

Check the full example here.

Thanks for sharing the code!
Do you have some performance numbers for local data loading on machines with “too few CPUs to load and preprocess data” vs. the offloaded use case?
It would be interesting to see at which point (and at which network speed) you would see a benefit, as I would expect the network bandwidth to be a significant limiting factor.

Thanks!

I just did a test, training swinv2_S on ImageNet with 8 A100s:

  • nvidia-smi reports a GPU utilization above 80%.
  • Network bandwidth is 1.5 G/s (IP over InfiniBand in this case).
  • Each training script uses about 1.2 CPUs, i.e. roughly 10 cores overall on the GPU node.

So a fast interconnect is needed, but that is not unreasonable for many use cases. Since the library does not incur significant CPU overhead, you can use it even for local data workers and keep the flexibility to expand to another node if needed.

I think this approach is also relevant for smaller models: a job scheduler can squeeze a job onto a node that is almost full in terms of CPUs but still has a couple of free GPUs, thus improving cluster utilization.