How to run data loading on a separate machine?


To keep a CPU bottleneck from wasting GPU budget, a common idea is to offload data loading to a fleet of CPU instances and send clean batches to the fleet of GPU instances. This is documented, for example, here:

TF has the TF Data Service for this. Does PyTorch have a simple solution for this? That is, running the dataset + dataloader on one fleet, while the training loop pulls clean batches on a separate fleet?

The SageMaker sample does this by creating a custom Dataset class that pulls records from a gRPC service, but I'm wondering if there is something more built-in.
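A minimal sketch of the "custom Dataset pulling from a remote service" pattern follows. It is not a PyTorch or TorchData API. To stay self-contained it swaps gRPC for the standard library's `multiprocessing.connection` (`Listener`/`Client`), and the names `serve_batches`, `RemoteBatchStream` and `END_OF_STREAM` are hypothetical. It assumes a single trainer per producer and ignores sharding, retries and backpressure beyond the socket buffer.

```python
import threading
from multiprocessing.connection import Client, Listener
from typing import Any, Callable, Iterable, Iterator, Tuple

# Hypothetical sentinel marking the end of the stream.
# Assumption: real batches are never None.
END_OF_STREAM = None


def serve_batches(listener: Listener, make_batches: Callable[[], Iterable[Any]]) -> None:
    """CPU-fleet side: accept one trainer connection and stream ready batches to it."""
    with listener.accept() as conn:
        for batch in make_batches():
            conn.send(batch)  # pickled and written to the socket
        conn.send(END_OF_STREAM)


class RemoteBatchStream:
    """GPU-fleet side: iterable that pulls preprocessed batches from a remote producer."""

    def __init__(self, address: Tuple[str, int], authkey: bytes) -> None:
        self.address = address
        self.authkey = authkey

    def __iter__(self) -> Iterator[Any]:
        with Client(self.address, authkey=self.authkey) as conn:
            while True:
                batch = conn.recv()  # blocks until the producer sends the next batch
                if batch is END_OF_STREAM:
                    return
                yield batch


if __name__ == "__main__":
    # Demo on one machine: a thread stands in for the remote CPU worker.
    listener = Listener(("localhost", 0), authkey=b"change-me")  # port 0 picks a free port
    producer = threading.Thread(
        target=serve_batches,
        args=(listener, lambda: ([i, i + 1] for i in range(0, 6, 2))),
    )
    producer.start()
    for batch in RemoteBatchStream(listener.address, b"change-me"):
        print(batch)  # the training step would consume the batch here
    producer.join()
    listener.close()
```

To plug this into a training loop, the iterable could be wrapped in a `torch.utils.data.IterableDataset` subclass and passed to `DataLoader(dataset, batch_size=None)`, so the already-batched items are not batched a second time. Since `Connection.send`/`recv` use pickle, this should only be used between trusted hosts.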

Thank you for sharing your use case. We have noticed this trend, especially in production, and we are discussing it as we design the new TorchData project to accommodate such use cases. However, since this depends on how your fleets are launched and how they communicate with each other, which we can't know in general, we don't have a generalized solution for now.