To keep a CPU bottleneck from eating into GPU budget, a common approach is to offload data loading to a fleet of CPU instances and send clean batches to the fleet of GPU instances. This is documented, for example, in:
- SageMaker Heterogeneous Clusters
- Overcoming Data Preprocessing Bottlenecks with TensorFlow Data Service, NVIDIA DALI, and Other Methods
TF has the TF Data Service for this. Does PyTorch have a simple solution for the same pattern: running the dataset + dataloader on one fleet, with the training loop pulling clean batches from a separate fleet?
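To make the pattern concrete, here is a minimal sketch using only the Python standard library (a `multiprocessing.managers` queue served over TCP), as a stand-in for whatever PyTorch-native solution might exist. The host, port, authkey, and batch contents are made up for illustration; a real setup would push collated tensor batches from an actual `DataLoader`:

```python
from multiprocessing.managers import BaseManager
import queue
import threading

# Shared queue living on the CPU fleet's "batch server".
batch_queue = queue.Queue(maxsize=8)

class BatchManager(BaseManager):
    pass

BatchManager.register("get_batch_queue", callable=lambda: batch_queue)

def start_batch_server():
    # CPU-fleet side: serve the queue over TCP (address/authkey are made up).
    mgr = BatchManager(address=("127.0.0.1", 50515), authkey=b"batches")
    server = mgr.get_server()  # binds the socket synchronously
    threading.Thread(target=server.serve_forever, daemon=True).start()

def produce_batches(n):
    # CPU-fleet side: the Dataset/DataLoader loop would go here.
    for i in range(n):
        batch_queue.put([i, i + 1])  # placeholder for a collated tensor batch
    batch_queue.put(None)            # sentinel: end of epoch

def consume_batches():
    # GPU-fleet side: connect remotely and pull clean batches.
    mgr = BatchManager(address=("127.0.0.1", 50515), authkey=b"batches")
    mgr.connect()
    q = mgr.get_batch_queue()
    batches = []
    while (batch := q.get()) is not None:
        batches.append(batch)        # here you'd move to GPU and run a step
    return batches

if __name__ == "__main__":
    start_batch_server()
    produce_batches(3)
    print(consume_batches())         # [[0, 1], [1, 2], [2, 3]]
```

This works, but it's hand-rolled plumbing (no sharding across consumers, no backpressure tuning, no fault tolerance), which is why I'm hoping PyTorch has something built in that plays this role.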