Best practices for training on 500GB of large instances

I am curious what the best practices are for loading training instances from disk when each instance is 220K floats and the full dataset is over 500GB. I am training on audio. I max out at a minibatch size of 2048; anything larger exhausts GPU memory. I am using pytorch-lightning but am fine with converting to vanilla torch.
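For concreteness, here is a sketch of the kind of streaming setup I have in mind: one big float32 memmap on disk, read lazily per item so `DataLoader` workers each open their own handle. The filename, instance count, and worker count are placeholders, not my actual config.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class MemmapAudioDataset(Dataset):
    """Reads fixed-size instances out of one large float32 file on disk."""

    def __init__(self, path, num_instances, instance_len=220_000):
        self.path = path
        self.num_instances = num_instances
        self.instance_len = instance_len
        self._mm = None  # opened lazily so each DataLoader worker gets its own handle

    def __len__(self):
        return self.num_instances

    def __getitem__(self, idx):
        if self._mm is None:
            self._mm = np.memmap(
                self.path, dtype=np.float32, mode="r",
                shape=(self.num_instances, self.instance_len),
            )
        # copy() so the returned tensor owns its memory rather than
        # aliasing the OS page cache backing the memmap
        return torch.from_numpy(self._mm[idx].copy())

# Placeholder usage; num_workers/pin_memory tuned to overlap disk reads with GPU work:
# loader = DataLoader(MemmapAudioDataset("train.dat", num_instances=600_000),
#                     batch_size=2048, num_workers=8, pin_memory=True, shuffle=True)
```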

I am concerned that if I switch to multiple GPUs, the training speed will nonetheless be gated by the speed of the hard disk, not by how many GPUs I have.
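To put rough numbers on that concern (assuming the instances are stored as 4-byte float32, and a ballpark 200 MB/s sequential read speed for a spinning disk, both of which are assumptions):

```python
# Back-of-the-envelope I/O arithmetic for one minibatch.
floats_per_instance = 220_000
bytes_per_instance = floats_per_instance * 4        # float32 -> ~0.88 MB/instance
batch_size = 2048
bytes_per_batch = bytes_per_instance * batch_size   # ~1.8 GB per minibatch

disk_bytes_per_s = 200e6  # assumed spinning-disk sequential throughput
seconds_per_batch = bytes_per_batch / disk_bytes_per_s

print(f"{bytes_per_batch / 1e9:.2f} GB per batch, "
      f"~{seconds_per_batch:.0f} s just to read it from disk")
```

So at that read speed, each batch takes several seconds of pure disk I/O before any GPU does anything, which is why adding GPUs alone may not help.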