What memory is used when downloading data on each rank - multiple GPUs

I'm sure this is a basic question / misunderstanding on my part. I am training a model using DDP on AWS SageMaker, on a single instance with multiple GPUs. There is a large dataset on S3 and it seems I need to download the data on each of the ranks. The data is downloaded and then a dataset/dataloader is created. If the data is downloaded and read into memory via pandas, does that mean that GPU memory is being used to read the data into pandas, since this needs to happen on each rank?

Typical DDP setups use a DistributedSampler to avoid reusing the same samples and to split them among the ranks. Are you using such a sampler already?

I am. To give a little more detail, I am essentially following tutorials such as examples/distributed/ddp-tutorial-series/multigpu_torchrun.py at main · pytorch/examples · GitHub. (Imagine that in this script, instead of generating the random data, it is reading a text file from S3.)

In that tutorial, for each rank (a GPU, when using a single instance/node with multiple GPUs), the same dataset appears to be read into memory. So in my script I have a step that reads data from S3 into a pandas DataFrame, and it seems I need to do this on each rank. When doing this, is the process loading the data into GPU memory?

I assume you are loading the entire dataset into host RAM on each rank, which is a bit wasteful. A common alternative is lazy loading: you would load only the dataset location or sample paths in the Dataset.__init__ method and load the needed sample in Dataset.__getitem__. The sampler would then make sure only the needed samples are loaded on each rank.
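
For example, here is a minimal sketch of that lazy loading pattern (assuming the S3 data can be split into per-sample objects; list_sample_keys and load_sample_from_s3 are hypothetical placeholders for your own S3 access code, e.g. via boto3 or s3fs):

import torch
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler

class LazyS3Dataset(Dataset):
    def __init__(self, sample_keys):
        # Only the list of sample locations is kept in host RAM here.
        self.sample_keys = sample_keys

    def __len__(self):
        return len(self.sample_keys)

    def __getitem__(self, idx):
        # Download/read a single sample on demand, on the host (CPU).
        features, label = load_sample_from_s3(self.sample_keys[idx])
        return torch.tensor(features, dtype=torch.float32), torch.tensor(label)

dataset = LazyS3Dataset(list_sample_keys("s3://bucket/prefix"))
sampler = DistributedSampler(dataset)  # splits the sample indices among the ranks
loader = DataLoader(dataset, batch_size=32, sampler=sampler)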

Yes, I think I am. In a script using DDP, would this mean that the data is being loaded into GPU RAM?

Dataset.__getitem__ reads from disk, is that what you mean?

Not necessarily, as it depends on the Dataset.__getitem__ implementation and whether you move the data to the device there (which is not the standard use case, as the data is usually loaded on the host in the Dataset and moved to the GPU inside the training loop).
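
As a rough sketch of that standard flow (not the exact tutorial code; model, loss_fn, optimizer, loader, and rank are assumed to exist already): the DataLoader yields CPU tensors, and only the current batch is copied to the GPU inside the loop.

device = torch.device(f"cuda:{rank}")  # local GPU of this DDP process

for inputs, targets in loader:
    # Batches arrive as CPU tensors; GPU memory is only touched from here on.
    inputs = inputs.to(device)
    targets = targets.to(device)
    output = model(inputs)
    loss = loss_fn(output, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()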

So when a dataset is created (e.g. a CSV read into pandas) in a script using DDP as in the tutorial above (examples/distributed/ddp-tutorial-series/multigpu_torchrun.py at main · pytorch/examples · GitHub) on a single instance with multiple GPUs, like this:

df = pd.read_csv('s3://location_of_file')

and then it's loaded into PyTorch like:

X = torch.tensor(df.values, dtype=torch.float32)
loader = DataLoader(X)

does this happen in some non-GPU memory that is somehow associated with each process (rank)?

Yes, in host RAM.

It’s just the RAM available in your system that can be used by all processes.
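
If you want to double check this inside one of the rank processes (assuming rank is the local GPU index), something like the following should show that creating the tensor does not touch GPU memory:

X = torch.tensor(df.values, dtype=torch.float32)
print(X.device)                            # "cpu" -> the tensor lives in host RAM
print(torch.cuda.memory_allocated(rank))   # not increased by creating X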