How to load all data into GPU for training

You would have to push it onto the GPU with data = data.to('cuda'). Then you can check the device using print(data.device).


I got cuda:0 as the output of print(data.device). Does that mean all the data is already in GPU memory? If so, why might the DataLoader take 70% of the computation time?

Yes, it is on the first GPU device.
How do you time your code? Could you share a small example?
Are you processing the data somehow in your Dataset?

The following is the main part of my code (for simplicity I removed the implementations of the model and loss function):

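import torch
from torch.utils.data import DataLoader, TensorDataset

# data_1 and data_2 are tensors of shape (5000000, 1) that already live on cuda:0
dataset = TensorDataset(data_1, data_2)
train_loader = DataLoader(dataset, batch_size=5000, shuffle=True, drop_last=False)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0)
epochs = 100
for index_epoch in range(epochs):
    print(index_epoch)
    for x_1, x_2 in train_loader:
        z_1 = model(x_1)  # model is the neural network model
        z_2 = model(x_2)
        loss = get_loss(z_1, z_2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # print loss_list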

I timed it using Python's cProfile, and it turns out that 70% of the time is spent in the DataLoader, while less than 10% is spent on the actual backward propagation. There is no additional processing of my data; I simply load it with TensorDataset and DataLoader. The shape of my data is (5000000, 1).


Thanks for the information!
How big is your model? Could you profile it again with torch.utils.bottleneck?

Since CUDA operations run asynchronously, your DataLoader might have to wait for a CUDA op to finish, so the profiler may report a misleading number for it.

An alternative would be to add torch.cuda.synchronize() calls before and after the forward and backward calls, and time it manually.
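A minimal sketch of that manual timing, assuming a tiny stand-in model and batch (both hypothetical, not the poster's actual code). It falls back to the CPU when no GPU is present, in which case the synchronize step is skipped:

```python
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def sync():
    # Wait for all queued CUDA kernels to finish; nothing to do on the CPU.
    if device.type == "cuda":
        torch.cuda.synchronize()

# Hypothetical stand-ins for the real model and one batch of data.
model = torch.nn.Linear(1, 16).to(device)
x = torch.randn(5000, 1, device=device)

sync()                           # make sure earlier GPU work is not counted
start = time.perf_counter()
loss = model(x).pow(2).mean()    # forward pass
loss.backward()                  # backward pass
sync()                           # wait until forward and backward really finished
elapsed = time.perf_counter() - start

print(f"forward + backward: {elapsed * 1e3:.3f} ms")
```

Without the second sync() the timer could stop while kernels are still running, and the remaining time would show up in whatever code touches the GPU next.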


Sorry for the late reply! I used the bottleneck module to profile it, and I still see that most of the running time is spent on loading data, while the time spent on backward is quite small:

------------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                       CPU time        CUDA time            Calls        CPU total       CUDA total
------------------  ---------------  ---------------  ---------------  ---------------  ---------------
stack                  878842.491us    1013487.305us                1     878842.491us    1013487.305us
stack                  868481.575us     998066.406us                1     868481.575us     998066.406us
stack                  861072.974us    1006662.109us                1     861072.974us    1006662.109us
stack                  860799.906us     995949.219us                1     860799.906us     995949.219us
stack                  216507.028us     249775.391us                1     216507.028us     249775.391us
stack                  213380.171us     247549.805us                1     213380.171us     247549.805us
ExpandBackward          20528.386us        114.746us                1      20528.386us        114.746us
sum                     20522.516us        110.840us                1      20522.516us        110.840us
_sum                    20509.686us        102.051us                1      20509.686us        102.051us
ExpandBackward          20494.596us         50.781us                1      20494.596us         50.781us
sum                     20489.766us         46.875us                1      20489.766us         46.875us
_sum                    20479.066us         41.016us                1      20479.066us         41.016us
mean                    11524.652us         51.270us                1      11524.652us         51.270us
mean                    11162.832us         66.406us                1      11162.832us         66.406us
mean                     9420.275us         62.500us                1       9420.275us         62.500us

Have you tried not using a DataLoader …? If the dataset is 1D and sufficiently small, you could just preload it onto the GPU. Then write a custom loader that indexes directly into your TensorDataset; that should significantly speed up your training epochs.

Sorry for the late post, but can you say a bit more about how it should look?

  • given some dataset X, sufficiently small
  • create a TensorDataset: convert X to torch tensors, load them onto your GPU (.to(your_gpu)), and use the built-in class
  • create a custom loading class, i.e. just a class that iterates through the TensorDataset
  • done
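The steps above could be sketched roughly as follows. GPUBatchLoader is a hypothetical name, and the random tensors stand in for the real data; this is one possible way to do it, not a built-in PyTorch class:

```python
import torch
from torch.utils.data import TensorDataset

class GPUBatchLoader:
    """Yield mini-batches by indexing a TensorDataset that already lives on one device."""

    def __init__(self, dataset, batch_size, shuffle=True):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __len__(self):
        # Number of batches per epoch, counting a smaller last batch.
        return (len(self.dataset) + self.batch_size - 1) // self.batch_size

    def __iter__(self):
        n = len(self.dataset)
        device = self.dataset.tensors[0].device
        # Build the sample order on the same device as the data.
        if self.shuffle:
            order = torch.randperm(n, device=device)
        else:
            order = torch.arange(n, device=device)
        for start in range(0, n, self.batch_size):
            idx = order[start:start + self.batch_size]
            # TensorDataset indexes every tensor with idx and returns a tuple.
            yield self.dataset[idx]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
data_1 = torch.randn(12000, 1).to(device)   # stand-in data
data_2 = torch.randn(12000, 1).to(device)
loader = GPUBatchLoader(TensorDataset(data_1, data_2), batch_size=5000)

batch_shapes = [tuple(x_1.shape) for x_1, x_2 in loader]
print(batch_shapes)
```

Each batch is produced by a single indexing operation on the device, so there is no per-sample collation as in the regular DataLoader.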

Is there an example of this you can point to? My dataset is roughly 1.5GB and seems like it would fit entirely on GPU. I’m currently using DataLoader to feed minibatches to the GPU. I’m a newb at pytorch, but it seems like if the Dataloader (or some equivalent) as well as the model were on the GPU, things would go much quicker.


Is there a way to move the whole DataLoader to the GPU (if it has enough memory) once we have created it like this:

train_loader = DataLoader(dataset, batch_size=5000, shuffle=True, drop_last=False)

I am going to iterate through train_loader and process each batch.
Is there a way to move all the data to the GPU so I don't have to call .to(device) in every iteration?


If you are preloading the data in your Dataset, you could directly push it to the GPU, e.g. in the __init__ method.
Usually you are lazily loading and processing each sample, which means you would have to push each sample to the device.
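As a rough sketch of the preloading case: the Dataset below (PreloadedDataset is a hypothetical name, and the random tensors are placeholders) moves everything to the device once in __init__, so the training loop needs no further .to(device) calls.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class PreloadedDataset(Dataset):
    def __init__(self, data, target, device):
        # One-time transfer of the complete dataset to the target device.
        self.data = data.to(device)
        self.target = target.to(device)

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, index):
        # Samples are already on the device, so no copy is needed here.
        return self.data[index], self.target[index]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dataset = PreloadedDataset(torch.randn(1000, 1), torch.randint(0, 2, (1000,)), device)

# num_workers=0 keeps all indexing in the main process.
loader = DataLoader(dataset, batch_size=100, shuffle=True, num_workers=0)
x, y = next(iter(loader))
print(x.shape, x.device)
```

Passing the device as an argument keeps the class usable on machines without a GPU.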


In the multi-GPU case, can we still do this? I have two GPUs, and each has enough memory to hold the data before training. If I load the data and train on a single GPU, the GPU utilization is 25% higher than when loading from the CPU at each batch. However, if I load the data onto the GPU and train with two GPUs, the performance is worse than loading from the CPU. I guess this happens because moving data from gpu0 to gpu1 becomes a bottleneck.

If you are using nn.parallel.DistributedDataParallel, each process could load only its (subset of the) data. nn.DataParallel creates a model replica on each device for each forward pass, splits the data tensor along the batch dimension (dim0), and sends a chunk of the data to each device.
I’m not sure which approach you are using, but DDP should be faster.
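To illustrate the "each process loads only its subset" idea, here is a small sketch using DistributedSampler on toy data. The rank and world size are passed in by hand so the snippet runs without an initialized process group; in a real DDP run they would come from torch.distributed. The helper name indices_for_rank is made up for this example.

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Hypothetical toy dataset with 10 samples.
dataset = TensorDataset(torch.arange(10).unsqueeze(1))

def indices_for_rank(rank, world_size=2):
    # Explicit num_replicas/rank: no process group is required for this demo.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=False)
    return list(sampler)

shard_0 = indices_for_rank(0)
shard_1 = indices_for_rank(1)

# Each process could then keep only its own shard on its own GPU.
local_data_0 = dataset.tensors[0][torch.tensor(shard_0)]
print(shard_0, shard_1, local_data_0.shape)
```

With this split no sample has to travel from one GPU to the other during training.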

I am currently using nn.DataParallel but I will give DDP a try.

I believe I am facing the same issue. Can you point me to an example of how to load all training data to the GPU in the data loader? My images are currently on disk in labeled subdirectories (about 25,000 images).

Also - if the push to GPU is implemented within the data loader, I assume that this removes the ability to be device-agnostic, right?

(BTW, I have only one GPU device.)

If you push the complete data to the GPU, you could still use a DataLoader for batching and shuffling, but multiple workers won’t do much (and might yield errors for multiple CUDA context initializations).

Anyway, the easiest approach would be to load your data beforehand, push it to the GPU via:

data = data.to('cuda')
target = target.to('cuda')

and create a TensorDataset.
Once this is done, you could wrap it into a DataLoader with num_workers=0 and train your model.
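Put together, this might look like the sketch below. The random tensors are placeholders for data that was loaded beforehand, and the device falls back to the CPU if no GPU is available:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder tensors; in practice these would hold the preloaded samples and labels.
data = torch.randn(256, 1, 32, 32)
target = torch.randint(0, 10, (256,))

# Push the complete data to the device once, before training starts.
data = data.to(device)
target = target.to(device)

dataset = TensorDataset(data, target)
# num_workers=0: worker processes would not help with data that is already on the GPU.
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=0)

for batch_data, batch_target in loader:
    # Batches are already on the device; no .to(device) call is needed here.
    pass

print(batch_data.device, len(loader))
```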

I don’t know how large each image is, but assuming you are using float32 images of the shape [3, 224, 224], a dataset of 25000 images will take approx. 25000*3*224*224*4/1024**3 = 14GB of GPU memory.
So you should consider whether you want to spend that much memory on the data alone.
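The same estimate can be written as a small helper, which makes it easy to try other shapes or dtypes. The function name tensor_footprint_gib is made up for this example; the uint8 line only shows how much the element size matters and is not a recommendation from the thread.

```python
import torch

def tensor_footprint_gib(shape, dtype=torch.float32):
    """Estimate the memory needed for a dense tensor of this shape, in GiB."""
    numel = 1
    for dim in shape:
        numel *= dim
    # element_size() returns the number of bytes per element for the dtype.
    bytes_per_element = torch.empty(0, dtype=dtype).element_size()
    return numel * bytes_per_element / 1024**3

as_float32 = tensor_footprint_gib((25000, 3, 224, 224))
as_uint8 = tensor_footprint_gib((25000, 3, 224, 224), dtype=torch.uint8)

print(f"float32: {as_float32:.2f} GiB, uint8: {as_uint8:.2f} GiB")
```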


Thank you. In your code snippet, what is “data”? I mean, what form is it in/ how is it initialized?

The images are gray scale - but the raw images are 1000x1000 so the full dataset is more than 20 GB. During training, they’re subsampled down to 32x32, though. Maybe I can figure out how to do this once on the CPU, then send the subsampled data to the GPU. Would that work?

I will check out TensorDataset.


data and target would be returned by the DataLoader in the typical loop:

for data, target in loader:
    data = data.to('cuda')
    target = target.to('cuda')

Ah OK, that would change the memory footprint to 25000*32*32*4/1024**2 = 97.7MB.

Yes that should work. You could iterate the Dataset once, loading and resizing each sample in its __getitem__ method and appending these samples to a list.
Once this is finished, you can use data_all = torch.stack(data_list) to create a tensor and save it via torch.save.

In your training, you would reload these samples using torch.load and push them to the device.
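A rough sketch of this workflow, under several assumptions: small random tensors stand in for the real grayscale images, the resize uses F.interpolate instead of whatever transform the real pipeline has, and ResizingDataset and the file name are hypothetical.

```python
import os
import tempfile

import torch
import torch.nn.functional as F
from torch.utils.data import Dataset

class ResizingDataset(Dataset):
    """Returns each image resized to a fixed size in __getitem__."""

    def __init__(self, images, size=(32, 32)):
        self.images = images
        self.size = size

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        img = self.images[index].unsqueeze(0)   # [1, C, H, W] for interpolate
        small = F.interpolate(img, size=self.size, mode="bilinear", align_corners=False)
        return small.squeeze(0)                 # back to [C, 32, 32]

# Stand-in for the raw images (the real ones would be read from disk).
raw_images = [torch.rand(1, 100, 100) for _ in range(8)]
dataset = ResizingDataset(raw_images)

# Step 1 (once, on the CPU): iterate, collect, stack, save.
data_list = [dataset[i] for i in range(len(dataset))]
data_all = torch.stack(data_list)
path = os.path.join(tempfile.mkdtemp(), "data_all.pt")  # hypothetical file name
torch.save(data_all, path)

# Step 2 (in the training script): reload and push to the device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
data_on_device = torch.load(path).to(device)
print(data_on_device.shape, data_on_device.device)
```

The targets could be stacked and saved the same way, so that both tensors can later be wrapped in a TensorDataset.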

Note, however, that this approach would limit your ability to apply data augmentation, as most of the torchvision.transforms are implemented for PIL.Images, which live in CPU memory.

Let me know, if that would work.

Can the images be sent to the gpu as PIL? (Rather than as tensors?) Basically, I’d split the transformation pipeline - subsample the PIL, send to the gpu, do the augmentation and tensor conversion on the gpu. Not that I really know how to do this yet ;^) but I’d like to know if this is a feasible approach.

Sorry if my questions are basic. It does seem that this issue of really large datasets must come up all the time in ML applications.