How to load all data into GPU for training

Hi, I am using a set of 1D data for training and I noticed that GPU usage is quite low (<5%) and training takes a very long time to finish. I profiled my code (following the instructions here: https://www.sagivtech.com/2017/09/19/optimizing-pytorch-training-code/) and found that 73% of the computation time is spent on loading data. I wonder if it is possible to load all the data into GPU memory to speed up training. I tried adding pin_memory=True to my code, but it told me “cannot pin ‘torch.cuda.FloatTensor’ only CPU memory can be pinned”. Does anyone have an idea how I should do this?

Thank you!

2 Likes

To speed up the transfer of data from the host to the device you could use pin_memory=True. However, if your data is already on the GPU, you don’t need this function and it will throw an error.
If you have enough GPU memory to hold the data and the model, this is certainly a valid approach to avoid the loading overhead.
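For reference, a minimal sketch of how pin_memory is meant to be used, assuming the tensors start out on the CPU (the names and sizes below are made up). Pinned host pages allow the subsequent .to('cuda', non_blocking=True) copy to overlap with computation:

import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical CPU tensors; pin_memory only works on CPU (host) memory
data = torch.randn(100000, 1)
targets = torch.randn(100000, 1)

loader = DataLoader(TensorDataset(data, targets),
                    batch_size=5000, shuffle=True,
                    pin_memory=True)  # batches are collated into pinned host memory

for x, y in loader:
    # The copy can be asynchronous because the source batch is pinned
    x = x.to('cuda', non_blocking=True)
    y = y.to('cuda', non_blocking=True)
    # ... forward / backward ...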

6 Likes

Thanks!

How do I know if my data are in GPU memory?

You would have to push it onto the GPU with data = data.to('cuda'). Then you can check the device using print(data.device).

5 Likes

I got cuda:0 as the output of print(data.device); does it mean all the data is already in GPU memory? If so, what might be the reason that the DataLoader takes 70% of the computation time?

Yes, it is on the first GPU device.
How do you time your code? Could you share a small example?
Are you processing the data somehow in your Dataset?

Following is the main part of my code (for simplicity I removed the implementation of the model and the loss function):

import torch
from torch.utils.data import TensorDataset, DataLoader

# data_1 and data_2 are the tensors already pushed to the GPU
dataset = TensorDataset(data_1, data_2)
train_loader = DataLoader(dataset, batch_size=5000, shuffle=True, drop_last=False)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0)
epochs = 100
for index_epoch in range(epochs):
    print(index_epoch)
    for x_1, x_2 in train_loader:
        optimizer.zero_grad()
        z_1 = model(x_1)  # model is the neural network model
        z_2 = model(x_2)
        loss = get_loss(z_1, z_2)
        loss.backward()
        optimizer.step()

I timed it using Python’s cProfile, and it turns out that 70% of the time is spent in dataloader.py, while less than 10% goes to the actual backward propagation. There is no additional processing of my data; I simply load it with TensorDataset and DataLoader. The shape of my data is (5000000, 1).

1 Like

Thanks for the information!
How big is your model? Could you profile it again with torch.utils.bottleneck?

Since CUDA operations run asynchronously, your DataLoader might have to wait for the CUDA ops to finish, thus reporting a misleading number.

An alternative would be to add torch.cuda.synchronize() calls before and after the forward and backward calls, and time it manually.
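As a rough illustration of that manual timing approach, here is a sketch with a toy linear model and random CUDA tensors standing in for the real model, loss, and data:

import time
import torch
import torch.nn as nn

# Toy stand-ins for the real model, loss, and data (assumptions for the sketch)
model = nn.Linear(1, 8).to('cuda')
x_1 = torch.randn(5000, 1, device='cuda')
x_2 = torch.randn(5000, 1, device='cuda')

torch.cuda.synchronize()          # make sure no CUDA work is still queued
t0 = time.time()

z_1 = model(x_1)
z_2 = model(x_2)
loss = (z_1 - z_2).pow(2).mean()  # placeholder for get_loss

torch.cuda.synchronize()          # forward pass has really finished here
t1 = time.time()

loss.backward()

torch.cuda.synchronize()          # backward pass has really finished here
t2 = time.time()

print('forward: {:.4f}s, backward: {:.4f}s'.format(t1 - t0, t2 - t1))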

5 Likes

Sorry for the late reply! I used the bottleneck module to profile it, and I still see that most of the running time is spent on loading data, while the time spent on backward is quite small:

------------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                       CPU time        CUDA time            Calls        CPU total       CUDA total
------------------  ---------------  ---------------  ---------------  ---------------  ---------------
stack                  878842.491us    1013487.305us                1     878842.491us    1013487.305us
stack                  868481.575us     998066.406us                1     868481.575us     998066.406us
stack                  861072.974us    1006662.109us                1     861072.974us    1006662.109us
stack                  860799.906us     995949.219us                1     860799.906us     995949.219us
stack                  216507.028us     249775.391us                1     216507.028us     249775.391us
stack                  213380.171us     247549.805us                1     213380.171us     247549.805us
ExpandBackward          20528.386us        114.746us                1      20528.386us        114.746us
sum                     20522.516us        110.840us                1      20522.516us        110.840us
_sum                    20509.686us        102.051us                1      20509.686us        102.051us
ExpandBackward          20494.596us         50.781us                1      20494.596us         50.781us
sum                     20489.766us         46.875us                1      20489.766us         46.875us
_sum                    20479.066us         41.016us                1      20479.066us         41.016us
mean                    11524.652us         51.270us                1      11524.652us         51.270us
mean                    11162.832us         66.406us                1      11162.832us         66.406us
mean                     9420.275us         62.500us                1       9420.275us         62.500us

Have you tried not using a DataLoader at all? If the dataset is 1D and sufficiently small, you could just preload it onto the GPU. Then write a custom loader that directly uses indices into your TensorDataset; that should significantly speed up your training epochs.

Sorry for the late post, but could you explain in a bit more detail how that should look?

  • given some dataset X, sufficiently small
  • create a TensorDataset by converting X to torch tensors, loading them onto your GPU (.to(your_gpu)), and using the built-in class
  • create a custom loading class, i.e. just a class that iterates through the TensorDataset (see the sketch after this list)
  • done
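A minimal sketch of what such a custom loading class could look like; GPUBatchIterator is a made-up name and the random 1D tensors stand in for the real data:

import torch
from torch.utils.data import TensorDataset

# Toy 1D dataset, assumed small enough to fit entirely into GPU memory
X = torch.randn(1000000, 1)
Y = torch.randn(1000000, 1)
dataset = TensorDataset(X.to('cuda'), Y.to('cuda'))

class GPUBatchIterator:
    # Batches a GPU-resident TensorDataset by indexing its tensors directly
    def __init__(self, dataset, batch_size, shuffle=True):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        n = len(self.dataset)
        order = torch.randperm(n, device='cuda') if self.shuffle else torch.arange(n, device='cuda')
        for start in range(0, n, self.batch_size):
            idx = order[start:start + self.batch_size]
            # Fancy indexing on CUDA tensors: no host-to-device copy happens here
            yield tuple(t[idx] for t in self.dataset.tensors)

loader = GPUBatchIterator(dataset, batch_size=5000)
for x, y in loader:
    pass  # x and y are already on the GPU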
1 Like

Is there an example of this you can point to? My dataset is roughly 1.5 GB and seems like it would fit entirely on the GPU. I’m currently using a DataLoader to feed minibatches to the GPU. I’m a newbie at PyTorch, but it seems like if the DataLoader (or some equivalent), as well as the model, were on the GPU, things would go much quicker.

1 Like

@ptrblck
is there a way to move the whole DataLoader to the GPU (if it has enough memory) after we create it like this:

train_loader = DataLoader(dataset, batch_size=5000, shuffle=True, drop_last=False)

I am going to iterate through train_loader and do batch.to(device) at every iteration.
Is there a way to move all the data to the GPU so I don’t have to call .to(device) in the loop?

1 Like

If you are preloading the data in your Dataset, you could directly push it to the GPU, e.g. in the __init__ method.
Usually you are lazily loading and processing each sample, which means you would have to push each sample to the device.
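A minimal sketch of the first case, with the actual file loading replaced by random tensors (the class name and sizes are made up):

import torch
from torch.utils.data import Dataset, DataLoader

class PreloadedGPUDataset(Dataset):
    # Loads the whole dataset once in __init__ and pushes it to the GPU
    def __init__(self, device='cuda'):
        # In a real Dataset this would read your files; random data is a stand-in
        self.data = torch.randn(100000, 1).to(device)
        self.targets = torch.randn(100000, 1).to(device)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # Samples are already on the GPU, so no per-sample .to(device) is needed
        return self.data[index], self.targets[index]

loader = DataLoader(PreloadedGPUDataset(), batch_size=5000, shuffle=True, num_workers=0)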

4 Likes

In the case of multiple GPUs, can we still do this? I have two GPUs, and each has enough memory to hold the data before training. If I load the data onto a single GPU and train on it, GPU utilization is 25% higher than when loading from the CPU at each batch. However, if I load the data onto the GPU and train with both GPUs, the performance is worse than loading from the CPU. I guess this happens because moving data from gpu0 to gpu1 becomes a bottleneck.

If you are using nn.DistributedDataParallel, each process could load only its subset of the data. nn.DataParallel creates a model replica on each device for every forward pass, splits the data tensor in the batch dimension (dim0), and sends a chunk of the data to each device.
I’m not sure which approach you are using, but DDP should be faster.
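A rough sketch of a DDP setup (not the poster’s actual code), assuming it is launched with torchrun --nproc_per_node=2 so that LOCAL_RANK is set; each process only sees its own shard of the data via DistributedSampler:

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import TensorDataset, DataLoader, DistributedSampler

dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# Toy model and data standing in for the real ones
model = DDP(nn.Linear(1, 8).cuda(local_rank), device_ids=[local_rank])
dataset = TensorDataset(torch.randn(100000, 1), torch.randn(100000, 1))

# Each process only iterates over its own subset of the indices
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=5000, sampler=sampler)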

I am currently using nn.DataParallel, but I will give DDP a try.

I believe I am facing the same issue. Can you point me to an example of how to load all the training data onto the GPU in the data loader? My images are currently on disk in labeled subdirectories (about 25,000 images).

Also - if the push to GPU is implemented within the data loader, I assume that this removes the ability to be device-agnostic, right?

(BTW, I have only one GPU device.)

If you push the complete dataset to the GPU, you can still use a DataLoader for batching and shuffling, but multiple workers won’t help (and might yield errors due to multiple CUDA context initializations).

Anyway, the easiest approach would be to load your data beforehand, push it to the GPU via:

data = data.to('cuda')
target = target.to('cuda')

and create a TensorDataset.
Once this is done, you could wrap it into a DataLoader with num_workers=0 and train your model.
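Put together, a sketch of that approach with made-up tensor shapes (the random tensors stand in for your preprocessed images and labels):

import torch
from torch.utils.data import TensorDataset, DataLoader

# Stand-ins for the preprocessed images and labels (shapes are assumptions)
data = torch.randn(25000, 3, 224, 224)
target = torch.randint(0, 10, (25000,))

data = data.to('cuda')      # needs roughly 14GB of GPU memory, see below
target = target.to('cuda')

dataset = TensorDataset(data, target)
# num_workers=0: worker processes are neither needed nor safe with CUDA tensors here
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=0)

for x, y in loader:
    pass  # batches are already on the GPU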

I don’t know how large each image is, but assuming you are using images of the shape [3, 224, 224], a dataset of 25000 images will take approx. 25000*3*224*224*4/1024**3 = 14GB of GPU memory.
So you should consider whether you want to dedicate that much memory to the data alone.

2 Likes