How to load all data into GPU for training

(Wei Chen) #1

Hi, I am using a set of 1D data for training, and I noticed that GPU usage is quite low (<5%) and training takes a very long time to finish. I profiled my code (following the instructions here: ) and found that 73% of the computation time is spent on loading data. I wonder if it is possible to load all the data into GPU memory to speed up training. I tried setting pin_memory=True in my code, but it told me “cannot pin ‘torch.cuda.FloatTensor’ only CPU memory can be pinned”. Does anyone have an idea how I should do this?

Thank you!


To speed up the transfer of data from the host to your device, you could use pin_memory=True. However, pinning only applies to CPU memory, so if your data is already on the GPU you don’t need it, and it will throw an error.
If you have enough GPU memory to hold both the data and the model, keeping everything on the GPU is a valid way to avoid the loading overhead.
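
To make the two options concrete, here is a minimal sketch. The tensor size and names (cpu_data, gpu_data) are placeholders rather than anything from the original post, and the code falls back to the CPU when no CUDA device is present.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder data; the real dataset in this thread has shape (5000000, 1).
cpu_data = torch.randn(10_000, 1)

# Option A: keep the data on the CPU and let the DataLoader pin each batch.
# Pinning is only requested when a CUDA device is actually present.
loader = DataLoader(
    TensorDataset(cpu_data),
    batch_size=5000,
    pin_memory=torch.cuda.is_available(),
)
for (batch,) in loader:
    # non_blocking=True allows the copy to overlap with compute
    # when the source tensor lives in pinned memory.
    batch = batch.to(device, non_blocking=True)

# Option B: move everything to the device once and do not use pin_memory.
gpu_data = cpu_data.to(device)
```

With option B the tensors handed to the DataLoader are no longer CPU tensors, which is why combining it with pin_memory=True raises the error quoted above.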

(Wei Chen) #3


How do I know if my data are in GPU memory?


You would have to push it onto the GPU with data = data.to('cuda'). Then you can check the device using print(data.device).
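
A small sketch of that check, assuming a placeholder tensor named data. Note that .to() returns a new tensor instead of modifying the existing one, so the result has to be assigned back.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

data = torch.randn(1000, 1)   # placeholder tensor, created on the CPU
print(data.device)            # cpu

# .to() is not in-place: without the assignment, `data` would stay on the CPU.
data = data.to(device)
print(data.device)            # cuda:0 on a machine with a GPU, otherwise cpu
print(data.is_cuda)           # True only if the tensor lives on a CUDA device
```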

(Wei Chen) #5

I got cuda:0 as the output of print(data.device). Does that mean all the data are already in GPU memory? If so, what might be the reason that the dataloader takes 70% of the computation time?


Yes, it is on the first GPU device.
How do you time your code? Could you share a small example?
Are you processing the data somehow in your Dataset?

(Wei Chen) #7

The following is the main part of my code (for simplicity I have removed the implementation of the model and the loss function):

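import torch
from torch.utils.data import DataLoader, TensorDataset

# data_1, data_2: tensors of shape (5000000, 1); model and get_loss are defined elsewhere
dataset = TensorDataset(data_1, data_2)
train_loader = DataLoader(dataset, batch_size=5000, shuffle=True, drop_last=False)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0)
epochs = 100
for index_epoch in range(epochs):
    print(index_epoch)
    for x_1, x_2 in train_loader:
        optimizer.zero_grad()
        z_1 = model(x_1)  # model is the neural network model
        z_2 = model(x_2)
        loss = get_loss(z_1, z_2)
        # backward pass and optimizer step (not shown in the original snippet)
        loss.backward()
        optimizer.step()
        # print loss_list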

I time it using Python's cProfile, and it turns out that about 70% of the time is spent on loading data, while less than 10% is spent on the actual backward propagation. There is no additional processing of my data; I simply load it with TensorDataset and DataLoader. The shape of my data is (5000000, 1).


Thanks for the information!
How big is your model? Could you profile it again with torch.utils.bottleneck?

Since CUDA operations are run asynchronously, your DataLoader might have to wait for the CUDA op to finish, thus reporting a false number.

An alternative would be to add torch.cuda.synchronize() calls before and after the forward and backward calls, and time it manually.
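
A rough sketch of that manual timing follows. The model (a single Linear layer) and the loss are hypothetical stand-ins, since the real ones are not shown in this thread, and the synchronize call is skipped when no GPU is available.

```python
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def sync():
    # Wait for all queued CUDA kernels to finish; a no-op on CPU-only machines.
    if device.type == "cuda":
        torch.cuda.synchronize()

# Hypothetical stand-ins for the real model and one batch of data.
model = torch.nn.Linear(1, 16).to(device)
x = torch.randn(5000, 1, device=device)

sync()                               # make sure no earlier work is still pending
start = time.perf_counter()
loss = model(x).pow(2).mean()        # forward pass with a placeholder loss
loss.backward()                      # backward pass
sync()                               # wait until the GPU has actually finished
elapsed = time.perf_counter() - start

print(f"forward + backward: {elapsed * 1e3:.3f} ms")
```

Without the second synchronize call, the timer could stop while kernels are still running, and the remaining time would be attributed to whatever Python code runs next, for example the DataLoader.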

(Wei Chen) #9

Sorry for the late reply! I used the bottleneck module to profile it, and I still see that most of the running time is spent on loading data, while the time spent on backward is quite small:

------------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                       CPU time        CUDA time            Calls        CPU total       CUDA total
------------------  ---------------  ---------------  ---------------  ---------------  ---------------
stack                  878842.491us    1013487.305us                1     878842.491us    1013487.305us
stack                  868481.575us     998066.406us                1     868481.575us     998066.406us
stack                  861072.974us    1006662.109us                1     861072.974us    1006662.109us
stack                  860799.906us     995949.219us                1     860799.906us     995949.219us
stack                  216507.028us     249775.391us                1     216507.028us     249775.391us
stack                  213380.171us     247549.805us                1     213380.171us     247549.805us
ExpandBackward          20528.386us        114.746us                1      20528.386us        114.746us
sum                     20522.516us        110.840us                1      20522.516us        110.840us
_sum                    20509.686us        102.051us                1      20509.686us        102.051us
ExpandBackward          20494.596us         50.781us                1      20494.596us         50.781us
sum                     20489.766us         46.875us                1      20489.766us         46.875us
_sum                    20479.066us         41.016us                1      20479.066us         41.016us
mean                    11524.652us         51.270us                1      11524.652us         51.270us
mean                    11162.832us         66.406us                1      11162.832us         66.406us
mean                     9420.275us         62.500us                1       9420.275us         62.500us


Have you tried not using a DataLoader …? If the dataset is 1D and sufficiently small, you could just preload it onto the GPU. Then write a custom loader that indexes directly into your TensorDataset; this should significantly speed up your training epochs.

(Rick Sanchez) #11

Sorry for the late post, but could you say a bit more about what this should look like?

  • given some dataset X, sufficiently small
  • create a TensorDataset: convert X to torch tensors, load them onto your GPU (.to(your_gpu)), and use the built-in class
  • create a custom loading class, i.e. just a class that iterates through the TensorDataset
  • done
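
The steps above could be sketched roughly as follows. GPUBatchLoader is a hypothetical name, not a PyTorch class; it assumes the whole TensorDataset already lives on one device and fetches each batch with a single index operation instead of collecting samples one at a time (the many stack calls in the profile above most likely come from that per-sample collation).

```python
import torch
from torch.utils.data import TensorDataset

class GPUBatchLoader:
    """Hypothetical minimal loader for a TensorDataset that is already on the device."""

    def __init__(self, dataset, batch_size, shuffle=True):
        self.tensors = dataset.tensors          # tuple of tensors with equal first dimension
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.n = self.tensors[0].shape[0]

    def __len__(self):
        # Number of batches per epoch; the last batch may be smaller.
        return (self.n + self.batch_size - 1) // self.batch_size

    def __iter__(self):
        dev = self.tensors[0].device
        # Build the sample order on the same device as the data.
        if self.shuffle:
            order = torch.randperm(self.n, device=dev)
        else:
            order = torch.arange(self.n, device=dev)
        for start in range(0, self.n, self.batch_size):
            idx = order[start:start + self.batch_size]
            # One indexing operation per tensor yields the whole batch.
            yield tuple(t[idx] for t in self.tensors)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
data_1 = torch.randn(12_000, 1, device=device)   # placeholder data
data_2 = torch.randn(12_000, 1, device=device)
loader = GPUBatchLoader(TensorDataset(data_1, data_2), batch_size=5000)

batch_sizes = [x_1.shape[0] for x_1, x_2 in loader]
print(batch_sizes)  # [5000, 5000, 2000]
```

In the training loop, this loader can take the place of train_loader without further changes, since it yields (x_1, x_2) tuples in the same way.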