Hi, I am using a set of 1D data for training and I noticed that GPU usage is quite low (<5%) and training takes a very long time to finish. I profiled my code (following the instructions here: https://www.sagivtech.com/2017/09/19/optimizing-pytorch-training-code/) and found that 73% of the computation time is spent on loading data. I wonder if it is possible to load all of the data into GPU memory to speed up training. I tried to include pin_memory=True in my code, but it told me “cannot pin ‘torch.cuda.FloatTensor’ only CPU memory can be pinned”. Does anyone have an idea of how I should do this?
To speed up the transfer of data from the host to your device you could use pin_memory=True. However, if your data is already on the GPU, you don’t need this flag and it will throw an error, as you saw.
If you have enough GPU memory to hold the data and model this is surely a valid approach to save the loading overhead.
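For reference, here is a minimal sketch of the pin_memory approach for data that starts out on the CPU (the tensor names and batch size are borrowed from the code posted later in this thread; adjust as needed):

import torch
from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(data_1, data_2)  # data_1 / data_2 are CPU tensors here
train_loader = DataLoader(dataset, batch_size=5000, shuffle=True,
                          pin_memory=True, num_workers=2)

device = torch.device('cuda')
for x_1, x_2 in train_loader:
    # non_blocking=True only helps if the source tensor lives in pinned memory
    x_1 = x_1.to(device, non_blocking=True)
    x_2 = x_2.to(device, non_blocking=True)
    ...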
I got cuda:0 as the output of print(data.device); does that mean all of the data is already in GPU memory? If so, what might be the reason that the DataLoader takes 70% of the computation time?
Following is the main part of my code (for simplicity I removed the implementations of the model and the loss function):
dataset = TensorDataset(data_1, data_2)
train_loader = DataLoader(dataset, batch_size=5000, shuffle=True, drop_last=False)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0)
epochs = 100
for index_epoch in range(epochs):
    print(index_epoch)
    for x_1, x_2 in train_loader:
        optimizer.zero_grad()
        z_1 = model(x_1)  # model is the neural network model
        z_2 = model(x_2)
        loss = get_loss(z_1, z_2)
        loss.backward()
        # print(loss_list)
        optimizer.step()
I timed it using Python’s cProfile, and it turns out 70% of the time is spent in dataloader.py, while less than 10% is spent on the actual backward pass. There is no additional processing of my data; I simply load it with TensorDataset and DataLoader, and the shape of my data is (5000000, 1).
Sorry for the late reply! I used the torch.utils.bottleneck module to profile it, and I still see that most of the running time is spent on loading data, while the time spent on backward is quite small:
------------------ --------------- --------------- --------------- --------------- ---------------
Name                      CPU time       CUDA time           Calls       CPU total      CUDA total
------------------ --------------- --------------- --------------- --------------- ---------------
stack                 878842.491us   1013487.305us               1    878842.491us   1013487.305us
stack                 868481.575us    998066.406us               1    868481.575us    998066.406us
stack                 861072.974us   1006662.109us               1    861072.974us   1006662.109us
stack                 860799.906us    995949.219us               1    860799.906us    995949.219us
stack                 216507.028us    249775.391us               1    216507.028us    249775.391us
stack                 213380.171us    247549.805us               1    213380.171us    247549.805us
ExpandBackward         20528.386us       114.746us               1     20528.386us       114.746us
sum                    20522.516us       110.840us               1     20522.516us       110.840us
_sum                   20509.686us       102.051us               1     20509.686us       102.051us
ExpandBackward         20494.596us        50.781us               1     20494.596us        50.781us
sum                    20489.766us        46.875us               1     20489.766us        46.875us
_sum                   20479.066us        41.016us               1     20479.066us        41.016us
mean                   11524.652us        51.270us               1     11524.652us        51.270us
mean                   11162.832us        66.406us               1     11162.832us        66.406us
mean                    9420.275us        62.500us               1      9420.275us        62.500us
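(In case it helps anyone reproduce this kind of table: torch.utils.bottleneck is invoked from the command line against the training script, e.g.

python -m torch.utils.bottleneck my_script.py

where my_script.py stands in for the actual training script.)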
Have you tried not using a DataLoader? If the dataset is 1D and sufficiently small, you could just preload it onto the GPU and then write a custom loader that directly uses indices into your TensorDataset; that should significantly speed up your training epochs.
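Something along these lines should work (just a rough sketch, reusing the tensor and variable names from the code posted above and assuming both tensors fit in GPU memory):

device = torch.device('cuda')
data_1 = data_1.to(device)  # the whole dataset lives on the GPU from here on
data_2 = data_2.to(device)

batch_size = 5000
n = data_1.size(0)
for index_epoch in range(epochs):
    perm = torch.randperm(n, device=device)  # shuffle indices directly on the GPU
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]
        x_1 = data_1[idx]  # pure GPU indexing, no host-to-device copy per batch
        x_2 = data_2[idx]
        # ... same forward/backward steps as before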
Is there a more complete example of this you can point to? My dataset is roughly 1.5 GB and it seems like it would fit entirely on the GPU. I’m currently using a DataLoader to feed minibatches to the GPU. I’m a newbie at PyTorch, but it seems like if the DataLoader (or some equivalent), as well as the model, were on the GPU, things would go much quicker.
I am going to iterate through train_loader and call batch.to(device) in every iteration.
Is there a way to move all of the data to the GPU up front, so I don’t have to call .to(device) in every iteration?
If you are preloading the data in your Dataset, you could directly push it to the GPU, e.g. in the __init__ method.
Usually you are lazily loading and processing each sample, which means you would have to push each sample to the device.
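A rough sketch of the first case (the class name is made up; this assumes the full tensors fit in GPU memory, and the DataLoader wrapping it should use num_workers=0):

import torch
from torch.utils.data import Dataset

class GPUTensorDataset(Dataset):
    def __init__(self, data, target, device='cuda'):
        # push everything to the GPU once, up front
        self.data = data.to(device)
        self.target = target.to(device)

    def __getitem__(self, index):
        # returns GPU tensors, so no .to(device) is needed in the training loop
        return self.data[index], self.target[index]

    def __len__(self):
        return self.data.size(0)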
In the case of multiple GPUs, can we still do this? I have two GPUs, each with enough memory to hold the data before training. If I load the data onto the GPU and train with a single GPU, the GPU utilization is 25% higher than when loading from the CPU at each batch. However, if I load the data onto the GPU and train with two GPUs, the performance is worse than loading from the CPU. I guess this happens because moving the data from gpu0 to gpu1 becomes a bottleneck.
If you are using nn.DistributedDataParallel, each process could load only its own subset of the data. nn.DataParallel creates a model replica on each device for each forward pass, splits the data tensor in the batch dimension (dim0), and sends a chunk of the data to each device.
I’m not sure which approach you are using, but DDP should be faster.
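For reference, the nn.DataParallel variant is just a thin wrapper (a sketch only; the DDP setup additionally needs the usual process-group initialization, which is described in the PyTorch docs):

import torch.nn as nn

model = nn.DataParallel(model, device_ids=[0, 1])
model.to('cuda:0')    # parameters live on the first device
output = model(data)  # data is split along dim0 and scattered to both GPUs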
I believe I am facing the same issue. Can you point me to an example of how to load all of the training data onto the GPU in the data loader? My images are currently on disk in labeled subdirectories (about 25,000 images).
Also, if the push to the GPU is implemented within the data loader, I assume this removes the ability to be device-agnostic, right?
If you push the complete dataset to the GPU, you could still use a DataLoader for batching and shuffling, but multiple workers won’t help (and might yield errors due to multiple CUDA context initializations).
Anyway, the easiest approach would be to load your data beforehand, push it to the GPU via:
data = data.to('cuda')
target = target.to('cuda')
and create a TensorDataset.
Once this is done, you could wrap it into a DataLoader with num_workers=0 and train your model.
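Putting those steps together, a minimal sketch might look like this (the tensor names, batch size, and loop body are placeholders):

import torch
from torch.utils.data import TensorDataset, DataLoader

data = data.to('cuda')
target = target.to('cuda')

dataset = TensorDataset(data, target)
# num_workers=0, since worker processes cannot easily share the CUDA tensors
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=0)

for x, y in loader:
    # x and y are already on the GPU, no .to(device) call needed here
    output = model(x)
    ...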
I don’t know how large each image is, but assuming you are using images of shape [3, 224, 224], a dataset of 25000 images will take approx. 25000*3*224*224*4/1024**3 ≈ 14 GB of GPU memory.
So you should consider whether you want to spend that much memory on the data alone.