You would have to push it onto the GPU with
data = data.to('cuda'). Then you can check the device using print(data.device).
I get cuda:0 as the output of print(data.device); does that mean all the data is already in GPU memory? If so, what might be the reason that the dataloader takes 70% of the computation time?
Yes, it is on the first GPU device.
How do you time your code? Could you share a small example?
Are you processing the data somehow in your Dataset?
The following is the main part of my code (for simplicity, I've removed the implementation of the model and the loss function):
import torch
from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(data_1, data_2)
train_loader = DataLoader(dataset, batch_size=5000, shuffle=True, drop_last=False)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0)

epochs = 100
for index_epoch in range(epochs):
    print(index_epoch)
    for x_1, x_2 in train_loader:
        optimizer.zero_grad()
        z_1 = model(x_1)  # model is the neural network model
        z_2 = model(x_2)
        loss = get_loss(z_1, z_2)
        loss.backward()
        # print(loss_list)
        optimizer.step()
I time it using Python's cProfile, and it turns out that 70% of the time is spent in dataloader.py, while less than 10% is spent on the actual backward propagation. There is no additional processing of my data; I simply load it with DataLoader. The shape of my data is (5000000, 1).
Thanks for the information!
How big is your model? Could you profile it again with torch.utils.bottleneck?
Since CUDA operations are run asynchronously, your
DataLoader might have to wait for the CUDA op to finish, thus reporting a false number.
An alternative would be to add
torch.cuda.synchronize() calls before and after the forward and backward calls, and time it manually.
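For illustration, a manual timing sketch could look roughly like this (reusing model, get_loss, x_1, and x_2 from the snippet above; the exact placement of the timers is up to you):

import time
import torch

torch.cuda.synchronize()   # wait for all pending CUDA ops before starting the timer
t0 = time.time()

z_1 = model(x_1)
z_2 = model(x_2)
loss = get_loss(z_1, z_2)
loss.backward()

torch.cuda.synchronize()   # make sure the backward pass has actually finished
print('forward + backward took {:.3f}s'.format(time.time() - t0))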
Sorry for the late reply! I used the bottleneck module to profile it, and I still see that most of the running time is spent on loading data, while the time spent on backward is quite small:
------------------  ---------------  ---------------  -----  ---------------  ---------------
Name                CPU time         CUDA time        Calls  CPU total        CUDA total
------------------  ---------------  ---------------  -----  ---------------  ---------------
stack               878842.491us     1013487.305us    1      878842.491us     1013487.305us
stack               868481.575us     998066.406us     1      868481.575us     998066.406us
stack               861072.974us     1006662.109us    1      861072.974us     1006662.109us
stack               860799.906us     995949.219us     1      860799.906us     995949.219us
stack               216507.028us     249775.391us     1      216507.028us     249775.391us
stack               213380.171us     247549.805us     1      213380.171us     247549.805us
ExpandBackward      20528.386us      114.746us        1      20528.386us      114.746us
sum                 20522.516us      110.840us        1      20522.516us      110.840us
_sum                20509.686us      102.051us        1      20509.686us      102.051us
ExpandBackward      20494.596us      50.781us         1      20494.596us      50.781us
sum                 20489.766us      46.875us         1      20489.766us      46.875us
_sum                20479.066us      41.016us         1      20479.066us      41.016us
mean                11524.652us      51.270us         1      11524.652us      51.270us
mean                11162.832us      66.406us         1      11162.832us      66.406us
mean                9420.275us       62.500us         1      9420.275us       62.500us
Have you tried not using a DataLoader at all? If the dataset is 1D and sufficiently small, you could just preload it onto the GPU and then write a custom loader that directly uses indices into your TensorDataset; that should significantly speed up your training epochs.
Sorry for the late post, but could you tell me a bit more about how this should look?
- given some dataset X, sufficiently small
- create a TensorDataset: convert X to torch.tensors, load them onto your GPU (.to(your_gpu)), and use the built-in TensorDataset class
- create a custom loading class, i.e. just a class that iterates through the TensorDataset by index, as sketched below
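A minimal sketch of the idea, reusing data_1 and data_2 from the snippet earlier in the thread (the GPUBatchLoader name and the batching scheme are just illustrative, not a built-in API):

import torch
from torch.utils.data import TensorDataset

device = torch.device('cuda')
dataset = TensorDataset(data_1.to(device), data_2.to(device))  # everything lives on the GPU

class GPUBatchLoader:
    # iterates over a GPU-resident TensorDataset by indexing, without any workers
    def __init__(self, dataset, batch_size, shuffle=True):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.device = dataset.tensors[0].device

    def __iter__(self):
        n = len(self.dataset)
        indices = torch.randperm(n, device=self.device) if self.shuffle else torch.arange(n, device=self.device)
        for start in range(0, n, self.batch_size):
            idx = indices[start:start + self.batch_size]
            # index the underlying tensors directly; no per-sample copies or collation
            yield tuple(t[idx] for t in self.dataset.tensors)

train_loader = GPUBatchLoader(dataset, batch_size=5000)
for x_1, x_2 in train_loader:
    ...  # forward / backward as before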
Is there an example of this you can point to? My dataset is roughly 1.5GB and seems like it would fit entirely on GPU. I’m currently using DataLoader to feed minibatches to the GPU. I’m a newb at pytorch, but it seems like if the Dataloader (or some equivalent) as well as the model were on the GPU, things would go much quicker.
Is there a way to put the whole dataloader on the GPU (if it has enough memory) after we create the dataloader like this:
train_loader = DataLoader(dataset, batch_size=5000, shuffle=True, drop_last=False)
Currently I iterate through train_loader and call batch.to(device) at every iteration.
Is there a way to move all the data to the GPU so I don't have to call .to(device) in every iteration?
If you are preloading the data in your
Dataset, you could directly push it to the GPU, e.g. in its __init__ method.
Usually you are lazily loading and processing each sample, which means you would have to push each sample to the device.
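For example, such a preloading Dataset could look roughly like this (a sketch with made-up names; PreloadedGPUDataset is not an existing class):

import torch
from torch.utils.data import Dataset

class PreloadedGPUDataset(Dataset):
    def __init__(self, data, target, device='cuda'):
        # push the complete tensors to the device once, up front
        self.data = data.to(device)
        self.target = target.to(device)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # indexing a CUDA tensor returns a CUDA tensor, so no per-sample transfer is needed
        return self.data[index], self.target[index]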
In the case of multiple GPUs, can we still do this? I have two GPUs, and each has enough memory to load the data before training. If I load the data onto the GPU and train with a single GPU, GPU utilization is 25% higher than when loading from the CPU at each batch. However, if I load the data to the GPUs and train with two GPUs, the performance is worse than loading from the CPU. I guess this happens because moving data from gpu0 to gpu1 becomes a bottleneck.
If you are using
nn.DistributedDataParallel, each process could load only its own subset of the data.
nn.DataParallel creates a model replica on each device for each forward pass, splits the data tensor in the batch dimension (dim0), and sends a chunk of the data to each device.
I’m not sure which approach you are using, but DDP should be faster.
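For reference, the single-process variant is just a wrapper around the model (a rough sketch; model and x are placeholders, and DDP would additionally need a process group and one process per GPU):

import torch.nn as nn

model = model.to('cuda:0')                         # move the model to the default device first
model = nn.DataParallel(model, device_ids=[0, 1])  # replicas are created on each forward pass

output = model(x)  # x is split along dim0 and each chunk is sent to one of the GPUs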
I am currently using
nn.DataParallel but I will give DDP a try.
I believe I am facing the same issue. Can you point me to an example of how to load all the training data onto the GPU in the data loader? My images are currently on disk in labeled subdirectories (about 25,000 images).
Also - if the push to GPU is implemented within the data loader, I assume that this removes the ability to be device-agnostic, right?
(BTW, I have only one GPU device.)
If you push the complete data to the GPU, you could still use a
DataLoader for batching and shuffling, but multiple workers won't help much (and might yield errors due to multiple CUDA context initializations).
Anyway, the easiest approach would be to load your data beforehand, push it to the GPU via:
data = data.to('cuda')
target = target.to('cuda')
and create a TensorDataset from them. Once this is done, you could wrap it into a DataLoader with num_workers=0 and train your model.
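Putting these steps together, a sketch could look like this (batch_size and the model call are placeholders; data and target are the full, preloaded tensors):

import torch
from torch.utils.data import TensorDataset, DataLoader

data = data.to('cuda')
target = target.to('cuda')

dataset = TensorDataset(data, target)
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=0)

for batch_data, batch_target in loader:
    # the batches are already on the GPU, so no .to(device) call is needed here
    output = model(batch_data)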
I don’t know how large each image is, but assuming you are using images of the shape
[3, 224, 224], a dataset of 25000 images will take approx.
25000*3*224*224*4/1024**3 = 14GB of GPU memory.
So you should consider whether you want to spend that much memory on the data alone.
Thank you. In your code snippet, what is “data”? I mean, what form is it in/ how is it initialized?
The images are gray scale - but the raw images are 1000x1000 so the full dataset is more than 20 GB. During training, they’re subsampled down to 32x32, though. Maybe I can figure out how to do this once on the CPU, then send the subsampled data to the GPU. Would that work?
I will check out TensorDataset.
data and target would be returned by the DataLoader in the typical loop:
for data, target in loader:
    data = data.to('cuda')
    target = target.to('cuda')
Ah OK, that would change the memory footprint to
25000*32*32*4/1024**2 = 97.7MB.
Yes that should work. You could iterate the
Dataset once, loading and resizing each sample in its
__getitem__ method and appending these samples to a list.
Once this is finished, you can use
data_all = torch.stack(data_list) to create a tensor and save it via torch.save.
In your training, you would reload these samples using
torch.load and push them to the device.
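A rough sketch of that caching step, assuming the images can be read with torchvision.datasets.ImageFolder (the folder path, the file name, and the 32x32 size are placeholders):

import torch
from torchvision import datasets, transforms

# load each image once, convert it to grayscale, resize it to 32x32, and cache everything
transform = transforms.Compose([
    transforms.Grayscale(),
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder('path/to/images', transform=transform)

data_list, target_list = [], []
for data, target in dataset:          # __getitem__ loads and resizes one sample at a time
    data_list.append(data)
    target_list.append(target)

data_all = torch.stack(data_list)     # shape: [N, 1, 32, 32]
target_all = torch.tensor(target_list)
torch.save({'data': data_all, 'target': target_all}, 'subsampled_dataset.pt')

# later, during training
checkpoint = torch.load('subsampled_dataset.pt')
data_all = checkpoint['data'].to('cuda')
target_all = checkpoint['target'].to('cuda')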
Note, however, that this approach would limit your ability to apply data augmentation, as most of the torchvision.transforms are implemented for PIL.Images, which use numpy arrays under the hood.
Let me know if that would work.
Can the images be sent to the gpu as PIL? (Rather than as tensors?) Basically, I’d split the transformation pipeline - subsample the PIL, send to the gpu, do the augmentation and tensor conversion on the gpu. Not that I really know how to do this yet ;^) but I’d like to know if this is a feasible approach.
Sorry if my questions are basic. It does seem that this issue of really large datasets must come up all the time in ML applications.