How to prefetch data when processing with GPU?

Hi everyone,

I’m new to Pytorch/torch. Is there any deme codes for prefetching data with another process during GPU doing computation?

Thank you very much.



Pytorch provides the dataloader for some common vision datasets.

You could also follow what’s in, and build you own.

Got it! thank you very much!

To complement @ruotianluo’s answer, here you can find an example usage of data loaders. Also keep in mind that they work with all kinds of datasets, not only vision! (A dataset is just an object that defines two methods as in here).

1 Like

@apaszke I wonder if there’s a prefetch option on GPU in PyTorch, just as in Caffe For example, my model is small, and I can trade GPU memory for using fewer workers or less fast HDD. Is this possible in PyTorch? This is especially an issue if I train multiple small models on PyTorch on separate cards of a single machine, as I run out of workers.

1 Like

there isn’t a prefetch option, but you can write a custom Dataset that just loads the entire data on GPU and returns samples from in-memory. In that case you can just use 0 workers in your DataLoader

1 Like

Thanks. I want a prefetching option when dealing with dataset like ImageNet, and it should work with multi GPU. The approach you proposed would not work in that case.

1 Like

if there is no prefetch, the GPUs will not be fully occupied due to waiting for data loading and preprocessing. is this understanding correct?

1 Like

we already have prefetch (see the imagenet or dcgan examples), but we dont prefetch directly onto the GPU. We prefetch onto CPU, do data augmentation and then we put the mini-batch in CUDA pinned memory (on CPU) so that GPU transfer is very fast. Then we give data to network to transfer to GPU and train.


Using prefetch seems to decrease speed in my case. I can run ~100 examples/second using num_workers = 0. But only run ~50 examples/second using num_workers = 1. The more workers, the slower the speed. Is there any reasons behind this?


Same situation in my code. The dataloader can’t make full use of the cpus.


Please let me know if I can make the full use of my CPU. I tried overclocking my GPU

1 Like

Is data augmentation done in a single thread even when data loader is multi-threaded with num_workers > 2?

1 Like

Same situation.

I have 12 cores, I find that the dataloader can not make full use of them. It can only make about 60% ability of them. I set the num_worker as 12 and my IO bound is enough. For training network like resnet, the data preprocess time cost more than 70% time in the whole process.

I have a computation bound augment process, and I set num_workers=50.
Then there’s a obvious delay in data loading every 50 batches. It seems that each worker prepare only one batch and wait still until next invocation. CPU and IO utilities are all low.

I use the imagenet example.
Here is some logging.

Epoch: [0][0/1124]      Time 100.303 (100.303)  Data 95.814 (95.814)    Loss 2.1197 (2.1197)    Acc@1 1.562 (1.562)     Acc@2 51.562 (51.562)
Epoch: [0][10/1124]     Time 0.422 (9.584)      Data 0.011 (8.753)      Loss 2.2578 (2.1322)    Acc@1 4.297 (4.865)     Acc@2 6.250 (20.312)
Epoch: [0][20/1124]     Time 0.288 (5.191)      Data 0.078 (4.606)      Loss 2.3653 (2.1892)    Acc@1 16.406 (7.310)    Acc@2 63.281 (21.001)
Epoch: [0][30/1124]     Time 0.288 (3.604)      Data 0.071 (3.134)      Loss 2.2124 (2.1810)    Acc@1 23.047 (10.144)   Acc@2 46.094 (28.679)
Epoch: [0][40/1124]     Time 0.295 (2.795)      Data 0.094 (2.382)      Loss 2.2084 (2.1984)    Acc@1 2.344 (11.719)    Acc@2 3.906 (30.164)
Epoch: [0][50/1124]     Time 0.434 (2.312)      Data 0.098 (1.927)      Loss 2.0883 (2.1817)    Acc@1 27.344 (11.060)   Acc@2 50.391 (28.592)
Epoch: [0][60/1124]     Time 30.907 (2.497)     Data 30.285 (2.119)     Loss 2.0824 (2.1731)    Acc@1 19.531 (12.897)   Acc@2 41.406 (31.276)
Epoch: [0][70/1124]     Time 0.816 (2.326)      Data 0.102 (1.933)      Loss 2.0904 (2.1663)    Acc@1 22.656 (14.277)   Acc@2 43.359 (33.291)
Epoch: [0][80/1124]     Time 0.192 (2.094)      Data 0.065 (1.707)      Loss 2.1167 (2.1607)    Acc@1 21.094 (15.355)   Acc@2 42.188 (34.891)
Epoch: [0][90/1124]     Time 0.788 (1.904)      Data 0.168 (1.528)      Loss 2.2529 (2.1558)    Acc@1 25.781 (16.303)   Acc@2 46.484 (36.178)
Epoch: [0][100/1124]    Time 0.324 (1.749)      Data 0.212 (1.385)      Loss 2.1106 (2.1537)    Acc@1 21.484 (16.901)   Acc@2 48.438 (37.233)
Epoch: [0][110/1124]    Time 0.414 (1.633)      Data 0.002 (1.267)      Loss 2.0465 (2.1491)    Acc@1 23.438 (17.325)   Acc@2 48.438 (37.968)
Epoch: [0][120/1124]    Time 45.406 (1.906)     Data 44.589 (1.537)     Loss 2.2800 (2.1496)    Acc@1 20.703 (17.859)   Acc@2 42.578 (38.598)
Epoch: [0][130/1124]    Time 0.591 (1.824)      Data 0.007 (1.454)      Loss 2.0338 (2.1466)    Acc@1 19.141 (18.079)   Acc@2 45.703 (39.054)
Epoch: [0][140/1124]    Time 0.510 (1.765)      Data 0.184 (1.397)      Loss 2.1249 (2.1457)    Acc@1 21.875 (18.426)   Acc@2 49.609 (39.583)
Epoch: [0][150/1124]    Time 0.203 (1.670)      Data 0.004 (1.308)      Loss 2.1863 (2.1450)    Acc@1 20.703 (18.755)   Acc@2 42.188 (39.929)
Epoch: [0][160/1124]    Time 0.269 (1.589)      Data 0.084 (1.231)      Loss 2.2051 (2.1434)    Acc@1 23.828 (19.031)   Acc@2 48.047 (40.302)
Epoch: [0][170/1124]    Time 0.281 (1.520)      Data 0.077 (1.163)      Loss 2.1192 (2.1403)    Acc@1 21.875 (19.301)   Acc@2 44.531 (40.634)
Epoch: [0][180/1124]    Time 40.498 (1.675)     Data 39.783 (1.321)     Loss 2.0573 (2.1410)    Acc@1 24.609 (19.572)   Acc@2 47.656 (40.981)

I am experiencing the same issue. No matter how many data loading workers I select, there’s always a delay after their batches have been processed.

have the same issue too.
using tfrecord or mxnet’s rec file to load data doesn’t have this issue.

Same situation.
I moved from tensorflow’s Dataset API that works like a charm.

But dataloader loads batch and waits for model optimization (but it shouldn’t wait!)


I’ve spent some time working on this problem over a variety of projects. I’ve cut and pasted some past thoughts, bullets, etc from previous discussions. My background involves architecting systems that move large volumes of data from network cards to storage and then back again on request, with processing between the steps. A very similar set of concerns.

The two main constraints that usually dominate your PyTorch training performance and ability to saturate the shiny GPUs are your total CPU IPS (instructions per second) and your storage IOPS (I/O per second).

You want the CPUs to be performing preprocessing, decompression, and copying – to get the data to the GPU. You don’t want them to be idling or busy-waiting for thread/process synchronization primitives, IO, etc. The easiest way to improve CPU utilization with the PyTorch is to use the worker process support built into Dataloader. The preprocessing that you do in using those workers should use as much native code and as little Python as possible. Use Numpy, PyTorch, OpenCV and other libraries with efficient vectorized routines that are written in C/C++. Looping through your data byte by byte in Python will kill your performance, massacre the memory allocator, etc.

With most common use cases, the preprocessing is done well enough to not be an issue. Things tend to fall apart dramatically hitting the IOPS limit of your storage. Most simple PyTorch datasets tend to use media stored in individual files.

  • Modern filesystems are good, but when you have thousands of small files and you’re trying to move GB/s of data, reading each file individually can saturate your IOPS long before you can ever maximize GPU or CPU utilization.
  • Just opening a file by name with PIL can be an alarming number of disk seeks (profile it)
  • Quick fix: buy an NVME SSD drive, or two.
  • SATA SSD is not necessarily going to cut it, you can saturate them with small to medium image files + default loader/dataset setups feeding multiple GPUs.
  • Magnetic drives are going to fall on their face
  • If you are stuck with certain drives or max out the best, the next move requires more advanced caching, prefetching, on-disk format – move to an index/manifest + record based bulk data (like tfrecord or RecordIO) or an efficient memory-mapped/paged in-process DB
  • I’ve leveraged LMDB successfully with PyTorch and a custom simplification of the Python LMDB module. My branch here ( I didn’t document or explain what I did there or why, ask if curious.
  • Beyond an optimal number (experiment!), throwing more worker processes at the IOPS barrier WILL NOT HELP, it’ll make it worse. You’ll have more processes trying to read files at the same time, and you’ll be increasing the shared memory consumption by significant amounts for additional queuing, thus increasing the paging load on the system and possibly taking you into thrashing territory that the system may never recover from
  • Once you have saturated the IOPS of your storage or taxed the memory subsystem and entered into a thrashing situation, it won’t look like you’re doing a whole lot. There will be a lot of threads/processes (including kernel ones) basically not doing much besides waiting for IO, page faults, etc. Behaviour will usually be sporadic and bursty once you cross the line of what can be sustained by your system, much like network link utilization without flow control (queuing theory).

Other pointers for a fast training setup with minimal work over the defaults:

  • Employ some of the optimizations in NVIDIA’s examples ( NVIDIA’s fast_collate and prefetch loader w/ GPU normalization step do help a bit.
  • I’ve seen big gains over torch.DataParallel using apex.DistributedDataParallel. Moving from ‘one main process + worker process + multiple-GPU with DataParallel’ to 'one process-per GPU with apex (and presumably torch)
  • DistributedDataParallel has always improved performance for me. Remember to (down)scale your worker processes per training process accordingly. Higher GPU utilization and less waiting for synchronization usually results, the variance in batch times will reduce with the average time moving closer to the peak.
  • Use SIMD fork of Pillow with default PyTorch transforms, or write your own OpenCV image processing and loading routines
  • Don’t leave the dataloader pin_memory=‘True’ on by default in your code. There was a reason why PyTorch authors left it as False. I’ve run into many situations where True definitely does cause extremely negative paging/memory subsystem impact . Try both.

An observation on the tfrecord/recordio chunking. For IO, even flash based, randomness is bad, sequential chunks are good. Hugely so when you have to move physical disk heads. The random/shuffled nature of training is thus worst case. When you see gains using record/chunked data, it’s largely due to the fact that you read data in sequential chunks. This comes with a penalty. You’ve constrained how random your training samples can be from epoch to epoch. With tfrecord, you usually shuffle once when you build the the tfrecord chunks. When training, you can only shuffle again within some number of queued record files (constrained by memory), not across the whole dataset. In many situations, this isn’t likely to cause problems, but you do have to be aware for each use case, tune your record size to balance performance with how many samples you’d like to shuffle at a time.

Dataloaders and Cuda management
Pytorch cuda training speed comparison between data on Nvme vs Sata ssd
Non-blocking transfer to GPU is not working
Input numpy ndarray instead of images in a CNN
My GPU is dead while using Nvidia Apex
How to make current data loader more efficient
Why the more num_work the time for inference is amazingly longer?
Runtime slowdown related to number of datasets
Iterating batch in DataLoader takes more time after every nth iteration
Bottleneck on data loading
Dataloader caching on large datasets
Pytorch is not using GPU
Training loop takes a long time each epoch using TensorDataset
Dataloaders and performance
Torch has not attribute load_state_dict?
How to increase GPU utlization
Define iterator on Dataloader is very slow
Dataloader SuperSuper Slow
Speed up CNN pytorch
Long training time, bottleneck output shows high IO
Loading a large dataset and training them
GPU not fully used, how to optimize the code
GPU memory is in normal use, but GPU-util is 0%
Execution time does not decrease as batch size increases with GPUs
CPU maxed out on training resnext50_32x4d....while gpu not being used hence slow training
DataLoader implicitly using CPU?
Speed up image loading in CPU before transferring to GPU
Handling large 3d image dataset with DataLoader
Questions after following transfer learning tutorial
Low GPU Util with Custom Dataloader Open CV and Numpy Preprocessing
Effect of python eval() function on GPU training
Possible deadlock? Training example gets stuck
Data loading time is nearly proportional to batchsize
Data loading with massive batches?
NumpyDataset - Performance Analysis
Time/Memory keeps increasing at every iteration
Is a Dataset copied as part of Dataloader with multiple workers?
My model is is not using GPU after successful conversion though
Best practices when reading a large number of files every dataloader iteration
CPU bottlecking GPU
Training faster with single gpu
DataLoader hangs with custom DataSet
Is there any efficient way to load MS1M dataset?
How to speed up training on ImageNet
DataLoader efficiency with multiple workers
Loading data taking a lot of time
Can multiple training runs, all reading the same data on disk, slow each other down?
ZERO GPU utilization
Can't get pytorch to use my 2070 GPU
Volatile GPU util 0% with high memory usage
More gpu accelerate the speed of dataloader?
Run Pytorch on Multiple GPUs

Thanks for this beautiful explanation and digging into some bottlenecks!
It’s surely a great read for a lot of users here! :wink: