How to prefetch data when processing with the GPU?

Same situation.

I have 12 cores, but I find that the dataloader cannot make full use of them: it only reaches about 60% of their capacity. I set num_workers to 12 and my I/O bandwidth is sufficient. For training a network like ResNet, data preprocessing costs more than 70% of the time in the whole process.


I have a computation-bound augmentation process, and I set num_workers=50.
Then there’s an obvious delay in data loading every 50 batches. It seems that each worker prepares only one batch and then sits idle until its next invocation. CPU and I/O utilization are both low.

I use the imagenet example.
Here is some logging.

Epoch: [0][0/1124]      Time 100.303 (100.303)  Data 95.814 (95.814)    Loss 2.1197 (2.1197)    Acc@1 1.562 (1.562)     Acc@2 51.562 (51.562)
Epoch: [0][10/1124]     Time 0.422 (9.584)      Data 0.011 (8.753)      Loss 2.2578 (2.1322)    Acc@1 4.297 (4.865)     Acc@2 6.250 (20.312)
Epoch: [0][20/1124]     Time 0.288 (5.191)      Data 0.078 (4.606)      Loss 2.3653 (2.1892)    Acc@1 16.406 (7.310)    Acc@2 63.281 (21.001)
Epoch: [0][30/1124]     Time 0.288 (3.604)      Data 0.071 (3.134)      Loss 2.2124 (2.1810)    Acc@1 23.047 (10.144)   Acc@2 46.094 (28.679)
Epoch: [0][40/1124]     Time 0.295 (2.795)      Data 0.094 (2.382)      Loss 2.2084 (2.1984)    Acc@1 2.344 (11.719)    Acc@2 3.906 (30.164)
Epoch: [0][50/1124]     Time 0.434 (2.312)      Data 0.098 (1.927)      Loss 2.0883 (2.1817)    Acc@1 27.344 (11.060)   Acc@2 50.391 (28.592)
Epoch: [0][60/1124]     Time 30.907 (2.497)     Data 30.285 (2.119)     Loss 2.0824 (2.1731)    Acc@1 19.531 (12.897)   Acc@2 41.406 (31.276)
Epoch: [0][70/1124]     Time 0.816 (2.326)      Data 0.102 (1.933)      Loss 2.0904 (2.1663)    Acc@1 22.656 (14.277)   Acc@2 43.359 (33.291)
Epoch: [0][80/1124]     Time 0.192 (2.094)      Data 0.065 (1.707)      Loss 2.1167 (2.1607)    Acc@1 21.094 (15.355)   Acc@2 42.188 (34.891)
Epoch: [0][90/1124]     Time 0.788 (1.904)      Data 0.168 (1.528)      Loss 2.2529 (2.1558)    Acc@1 25.781 (16.303)   Acc@2 46.484 (36.178)
Epoch: [0][100/1124]    Time 0.324 (1.749)      Data 0.212 (1.385)      Loss 2.1106 (2.1537)    Acc@1 21.484 (16.901)   Acc@2 48.438 (37.233)
Epoch: [0][110/1124]    Time 0.414 (1.633)      Data 0.002 (1.267)      Loss 2.0465 (2.1491)    Acc@1 23.438 (17.325)   Acc@2 48.438 (37.968)
Epoch: [0][120/1124]    Time 45.406 (1.906)     Data 44.589 (1.537)     Loss 2.2800 (2.1496)    Acc@1 20.703 (17.859)   Acc@2 42.578 (38.598)
Epoch: [0][130/1124]    Time 0.591 (1.824)      Data 0.007 (1.454)      Loss 2.0338 (2.1466)    Acc@1 19.141 (18.079)   Acc@2 45.703 (39.054)
Epoch: [0][140/1124]    Time 0.510 (1.765)      Data 0.184 (1.397)      Loss 2.1249 (2.1457)    Acc@1 21.875 (18.426)   Acc@2 49.609 (39.583)
Epoch: [0][150/1124]    Time 0.203 (1.670)      Data 0.004 (1.308)      Loss 2.1863 (2.1450)    Acc@1 20.703 (18.755)   Acc@2 42.188 (39.929)
Epoch: [0][160/1124]    Time 0.269 (1.589)      Data 0.084 (1.231)      Loss 2.2051 (2.1434)    Acc@1 23.828 (19.031)   Acc@2 48.047 (40.302)
Epoch: [0][170/1124]    Time 0.281 (1.520)      Data 0.077 (1.163)      Loss 2.1192 (2.1403)    Acc@1 21.875 (19.301)   Acc@2 44.531 (40.634)
Epoch: [0][180/1124]    Time 40.498 (1.675)     Data 39.783 (1.321)     Loss 2.0573 (2.1410)    Acc@1 24.609 (19.572)   Acc@2 47.656 (40.981)

I am experiencing the same issue. No matter how many data loading workers I select, there’s always a delay after their batches have been processed.

I have the same issue too.
Using tfrecord or MXNet’s .rec files to load data doesn’t have this issue.

Same situation.
I moved from TensorFlow’s Dataset API, which works like a charm.

But the DataLoader loads a batch and then waits for the model optimization step (it shouldn’t have to wait!)


I’ve spent some time working on this problem over a variety of projects. I’ve cut and pasted some past thoughts, bullets, etc from previous discussions. My background involves architecting systems that move large volumes of data from network cards to storage and then back again on request, with processing between the steps. A very similar set of concerns.

The two main constraints that usually dominate your PyTorch training performance and ability to saturate the shiny GPUs are your total CPU IPS (instructions per second) and your storage IOPS (I/O per second).

You want the CPUs to be performing preprocessing, decompression, and copying, to get the data to the GPU. You don’t want them idling or busy-waiting on thread/process synchronization primitives, I/O, etc. The easiest way to improve CPU utilization with PyTorch is to use the worker-process support built into DataLoader. The preprocessing you do in those workers should use as much native code and as little Python as possible. Use NumPy, PyTorch, OpenCV, and other libraries with efficient vectorized routines written in C/C++. Looping through your data byte by byte in Python will kill your performance, massacre the memory allocator, etc.
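As a toy illustration of the native-code point (numpy assumed available; the shapes and constants are arbitrary), here is the same normalization done as one vectorized call versus a per-element Python loop:

```python
import numpy as np

def normalize_vectorized(img, mean, std):
    # One C-level pass over the whole array.
    return (img.astype(np.float32) - mean) / std

def normalize_python_loop(img, mean, std):
    # Same math, but every element crosses the Python interpreter.
    out = np.empty(img.shape, dtype=np.float32)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = (float(img[i, j]) - mean) / std
    return out

img = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
assert np.allclose(normalize_vectorized(img, 128.0, 64.0),
                   normalize_python_loop(img, 128.0, 64.0))
```

Both produce identical results, but the vectorized version does its work in a single native pass, which is often the difference between keeping the workers fed and starving the GPU.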

With most common use cases, the preprocessing is done well enough not to be an issue. Things tend to fall apart dramatically when you hit the IOPS limit of your storage. Most simple PyTorch datasets tend to use media stored in individual files.

  • Modern filesystems are good, but when you have thousands of small files and you’re trying to move GB/s of data, reading each file individually can saturate your IOPS long before you can ever maximize GPU or CPU utilization.
  • Just opening a file by name with PIL can involve an alarming number of disk seeks (profile it)
  • Quick fix: buy an NVME SSD drive, or two.
  • SATA SSD is not necessarily going to cut it, you can saturate them with small to medium image files + default loader/dataset setups feeding multiple GPUs.
  • Magnetic drives are going to fall on their face
  • If you are stuck with certain drives or max out the best, the next move requires more advanced caching, prefetching, on-disk format – move to an index/manifest + record based bulk data (like tfrecord or RecordIO) or an efficient memory-mapped/paged in-process DB
  • I’ve leveraged LMDB successfully with PyTorch, using a custom simplification of the Python LMDB module (I have a branch with this; I didn’t document or explain what I did there or why, ask if curious)
  • Beyond an optimal number (experiment!), throwing more worker processes at the IOPS barrier WILL NOT HELP, it’ll make it worse. You’ll have more processes trying to read files at the same time, and you’ll be increasing the shared memory consumption by significant amounts for additional queuing, thus increasing the paging load on the system and possibly taking you into thrashing territory that the system may never recover from
  • Once you have saturated the IOPS of your storage or taxed the memory subsystem and entered into a thrashing situation, it won’t look like you’re doing a whole lot. There will be a lot of threads/processes (including kernel ones) basically not doing much besides waiting for IO, page faults, etc. Behaviour will usually be sporadic and bursty once you cross the line of what can be sustained by your system, much like network link utilization without flow control (queuing theory).
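To make the index/manifest + record idea concrete, here is a minimal pure-Python sketch (not tfrecord or RecordIO themselves, just the shape of the technique): many small samples packed into one file, with a manifest of offsets for random access, so each read is one seek plus one sequential chunk instead of an open-by-name per sample.

```python
import io
import struct

def write_records(f, samples):
    """Append length-prefixed records; return a manifest of (offset, length)."""
    manifest = []
    for data in samples:
        offset = f.tell()
        f.write(struct.pack("<I", len(data)))  # 4-byte little-endian length prefix
        f.write(data)
        manifest.append((offset, len(data)))
    return manifest

def read_record(f, offset):
    """Seek once, then read the record in a single sequential chunk."""
    f.seek(offset)
    (length,) = struct.unpack("<I", f.read(4))
    return f.read(length)

buf = io.BytesIO()  # stands in for a file on disk
manifest = write_records(buf, [b"img0-bytes", b"img1-bytes", b"img2-bytes"])
assert read_record(buf, manifest[1][0]) == b"img1-bytes"
```

In a real dataset you would persist the manifest alongside the record file and have `__getitem__` look up the offset by index.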

Other pointers for a fast training setup with minimal work over the defaults:

  • Employ some of the optimizations in NVIDIA’s examples: their fast_collate and prefetch loader with a GPU normalization step do help a bit.
  • I’ve seen big gains over torch.DataParallel using apex.DistributedDataParallel. Moving from ‘one main process + workers + multiple GPUs with DataParallel’ to ‘one process per GPU’ with apex (and presumably torch) DistributedDataParallel has always improved performance for me. Remember to (down)scale your worker processes per training process accordingly. The usual result is higher GPU utilization and less waiting on synchronization; the variance in batch times shrinks, with the average moving closer to the peak.
  • Use the Pillow-SIMD fork with the default PyTorch transforms, or write your own image loading and processing routines with OpenCV
  • Don’t leave pin_memory=True on by default in your code. There was a reason the PyTorch authors left it as False. I’ve run into many situations where True causes extremely negative paging/memory-subsystem impact. Try both.
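For reference, the idea behind a fast_collate-style function can be sketched as follows (using numpy arrays in place of torch tensors so the sketch is self-contained; NVIDIA’s real version builds a torch.uint8 tensor): keep the collated batch as uint8 on the CPU and defer the float cast and normalization until the data is on the GPU, so there is 4x less data to copy per batch.

```python
import numpy as np

def fast_collate_sketch(samples):
    """Collate (HWC uint8 image, int label) pairs into an NCHW uint8 batch.

    Keeping uint8 here means far less data to move to the GPU; the
    float conversion and mean/std normalization happen on-device later.
    """
    images, labels = zip(*samples)
    n, (h, w, c) = len(images), images[0].shape
    batch = np.empty((n, c, h, w), dtype=np.uint8)
    for i, img in enumerate(images):
        batch[i] = img.transpose(2, 0, 1)  # HWC -> CHW, still uint8
    return batch, np.asarray(labels, dtype=np.int64)

samples = [(np.random.randint(0, 256, (8, 8, 3), dtype=np.uint8), i) for i in range(4)]
batch, labels = fast_collate_sketch(samples)
assert batch.shape == (4, 3, 8, 8) and batch.dtype == np.uint8
```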

An observation on tfrecord/RecordIO chunking. For I/O, even flash-based, randomness is bad and sequential chunks are good; hugely so when you have to move physical disk heads. The random/shuffled nature of training is thus the worst case. When you see gains using record/chunked data, it’s largely because you read data in sequential chunks. This comes with a penalty: you’ve constrained how random your training samples can be from epoch to epoch. With tfrecord, you usually shuffle once when you build the tfrecord chunks. When training, you can only shuffle again within some number of queued record files (constrained by memory), not across the whole dataset. In many situations this isn’t likely to cause problems, but you do have to be aware of it for each use case and tune your record size to balance performance against how many samples you’d like to shuffle at a time.
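That trade-off is easy to see in code. A minimal stdlib sketch of a bounded shuffle buffer (an illustrative helper, not tfrecord’s actual implementation): data is read sequentially, but an item can only move a limited distance from where it was read.

```python
import random

def shuffle_buffer(stream, buffer_size, seed=0):
    """Yield items from a sequential stream, shuffled within a bounded buffer."""
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            # Pop a random resident; randomness is limited to the buffer.
            yield buf.pop(rng.randrange(len(buf)))
    while buf:  # drain whatever is left once the stream ends
        yield buf.pop(rng.randrange(len(buf)))

out = list(shuffle_buffer(range(10), buffer_size=4))
assert sorted(out) == list(range(10))  # nothing lost or duplicated
# An item can appear at most buffer_size - 1 positions earlier than it was read:
assert all(v - i < 4 for i, v in enumerate(out))
```

A bigger `buffer_size` means better shuffling but more memory held in flight, which is exactly the tuning knob described above.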


Thanks for this beautiful explanation and digging into some bottlenecks!
It’s surely a great read for a lot of users here! :wink:


This is such a goldmine of great thoughts and insights. Thanks for sharing!


We’ve been experimenting with a dataset which streams data from Azure Blob Storage in real time (here, in case someone is interested; a bit of a work in progress though). Blob storage is built to serve massively scalable apps, so IOPS shouldn’t be a bottleneck. So if you just have enough CPUs / lots of workers, in theory it should work even for a huge number of files. At least that was our hypothesis; we haven’t really been able to test with a large number of CPUs yet. In practice, with a handful of CPUs, we’ve had to resort to caching the data to SSD, but at least that approach is a nice compromise between first downloading the entire dataset and then training (fast) and just streaming from blob storage without caching (slow if you don’t have enough CPUs).

It would be very interesting to hear opinions about this kind of an approach from others (esp @rwightman).


@harpone Yes, various blob storages can work; it’s a common setup for Tensorflow training with Google ml-engine. Datasets can be stored in GCS or BigTable, and there are C++ ops wrapped in a Python API that tie that together with the TFRecord format and their data API. Edit: And thanks for sharing the Azure blob dataset; curious how large each file is for the scenario you’re targeting?

Cloud blob storage has its own equivalent of an IOPS limitation (roughly, for this task): requests/sec. Even though the throughput of cloud-to-cloud network transfers is really high, requesting data over a network will have much higher latency (bounded by a multiple of the RTT) than local storage. You have to design the system carefully to mitigate latency, either by increasing the size of (and reducing the number of) sequential requests, or by dispatching many async parallel requests. The easiest solution for large blocks is to use a record format. For many parallel requests, I’d probably use an efficient/scalable distributed database that can store your data natively (i.e. binary as binary, text/JSON as text/JSON).

I wouldn’t want to write the requesting and parsing code for any of the above in Python, though. You don’t actually need many CPUs if you aren’t stuck in Python; you want those CPUs to be doing more useful things. With a small pool of worker threads and a decent async (or at least non-blocking) IO subsystem, you can move and parse a lot of data efficiently in a systems language. This is hard to do effectively in Python.
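That said, even in Python a small thread pool goes a long way for network-bound fetches, since the workers spend their time blocked on I/O rather than contending for the GIL. A sketch with a hypothetical `fetch_blob` standing in for a real storage client’s GET:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_blob(key):
    # Hypothetical stand-in for a real blob-storage GET; in practice this
    # would block on the network for roughly one round trip per request.
    return b"data-for-" + key.encode()

def fetch_many(keys, max_workers=16):
    # Overlap many high-latency requests instead of issuing them serially;
    # map() preserves the input order of the results.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_blob, keys))

blobs = fetch_many([f"shard-{i}" for i in range(8)])
assert blobs[3] == b"data-for-shard-3"
```

With real network calls, total latency approaches (number of requests / max_workers) round trips instead of one round trip per request.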

This ‘Python for the interface and C/C++ for the dirty work’ paradigm is a little saddening when you realize you need to do some of the dirty work yourself. I’m quite looking forward to seeing how the Tensorflow Swift experiment works out. If Swift gains momentum as both a viable server language and an ML language, it could make some of the systems work supporting ML much less of a chore. SwiftTorch FTW? :slight_smile:


If I want to quickly achieve a bit more scale for a training project in the cloud (without writing extra code) I usually just use GCE and the direct attached SSD. Unlike AWS, in GCE you can attach a single persistent SSD to multiple instances as long as you mount read-only on all.

I use scripts to move data from long term storage to a new persistent SSD, then attach read-only to multiple training instances at the same mount point. Delete the SSD when done that experiment.


I’ve been hoping to see someone mention “SwiftTorch” for a while. This would be a dream come true. I fell in love with the language while spending some time last year working on iOS-based projects. If I get some free time in the next few months, I would like to play around with making a proof-of-concept front end for some of the torch functionality. There may be an easy way to do it by leveraging Objective-C/C++ as a bridge. My plan is to wrap the PyTorch C++ API in Objective-C++ wrapper classes, which can be exposed directly to Swift.

I forgot to mention, quite crucially, that the idea was to generalize a “FolderDataset” to the blob storage case, i.e. the data is just a bunch of JPEGs, exactly because we would prefer to keep the data as close to the original as possible for debugging and other reasons…

It would indeed be nice to have a one language solution for all… you mentioned Swift, but how about Julia?

Thanks for the advice again @rwightman! GCE’s persistent data SSD sounds really useful!

What about moving your dataset to /dev/shm on Linux? Could this help, or is it detrimental?

I’m curious if you’ve taken a look at DALI?

I’m spending some time going through the code now to try to get a sense of what’s going on. It’s computer vision specific, but in another thread about speeding up the dataloader there were claims of an 8x speedup, although how much of that is coming from doing the image preprocessing on GPU I’m not sure.

I’m working with tabular data which it doesn’t support, but I’m trying to figure out some of their caching strategies and techniques so I can work it into the dataloader I’m working on to go along with RAPIDS.AI. (I work for NVidia on the Rapids team focused on deep learning for tabular data)


DALI can partially solve your problems. The basic idea behind DALI was to move data processing to the GPU, since the CPU doesn’t scale as well as the GPU and the GPU/CPU processing-power ratio keeps increasing over time.

It provides:

  • GPU acceleration of image processing, so if the CPU is the bottleneck it will help
  • a transparent way to prefetch data: you can select how many batches ahead you want stored in the preprocessing queue, which matters a lot when processing time varies from batch to batch
  • an easy way to jump between different data storage formats without writing custom code; currently you can select between LMDB, tfrecord, RecordIO, single files, and COCO
  • moving data to the GPU in the background
  • shuffling even for sequential data formats: DALI keeps an internal buffer that is filled sequentially and then randomly sampled to form a batch

However, DALI will not:

  • help much if you are not CPU bound, i.e. your network is heavy and the GPU is already occupied, or you have a big CPU/GPU ratio. Still, you can use DALI with CPU-based operators for the flexibility, and when your network is lightweight just change the operator flavour from CPU to GPU
  • accelerate your disk access: if you are IO bound, DALI won’t help there

@Even_Oldridge @JanuszL I’m aware of DALI and familiar with it at a high level, but I have not had a chance to try it. I’ve generally been able to find a good balance of CPU vs GPU utilization on image-based projects without it. I was going to take a closer look at the data loading pipeline at some point. Definitely some goodies in there.

My philosophy is that if you can do the data preprocessing on the CPU without bumping into the limits, it’s best to do so and keep the GPU 100% busy learning. When the CPUs can no longer keep up, it seems like a good option to look at.

Where I see DALI providing a lot of value for me (in the future) is with video. H.264/H.265 isn’t feasible to decode on CPUs at the rate needed for DL, especially if you’re trying to move around in larger segments with sporadic IDR frames. Decoding on the GPU into on-card memory that can be fed directly into the NN is key there. I just haven’t hit that point yet :slight_smile:

Since we’ve got an author of DALI here, I’m curious: how much memory overhead is there to preprocess the images on the GPU, taking into consideration the ratio of original image size to NN input size? Is there a rough rule/formula? Naively, if one were to try to orchestrate this from most frameworks in Python, you’d have to copy into GPU tensors at the original images’ full size before doing any downscale + augmentation, using up a lot of valuable GPU memory. How efficient is DALI at this when it comes to doing the JPEG decode -> preprocess -> network input size pipeline?


@rwightman I totally agree that moving things to the GPU should be done only when your CPU is overworked. It all depends on the dataset you have (the bigger the images, the more data processing, decoding especially), what DL framework you are using, and how computationally intensive your network is. E.g. for RN50 we can easily saturate the GPU, while RN18 will rather starve for data.

Regarding video support, we are still at a very early stage; we have a decoder and just basic operators to show how it works and to learn what kinds of use cases people have for it. We really count on external contributions to help us extend support for such workloads.

Regarding memory, there are a lot of factors. Mostly it depends on what your pipeline looks like. DALI does not execute in place and processes data in batches, so we need an intermediate buffer between every pair of operators: for every operator n we need Xn*Yn*N bytes (N = number of samples in the batch). Some operators, like the decoder, can have big images at their output, each a different size, while others, like crop, have a fixed-size output. DALI also provides multiple buffering, so we need to add space for the output buffers as well. As a final formula, I would count the worst-case sizes of images in the batch at each operator’s output, plus prefetch-queue-depth copies of the output buffer. Some operators, like resize, also need a scratch buffer of their own. We are aware that DALI’s memory consumption is far from perfect and we want to improve it without sacrificing performance; that will be part of a more significant architecture rework.
As a side note, to improve memory consumption a bit (at least for the decoder), DALI provides ROI-based decoding (thanks to nvJPEG), so you can save some memory by decoding only the part of the image that will actually be processed later (as cropping is usually part of the processing pipeline).
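A back-of-envelope version of that accounting (the shapes and depths below are illustrative assumptions, not measured DALI numbers): sum the worst-case per-operator output buffers over the batch, then add prefetch-depth extra copies of the final output.

```python
def pipeline_buffer_bytes(batch_size, op_output_shapes, prefetch_depth):
    """Worst-case intermediate memory for a non-in-place, batched pipeline.

    op_output_shapes: worst-case (height, width, channels) per operator,
    assuming 1 byte per element (uint8). The last operator's output is
    additionally duplicated prefetch_depth times for multiple buffering.
    """
    per_op = [h * w * c * batch_size for (h, w, c) in op_output_shapes]
    return sum(per_op) + per_op[-1] * (prefetch_depth - 1)

# Hypothetical decode -> resize -> crop pipeline, uint8, batch of 256,
# double-buffered output (prefetch depth 2).
total = pipeline_buffer_bytes(
    batch_size=256,
    op_output_shapes=[(1080, 1920, 3), (256, 256, 3), (224, 224, 3)],
    prefetch_depth=2,
)
print(f"{total / 2**30:.2f} GiB")  # roughly 1.6 GiB for this example
```

Note how the decoder output dominates: this is exactly why ROI-based decoding, which shrinks the first term, saves real memory.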

Correct me if I’m wrong.

With num_workers > 1, the dataloader creates num_workers threads for data loading. Each thread maintains a pool of batch_size images. Each time, the dataloader just fetches images from a pool which is fully loaded. This causes several problems when batch_size is very large (1024, 2048, …):

  1. The memory usage is large: batch_size × num_workers × img_height × img_width × img_channels × 4 (float). It can easily be 11 GB or more: 1024 × 20 × 224 × 224 × 3 × 4 ≈ 12 GB.
  2. An obvious delay occurs every num_workers batches, just like Weifeng showed.

Maybe we could make all workers share one pool (a queue) of n × batch_size images, with all num_workers workers loading data into this pool. Each time, the dataloader just pulls batch_size images from this pool; n can be reasonably small (< num_workers) to reduce memory usage.
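The shared-pool idea can be sketched with a bounded queue that every loader feeds and the consumer drains one sample at a time (threads are used here purely for brevity; PyTorch’s actual workers are separate processes, which is what makes this design harder in practice):

```python
import queue
import threading

def loader_worker(sample_ids, pool):
    for sid in sample_ids:
        pool.put(("sample", sid))  # stand-in for a decoded/augmented image

def shared_pool_batches(all_ids, batch_size, num_workers, pool_batches=2):
    # One bounded pool shared by all workers: memory is capped at
    # pool_batches * batch_size samples, instead of one queued batch per worker.
    pool = queue.Queue(maxsize=pool_batches * batch_size)
    shards = [all_ids[i::num_workers] for i in range(num_workers)]
    workers = [threading.Thread(target=loader_worker, args=(s, pool)) for s in shards]
    for w in workers:
        w.start()
    for _ in range(len(all_ids) // batch_size):
        yield [pool.get() for _ in range(batch_size)]
    for w in workers:
        w.join()

batches = list(shared_pool_batches(list(range(64)), batch_size=16, num_workers=4))
assert len(batches) == 4 and all(len(b) == 16 for b in batches)
```

Batches now interleave samples from whichever worker finished first, so per-batch ordering depends on scheduling; the catch, of course, is that real DataLoader workers are processes rather than threads.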

BTW, multiprocessing distributed training is much faster than DataParallel in the large-batch case, even on a single node with multiple GPUs. The reason might be that each process only needs to load batch_size / gpus_per_node images.


There is a distinction that makes all the difference here: thanks to the limitations of Python, it’s not threads we’re dealing with but processes. When num_workers > 0, each contiguous batch tensor is assembled by a worker, using the collate fn, in a separate process. Python multiprocessing queues (with some tensor-handling additions) are used to get those batch tensors to the main process via shared memory.

If it were threads, and we were in a language like C++, the memory would be in the same process. With lightweight thread-sync primitives, you could have enough fine-grained control to interleave loading of samples into batches without needing as many batch tensors in flight. That doesn’t really work well (or at all) with multiple processes and the limited control you have in Python.