How to prefetch data while processing on the GPU?

Hi everyone,

I’m new to PyTorch/Torch. Is there any demo code for prefetching data with another process while the GPU is doing computation?

Thank you very much.

Best,
Yikang

4 Likes

PyTorch provides data loaders for some common vision datasets.

You could also follow what’s in https://github.com/pytorch/pytorch/tree/master/torch/utils/data, and build your own.

Got it! Thank you very much!

To complement @ruotianluo’s answer, here you can find an example usage of data loaders. Also keep in mind that they work with all kinds of datasets, not only vision! (A Dataset is just an object that defines two methods, as shown here.)
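
For reference, those two methods are __len__ and __getitem__. Here is a minimal sketch; the tensor shapes and names are purely illustrative, not from the thread:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    """A Dataset only has to implement __len__ and __getitem__."""
    def __init__(self, inputs, targets):
        self.inputs = inputs      # e.g. a float tensor of shape (N, ...)
        self.targets = targets    # e.g. a long tensor of shape (N,)

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

# toy data purely for illustration
dataset = MyDataset(torch.randn(1000, 3, 32, 32), torch.randint(0, 10, (1000,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=2)
for inputs, targets in loader:
    pass  # training step goes here
```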

1 Like

@apaszke I wonder if there’s a prefetch option on the GPU in PyTorch, just as in Caffe https://github.com/BVLC/caffe/blob/master/src/caffe/proto/caffe.proto#L670-L672. For example, my model is small, and I could trade GPU memory for fewer workers or a slower HDD. Is this possible in PyTorch? This is especially an issue when I train multiple small models in PyTorch on separate cards of a single machine, since I run out of workers.

1 Like

There isn’t a prefetch option, but you can write a custom Dataset that just loads the entire dataset onto the GPU and returns samples from memory. In that case you can use num_workers=0 in your DataLoader.
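
A rough sketch of that idea, assuming the whole dataset fits in GPU memory (the names and shapes here are made up):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class GPUTensorDataset(Dataset):
    """Keeps every sample resident on the GPU; indexing returns GPU tensors directly."""
    def __init__(self, data, targets, device="cuda"):
        self.data = data.to(device)        # one-time host-to-device copy
        self.targets = targets.to(device)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.targets[idx]

# num_workers stays 0: CUDA tensors shouldn't be handed across worker processes
dataset = GPUTensorDataset(torch.randn(10000, 3, 32, 32), torch.randint(0, 10, (10000,)))
loader = DataLoader(dataset, batch_size=256, shuffle=True, num_workers=0)
```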

1 Like

Thanks. I want a prefetching option for datasets like ImageNet, and it should work with multiple GPUs. The approach you proposed would not work in that case.

1 Like

If there is no prefetching, the GPUs will not be fully occupied because they wait for data loading and preprocessing. Is this understanding correct?

1 Like

We already have prefetching (see the imagenet or dcgan examples), but we don’t prefetch directly onto the GPU. We prefetch onto the CPU, do data augmentation there, and then put the mini-batch into CUDA pinned memory (on the CPU) so that the GPU transfer is very fast. Then we hand the data to the network, which transfers it to the GPU and trains.
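
In code, that pipeline looks roughly like the sketch below; train_dataset, model, criterion, and optimizer are placeholders:

```python
import torch
from torch.utils.data import DataLoader

# train_dataset, model, criterion, and optimizer are assumed to be defined elsewhere.
# Workers prefetch and augment on the CPU; pin_memory=True places each batch in
# page-locked (pinned) RAM so the copy to the GPU is fast.
loader = DataLoader(train_dataset, batch_size=256, shuffle=True,
                    num_workers=4, pin_memory=True)

for images, targets in loader:
    # with a pinned source, non_blocking=True lets the copy overlap other work
    images = images.cuda(non_blocking=True)
    targets = targets.cuda(non_blocking=True)
    output = model(images)
    loss = criterion(output, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```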

21 Likes

Using prefetching seems to decrease speed in my case. I can run ~100 examples/second with num_workers = 0, but only ~50 examples/second with num_workers = 1. The more workers, the slower it gets. Is there any reason for this?

4 Likes

Same situation in my code. The DataLoader can’t make full use of the CPUs.

2 Likes

Please let me know how I can make full use of my CPU. I have already tried overclocking my GPU.

1 Like

Is data augmentation done in a single thread even when data loader is multi-threaded with num_workers > 2?

2 Likes

Same situation.

I have 12 cores, and I find that the DataLoader cannot make full use of them; it only reaches about 60% of their capacity. I set num_workers to 12 and my I/O bandwidth is sufficient. For training a network like ResNet, data preprocessing takes more than 70% of the total time.

1 Like

I have a computation-bound augmentation process, and I set num_workers=50.
There is then an obvious delay in data loading every 50 batches. It seems that each worker prepares only one batch and then sits idle until the next invocation. CPU and I/O utilization are both low.

I use the imagenet example.
Here is some logging.

Epoch: [0][0/1124]      Time 100.303 (100.303)  Data 95.814 (95.814)    Loss 2.1197 (2.1197)    Acc@1 1.562 (1.562)     Acc@2 51.562 (51.562)
Epoch: [0][10/1124]     Time 0.422 (9.584)      Data 0.011 (8.753)      Loss 2.2578 (2.1322)    Acc@1 4.297 (4.865)     Acc@2 6.250 (20.312)
Epoch: [0][20/1124]     Time 0.288 (5.191)      Data 0.078 (4.606)      Loss 2.3653 (2.1892)    Acc@1 16.406 (7.310)    Acc@2 63.281 (21.001)
Epoch: [0][30/1124]     Time 0.288 (3.604)      Data 0.071 (3.134)      Loss 2.2124 (2.1810)    Acc@1 23.047 (10.144)   Acc@2 46.094 (28.679)
Epoch: [0][40/1124]     Time 0.295 (2.795)      Data 0.094 (2.382)      Loss 2.2084 (2.1984)    Acc@1 2.344 (11.719)    Acc@2 3.906 (30.164)
Epoch: [0][50/1124]     Time 0.434 (2.312)      Data 0.098 (1.927)      Loss 2.0883 (2.1817)    Acc@1 27.344 (11.060)   Acc@2 50.391 (28.592)
Epoch: [0][60/1124]     Time 30.907 (2.497)     Data 30.285 (2.119)     Loss 2.0824 (2.1731)    Acc@1 19.531 (12.897)   Acc@2 41.406 (31.276)
Epoch: [0][70/1124]     Time 0.816 (2.326)      Data 0.102 (1.933)      Loss 2.0904 (2.1663)    Acc@1 22.656 (14.277)   Acc@2 43.359 (33.291)
Epoch: [0][80/1124]     Time 0.192 (2.094)      Data 0.065 (1.707)      Loss 2.1167 (2.1607)    Acc@1 21.094 (15.355)   Acc@2 42.188 (34.891)
Epoch: [0][90/1124]     Time 0.788 (1.904)      Data 0.168 (1.528)      Loss 2.2529 (2.1558)    Acc@1 25.781 (16.303)   Acc@2 46.484 (36.178)
Epoch: [0][100/1124]    Time 0.324 (1.749)      Data 0.212 (1.385)      Loss 2.1106 (2.1537)    Acc@1 21.484 (16.901)   Acc@2 48.438 (37.233)
Epoch: [0][110/1124]    Time 0.414 (1.633)      Data 0.002 (1.267)      Loss 2.0465 (2.1491)    Acc@1 23.438 (17.325)   Acc@2 48.438 (37.968)
Epoch: [0][120/1124]    Time 45.406 (1.906)     Data 44.589 (1.537)     Loss 2.2800 (2.1496)    Acc@1 20.703 (17.859)   Acc@2 42.578 (38.598)
Epoch: [0][130/1124]    Time 0.591 (1.824)      Data 0.007 (1.454)      Loss 2.0338 (2.1466)    Acc@1 19.141 (18.079)   Acc@2 45.703 (39.054)
Epoch: [0][140/1124]    Time 0.510 (1.765)      Data 0.184 (1.397)      Loss 2.1249 (2.1457)    Acc@1 21.875 (18.426)   Acc@2 49.609 (39.583)
Epoch: [0][150/1124]    Time 0.203 (1.670)      Data 0.004 (1.308)      Loss 2.1863 (2.1450)    Acc@1 20.703 (18.755)   Acc@2 42.188 (39.929)
Epoch: [0][160/1124]    Time 0.269 (1.589)      Data 0.084 (1.231)      Loss 2.2051 (2.1434)    Acc@1 23.828 (19.031)   Acc@2 48.047 (40.302)
Epoch: [0][170/1124]    Time 0.281 (1.520)      Data 0.077 (1.163)      Loss 2.1192 (2.1403)    Acc@1 21.875 (19.301)   Acc@2 44.531 (40.634)
Epoch: [0][180/1124]    Time 40.498 (1.675)     Data 39.783 (1.321)     Loss 2.0573 (2.1410)    Acc@1 24.609 (19.572)   Acc@2 47.656 (40.981)
7 Likes

I am experiencing the same issue. No matter how many data loading workers I select, there’s always a delay after their batches have been processed.

I have the same issue too.
Loading data from TFRecord or MXNet’s .rec files doesn’t have this issue.

Same situation.
I moved from TensorFlow’s Dataset API, which works like a charm.

But the DataLoader loads a batch and then waits for the model optimization step (and it shouldn’t have to wait!)

4 Likes

I’ve spent some time working on this problem across a variety of projects, and I’ve cut and pasted some past thoughts, bullet points, etc. from previous discussions. My background involves architecting systems that move large volumes of data from network cards to storage and then back again on request, with processing between the steps: a very similar set of concerns.

The two main constraints that usually dominate your PyTorch training performance and your ability to saturate the shiny GPUs are your total CPU IPS (instructions per second) and your storage IOPS (I/O operations per second).

You want the CPUs to be performing preprocessing, decompression, and copying to get the data to the GPU. You don’t want them idling or busy-waiting on thread/process synchronization primitives, IO, etc. The easiest way to improve CPU utilization with PyTorch is to use the worker process support built into DataLoader. The preprocessing you do in those workers should use as much native code and as little Python as possible. Use NumPy, PyTorch, OpenCV, and other libraries with efficient vectorized routines written in C/C++. Looping through your data byte by byte in Python will kill your performance, massacre the memory allocator, etc.
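
As a toy illustration of the native-code point (not from the original post): the same per-image normalization written as a per-pixel Python loop versus a vectorized NumPy expression:

```python
import numpy as np

img = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_slow(img):
    # per-pixel Python loop: ~150k interpreter iterations per image
    out = np.empty(img.shape, dtype=np.float32)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            for c in range(img.shape[2]):
                out[y, x, c] = (img[y, x, c] / 255.0 - mean[c]) / std[c]
    return out

def normalize_fast(img):
    # the same math as whole-array operations executed in C
    return (img.astype(np.float32) / 255.0 - mean) / std
```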

For most common use cases the preprocessing is efficient enough not to be an issue. Things tend to fall apart dramatically when you hit the IOPS limit of your storage. Most simple PyTorch datasets use media stored in individual files.

  • Modern filesystems are good, but when you have thousands of small files and you’re trying to move GB/s of data, reading each file individually can saturate your IOPS long before you can ever maximize GPU or CPU utilization.
  • Just opening a file by name with PIL can involve an alarming number of disk seeks (profile it).
  • Quick fix: buy an NVMe SSD, or two.
  • A SATA SSD is not necessarily going to cut it; you can saturate them with small-to-medium image files and default loader/dataset setups feeding multiple GPUs.
  • Magnetic drives are going to fall on their face.
  • If you are stuck with certain drives or have maxed out the best, the next move requires more advanced caching, prefetching, or a different on-disk format: move to an index/manifest plus record-based bulk data (like TFRecord or RecordIO) or an efficient memory-mapped/paged in-process DB.
  • I’ve leveraged LMDB successfully with PyTorch and a custom simplification of the Python LMDB module (my branch is at https://github.com/rwightman/py-lmdb/tree/rw). I didn’t document or explain what I did there or why; ask if curious. A rough sketch of the LMDB-backed Dataset idea follows after this list.
  • Beyond an optimal number (experiment!), throwing more worker processes at the IOPS barrier WILL NOT HELP, it’ll make it worse. You’ll have more processes trying to read files at the same time, and you’ll be increasing the shared memory consumption by significant amounts for additional queuing, thus increasing the paging load on the system and possibly taking you into thrashing territory that the system may never recover from
  • Once you have saturated the IOPS of your storage or taxed the memory subsystem and entered into a thrashing situation, it won’t look like you’re doing a whole lot. There will be a lot of threads/processes (including kernel ones) basically not doing much besides waiting for IO, page faults, etc. Behaviour will usually be sporadic and bursty once you cross the line of what can be sustained by your system, much like network link utilization without flow control (queuing theory).
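
As mentioned in the LMDB bullet above, here is a minimal sketch of an LMDB-backed Dataset using the standard lmdb Python package (not my branch); the key scheme and pickle serialization are assumptions about how the database was written:

```python
import pickle
import lmdb
from torch.utils.data import Dataset

class LMDBDataset(Dataset):
    """Reads samples from one big LMDB file instead of thousands of small files.
    Assumes records were written with keys b'0', b'1', ... and pickled (image, label) values."""
    def __init__(self, lmdb_path, transform=None):
        # readahead/meminit off tends to behave better for random access patterns
        self.env = lmdb.open(lmdb_path, readonly=True, lock=False,
                             readahead=False, meminit=False)
        with self.env.begin(write=False) as txn:
            self.length = txn.stat()['entries']
        self.transform = transform

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        with self.env.begin(write=False) as txn:
            image, label = pickle.loads(txn.get(str(idx).encode()))
        if self.transform is not None:
            image = self.transform(image)
        return image, label
```

One caveat: an LMDB environment opened before the DataLoader workers fork may need to be re-opened lazily inside each worker.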

Other pointers for a fast training setup with minimal work over the defaults:

  • Employ some of the optimizations in NVIDIA’s examples (https://github.com/NVIDIA/apex/tree/master/examples/imagenet). NVIDIA’s fast_collate and prefetch loader with a GPU normalization step do help a bit (a simplified sketch of that prefetcher pattern follows after this list).
  • I’ve seen big gains using apex.DistributedDataParallel over torch.DataParallel. Moving from ‘one main process + worker processes + multiple GPUs with DataParallel’ to ‘one process per GPU with apex (and presumably torch) DistributedDataParallel’ has always improved performance for me. Remember to (down)scale your worker processes per training process accordingly. The result is usually higher GPU utilization and less waiting on synchronization; the variance in batch times shrinks, with the average moving closer to the peak.
  • Use the SIMD fork of Pillow with the default PyTorch transforms, or write your own OpenCV image processing and loading routines.
  • Don’t leave pin_memory=True on by default in your DataLoader code. There is a reason the PyTorch authors left it as False; I’ve run into many situations where True causes an extremely negative paging/memory-subsystem impact. Try both.
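
For reference, the prefetch-loader idea from the NVIDIA example mentioned above boils down to roughly this simplified sketch; it assumes the wrapped DataLoader was created with pin_memory=True:

```python
import torch

class DataPrefetcher:
    """Overlaps the host-to-device copy of the next batch with compute on the
    current one, using a side CUDA stream (simplified from apex's imagenet example)."""
    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()
        self._preload()

    def _preload(self):
        try:
            self.next_input, self.next_target = next(self.loader)
        except StopIteration:
            self.next_input = self.next_target = None
            return
        with torch.cuda.stream(self.stream):
            # async copies; only truly asynchronous when the source is pinned
            self.next_input = self.next_input.cuda(non_blocking=True)
            self.next_target = self.next_target.cuda(non_blocking=True)

    def next(self):
        # make the current stream wait until the copy stream has finished
        torch.cuda.current_stream().wait_stream(self.stream)
        batch = (self.next_input, self.next_target)
        self._preload()
        return batch  # (None, None) signals the end of the epoch
```

The real apex version also folds the normalization (and uint8-to-float conversion from fast_collate) into this step on the GPU.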

An observation on the tfrecord/recordio chunking: for IO, even flash-based, randomness is bad and sequential chunks are good, hugely so when you have to move physical disk heads. The random/shuffled nature of training is thus the worst case. When you see gains using record/chunked data, it’s largely because you read the data in sequential chunks. This comes with a penalty: you’ve constrained how random your training samples can be from epoch to epoch. With tfrecord, you usually shuffle once, when you build the tfrecord chunks. When training, you can only shuffle again within some number of queued record files (constrained by memory), not across the whole dataset. In many situations this isn’t likely to cause problems, but you do have to be aware of it for each use case and tune your record size to balance performance against how many samples you’d like to shuffle at a time. A generic shuffle-buffer sketch follows below.
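
To make that shuffle-window tradeoff concrete, here is a small buffered-shuffle sketch (not tied to any particular record format); randomness is limited to whatever fits in the buffer:

```python
import random

def buffered_shuffle(records, buffer_size=10000):
    """Yields records in pseudo-random order while holding at most
    buffer_size of them in memory; shuffling only happens within that window."""
    buf = []
    for record in records:
        buf.append(record)
        if len(buf) >= buffer_size:
            i = random.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]   # swap a random element to the end
            yield buf.pop()
    random.shuffle(buf)   # drain whatever is left at the end of the stream
    yield from buf
```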

146 Likes

Thanks for this beautiful explanation and digging into some bottlenecks!
It’s surely a great read for a lot of users here! :wink:

9 Likes