How to prefetch data when processing with GPU?

@rwightman I totally agree that moving things to the GPU should be done only when your CPU is overworked. It all depends on the dataset you have (the bigger the images, the more data processing, decoding especially), what DL framework you are using, and how computationally intensive your network is. For example, with RN50 we can easily saturate the GPU, while with RN18 the GPU will tend to starve for data.

Regarding video support, we are still at a very early stage: we have a decoder and just a few basic operators to show how it works and to learn what kind of use cases people have for it. We are really counting on external contributions to help us extend support for such workloads.

Regarding memory, there are a lot of factors. Mostly it depends on what your pipeline looks like. DALI does not execute in place and processes data in batches, so we need an intermediate buffer between every pair of operators. For every operator n we therefore need Xn*Yn*N (Xn by Yn being the size of its output image, N the number of samples in the batch). Some operators, like the decoder, can have big images at the output, each of a different size, while others, like crop, produce outputs of a fixed size. Also, DALI provides multiple buffering, so we need to add space for the output buffers as well. As the final formula I would count the worst-case sizes of the images in the batch at each operator output, plus as many output buffers as the prefetch queue depth. In addition, some operators, like resize, need a scratch buffer of their own. We are aware that DALI memory consumption is far from perfect and we want to improve it, but without sacrificing performance. That is why this will be part of a more significant architecture rework.
As a side note, to reduce memory consumption a bit (at least for the decoder), DALI provides ROI-based decoding (thanks to nvJPEG), so you can save some memory by decoding not the whole image but only the part of it that will be processed later (as cropping is usually part of the processing pipeline).
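
To make the estimate above concrete, here is a rough back-of-the-envelope sketch in plain Python. It does not use or query DALI; it only adds up the terms described above. The image sizes, the 3-channel uint8 layout, the `OpOutput` and `estimate_pipeline_bytes` names, and the choice to count the last operator's output once per prefetch queue slot are all assumptions made for illustration, so treat the result as an order-of-magnitude guide rather than what DALI actually allocates.

```python
from dataclasses import dataclass


@dataclass
class OpOutput:
    """Worst-case output of one operator (hypothetical helper, values are illustrative)."""
    name: str
    width: int                # X_n: worst-case output width in pixels
    height: int               # Y_n: worst-case output height in pixels
    channels: int = 3         # assumption: RGB
    bytes_per_elem: int = 1   # assumption: uint8 samples
    scratch_bytes: int = 0    # extra scratch space, e.g. for a resize operator

    def batch_bytes(self, batch_size: int) -> int:
        # X_n * Y_n * N, scaled by channel count and element size
        return (self.width * self.height * self.channels
                * self.bytes_per_elem * batch_size)


def estimate_pipeline_bytes(ops, batch_size, prefetch_queue_depth):
    """Sum worst-case intermediate buffers, scratch space, and output buffers."""
    # One intermediate buffer per operator output, plus its scratch space.
    intermediate = sum(op.batch_bytes(batch_size) + op.scratch_bytes for op in ops)
    # Assumption: the final output is buffered once per prefetch queue slot.
    outputs = prefetch_queue_depth * ops[-1].batch_bytes(batch_size)
    return intermediate + outputs


# Full decode followed by a crop: the decoder buffer dominates.
full_decode = estimate_pipeline_bytes(
    [OpOutput("decode", 1920, 1080), OpOutput("crop", 224, 224)],
    batch_size=32,
    prefetch_queue_depth=2,
)

# ROI-based decode: only the cropped region is ever materialized.
roi_decode = estimate_pipeline_bytes(
    [OpOutput("roi_decode", 224, 224)],
    batch_size=32,
    prefetch_queue_depth=2,
)

print(f"full decode + crop: {full_decode / 2**20:.1f} MiB")
print(f"ROI decode:         {roi_decode / 2**20:.1f} MiB")
```

With these made-up numbers the full-size decoder output accounts for almost all of the estimate, which is why decoding only the region of interest can help when a crop follows anyway.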