I have a function that slices and resizes tensors: essentially it crops an image around each detected object, resizes the crop, and stacks the results into a batch. How can I parallelize this? I know I cannot do it with batched operations because the size of each slice differs, but there's no reason the crops can't run concurrently. How can I parallelize the function so the crops run at the same time on the same GPU?
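For reference, here is a minimal sketch of the serial version I'm describing (names and shapes are illustrative; the real boxes come from a detector):

```python
import torch
import torch.nn.functional as F

def crop_resize_batch(feature_map, boxes, out_size=(64, 64)):
    """feature_map: (C, H, W) tensor; boxes: iterable of (y0, y1, x0, x1) ints.
    Returns an (N, C, out_h, out_w) batch. Each crop has a different size,
    so the slice + interpolate must run once per box -- the serial bottleneck."""
    crops = []
    for y0, y1, x0, x1 in boxes:
        crop = feature_map[:, y0:y1, x0:x1]            # data-dependent slice
        crop = F.interpolate(crop.unsqueeze(0), size=out_size,
                             mode="bilinear", align_corners=False)
        crops.append(crop)
    return torch.cat(crops, dim=0)
```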
Dataloaders should be fast out of the box, provided there isn't a bottleneck somewhere else and enough workers are used.
Could you share your dataloader implementation? My guess, without seeing the code, is that if the dataloader is doing this on the GPU, "parallelization" could actually make it slower: launching many lightweight kernels on the same CUDA stream incurs more launch overhead and serialization than doing the work across separate CPU processes.
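To make the overhead point concrete, here is a hedged sketch (names assumed, not from your code) of the "one CUDA stream per crop" attempt. On CPU it degenerates to the serial loop; even on GPU, the per-crop kernels are so small that launch overhead typically dominates any overlap you gain:

```python
import contextlib
import torch
import torch.nn.functional as F

def crop_resize_streams(feature_map, boxes, out_size=(64, 64)):
    """Attempt to overlap per-crop work with one torch.cuda.Stream per box.
    feature_map: (C, H, W); boxes: iterable of (y0, y1, x0, x1) ints.
    Falls back to sequential execution when CUDA is unavailable."""
    use_cuda = feature_map.is_cuda
    streams = ([torch.cuda.Stream() for _ in boxes] if use_cuda
               else [None] * len(boxes))
    crops = [None] * len(boxes)
    for i, ((y0, y1, x0, x1), s) in enumerate(zip(boxes, streams)):
        # Each slice+interpolate is enqueued on its own stream on GPU;
        # the kernels are tiny, so this rarely beats the plain loop.
        ctx = torch.cuda.stream(s) if use_cuda else contextlib.nullcontext()
        with ctx:
            c = feature_map[:, y0:y1, x0:x1].unsqueeze(0)
            crops[i] = F.interpolate(c, size=out_size, mode="bilinear",
                                     align_corners=False)
    if use_cuda:
        torch.cuda.synchronize()  # wait for all per-crop streams
    return torch.cat(crops, dim=0)
```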
I’m not sure I understand your model pipeline here. Are you slicing some model output that was already on GPU, or is this part of a data loading step that happens before model inference?
If you truly must run a different computation on each example (which sounds like the case, since the crops have different sizes), and it has to happen on the GPU rather than as a preprocessing step, I don't think an efficient implementation will be easy: this is essentially data-dependent control flow.
Ah, I understand the confusion. I am slicing a model output that is already on the GPU, interpolating it on the GPU, and sending the resulting GPU batch to another model, also on the GPU. I am not using any sort of dataloader in my pipeline; I get GPU batches as input directly from the server. I only suggested a dataloader as a possible workaround because it does single-GPU threading, but it is not efficient. Think of it as an object detection + classification pipeline, where I need to crop and resize the detected objects and send them to a classification network.
Data-dependency is exactly my point: I cannot use batched operations because the slicing parameters are data-dependent. So I need an efficient MPSG solution.