Parallelize tensor slicing over a batch at inference time on one GPU

I have a function that slices and resizes tensors, essentially cropping an image around an object, resizing the crop, and putting it in a batch. I know I cannot do this with batched operations because the size of the slice differs per image, but there's no reason the slices can't be processed in parallel. How can I parallelize the function so that the crops run at the same time on the same GPU?

This is the function:

import torch
import torch.nn.functional as F


def slice_and_resize(imgs, center, diff):
    # imgs: (B, H, W, C) channels-last batch already on the GPU
    # center: (B, 2) crop centers (x, y); diff: (B,) half side length of each crop
    B, h, w, _ = imgs.shape
    cropped_imgs = torch.zeros((B, 3, 224, 224), device=imgs.device)

    # TODO: parallelize slicing
    for i in range(B):
        xc, yc = center[i]
        diff_i = diff[i]
        if diff_i == 0:
            continue  # keep the all-zero placeholder for this slot
        # clamp the crop window to the image boundaries
        x1 = int(torch.clamp(xc - diff_i, min=0))
        x2 = int(torch.clamp(xc + diff_i, max=w))
        y1 = int(torch.clamp(yc - diff_i, min=0))
        y2 = int(torch.clamp(yc + diff_i, max=h))
        cropped_img = imgs[i, y1:y2, x1:x2, :]
        # HWC -> 1CHW, then resize the variable-sized crop to 224x224
        cropped_imgs[i] = F.interpolate(
            cropped_img.permute(2, 0, 1).unsqueeze(0), (224, 224)
        )
    return cropped_imgs

Of note: this function is called at inference time, so I run it under torch.no_grad().
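For context, a call site consistent with the function above might look like the following; the batch size, values, and variable names here are made up for illustration, not taken from the real pipeline:

import torch

# illustrative only: a channels-last GPU batch plus per-image crop parameters
imgs = torch.rand(12, 480, 640, 3, device="cuda")            # B x H x W x C
center = torch.tensor([[320.0, 240.0]] * 12, device="cuda")  # (x, y) center per image
diff = torch.full((12,), 50.0, device="cuda")                 # half crop size per image

with torch.no_grad():
    crops = slice_and_resize(imgs, center, diff)              # 12 x 3 x 224 x 224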

This could potentially be done with a dataloader, but that seems like a bad workaround.

Update: I tried a dataloader and it was slower than using a loop.

Dataloaders should be faster out of the box, provided there isn't a bottleneck somewhere else and a sufficient number of workers is used.
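For what it's worth, a minimal sketch of that kind of setup might look like the following; the Dataset, the worker count, and the decision to do the crop/resize on CPU inside the workers are all assumptions for illustration, not your actual implementation:

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset


class CropDataset(Dataset):
    # hypothetical Dataset: each item is one CPU image plus its crop parameters
    def __init__(self, imgs, center, diff):
        self.imgs, self.center, self.diff = imgs, center, diff

    def __len__(self):
        return len(self.imgs)

    def __getitem__(self, i):
        img, (xc, yc), d = self.imgs[i], self.center[i], self.diff[i]
        if d == 0:
            return torch.zeros(3, 224, 224)
        h, w, _ = img.shape
        y1, y2 = int(max(yc - d, 0)), int(min(yc + d, h))
        x1, x2 = int(max(xc - d, 0)), int(min(xc + d, w))
        crop = img[y1:y2, x1:x2, :].permute(2, 0, 1).unsqueeze(0)
        return F.interpolate(crop, (224, 224)).squeeze(0)


loader = DataLoader(CropDataset(imgs.cpu(), center.cpu(), diff.cpu()),
                    batch_size=12, num_workers=4)
crops = next(iter(loader)).cuda()  # one batch of resized crops back on the GPU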

Could you share your dataloader implementation? My guess, without seeing the code, is that if the dataloader is doing this on the GPU, the "parallelization" could actually be slower, because attempting to run many lightweight kernels on the same CUDA stream incurs more overhead/serialization than doing the work across different processes on the CPU.
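To make the stream point concrete, one thing that is sometimes tried is issuing each per-image crop/resize on its own CUDA stream; a sketch of that idea (reusing the hypothetical shapes above) would be roughly:

import torch
import torch.nn.functional as F


def slice_and_resize_streams(imgs, center, diff):
    # same per-image work as slice_and_resize, but each crop/resize is issued
    # on its own CUDA stream in the hope of overlapping the small kernels
    B, h, w, _ = imgs.shape
    out = torch.zeros((B, 3, 224, 224), device=imgs.device)
    streams = [torch.cuda.Stream() for _ in range(B)]
    current = torch.cuda.current_stream()
    for i in range(B):
        streams[i].wait_stream(current)  # make sure imgs/out are ready on this stream
        with torch.cuda.stream(streams[i]):
            d = diff[i]
            if d == 0:
                continue
            xc, yc = center[i]
            x1, x2 = int(torch.clamp(xc - d, min=0)), int(torch.clamp(xc + d, max=w))
            y1, y2 = int(torch.clamp(yc - d, min=0)), int(torch.clamp(yc + d, max=h))
            crop = imgs[i, y1:y2, x1:x2, :].permute(2, 0, 1).unsqueeze(0)
            out[i] = F.interpolate(crop, (224, 224))
    torch.cuda.synchronize()  # wait for all side streams before using `out`
    return out

In practice the int() calls used for indexing already force a host synchronization and the individual kernels are tiny, so this kind of stream splitting often ends up no faster than the plain loop.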

Sending a tensor to the CPU and back to the GPU takes too much time (a batch of images of shape 12x480x640x3 takes 31 ms to go to the CPU and back to the GPU).

Considering this slicing function takes ~1 ms per image, I can't afford the round trip to/from the GPU. I need to run the crops in parallel on one GPU.
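(For reference, one way to measure that roundtrip while accounting for CUDA's asynchronous execution; the shape is just the one quoted above:)

import torch

imgs = torch.rand(12, 480, 640, 3, device="cuda")  # the batch shape quoted above

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
roundtrip = imgs.cpu().cuda()  # GPU -> CPU -> GPU copy
end.record()
torch.cuda.synchronize()
print(f"roundtrip took {start.elapsed_time(end):.1f} ms")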

Essentially, I'm asking: "What is the best way to efficiently do multiprocessing on a single GPU in PyTorch at inference time?"

I’m not sure I understand your model pipeline here. Are you slicing some model output that was already on GPU, or is this part of a data loading step that happens before model inference?

If you truly must run a different computation on each example (which sounds like the case here, since the crops have different sizes), and this has to happen on the GPU rather than as a preprocessing step, I don't think it will be easy to get an efficient implementation, because this is essentially data-dependent control flow.

Ah, I understand the confusion. I am slicing a model output that is already on the GPU, interpolating it on the GPU, and sending that GPU batch to another model on the GPU. I am not using any sort of dataloader in my pipeline; I get GPU batches as input directly from the server. I only suggested a dataloader as a possible workaround because it does single-GPU threading, but it is not efficient. Think of it as an object detection + classification pipeline, where I need to crop & resize the detected objects and send them to a classification network.
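Schematically, the pipeline is something like the sketch below; the server-input helper, detector, and classifier names are placeholders, not real functions from the codebase:

import torch

with torch.no_grad():
    imgs = get_gpu_batch_from_server()            # hypothetical source: B x H x W x C, already on GPU
    center, diff = detect_objects(imgs)           # hypothetical detector output: per-image crop params
    crops = slice_and_resize(imgs, center, diff)  # the loop in question: B x 3 x 224 x 224
    preds = classifier(crops)                     # hypothetical second network consumes the GPU batch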

Data dependency is exactly my point: I cannot use batched operations because the slicing parameters are data-dependent. So I need an efficient MPSG (multiprocessing on a single GPU) solution.