Inference speed drops off dramatically after the first few batches

Hey y’all, I’m using PVCNN with custom weights to do semantic segmentation on point clouds. For the first 6 batches or so, inference runs quickly, but after that the speed drops considerably. I did not trust measurements via time() in Python, so I checked the profiler, and sure enough the result is consistent with what I saw.

Frustratingly, enabling use_cuda=True in the profiler makes it such that all batches infer at the same, slow pace; total inference time increases by seconds when CUDA profiling is enabled. My inference loop looks like so:

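import torch
from torch.autograd import profiler
from torch.utils.data import DataLoader

points = torch.from_numpy(my_pointclouds)
loader = DataLoader(points, batch_size=40, num_workers=0, pin_memory=True)
# Comma, not `and`: `a and b` evaluates to b, so no_grad() would never be entered
with torch.no_grad(), profiler.profile(record_shapes=True) as prof:
    for batch in loader:
        # Assumed: the argument was garbled in the post; the batch is moved to the GPU here
        outputs = model(batch.cuda(non_blocking=True))
        del batch, outputs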

The dataset should not be too large for my 8 GB GTX 1080, as it only takes up a few MB.
What could possibly be negatively affecting the inference speed of PVCNN? Why does profiling CUDA have such an impact? Thanks

The “non-CUDA” profiling might not synchronize the code, so your times and the profile might just show the kernel launch times and dispatching, while use_cuda=True would show the actual CUDA runtime.
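
To check whether the early batches only look fast because the kernel launches return before the GPU has finished, one option is to synchronize around each forward pass and time it by hand. This is only a rough sketch, not a definitive benchmark: time_batches is a hypothetical helper, and the Linear layer and random tensors are stand-ins for PVCNN and the real point clouds. It falls back to the CPU when no GPU is available.

```python
import time
import torch

def time_batches(model, batches, device):
    """Return per-batch wall-clock seconds, synchronizing around each forward pass."""
    use_cuda = device.type == "cuda"
    times = []
    with torch.no_grad():
        for batch in batches:
            if use_cuda:
                torch.cuda.synchronize(device)  # drain pending work before starting the clock
            start = time.perf_counter()
            model(batch.to(device, non_blocking=True))
            if use_cuda:
                torch.cuda.synchronize(device)  # wait until this batch's kernels have finished
            times.append(time.perf_counter() - start)
    return times

# Stand-in model and data (assumptions; substitute PVCNN and the real loader).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(3, 8).to(device).eval()
batches = [torch.randn(40, 1024, 3) for _ in range(10)]

times = time_batches(model, batches, device)
print([f"{t * 1e3:.2f} ms" for t in times])
```

If the per-batch times come out roughly equal once synchronization is in place, the apparent slowdown after the first few batches is probably a measurement artifact of asynchronous execution rather than a real change in model speed.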

I thought so too, but setting use_cuda=True actually measurably changes the running time of the script as a whole.