Significant delay when accessing tensor returned from CUDA extension (MWE)

PyTorch 1.0
Python 3.6
Nvidia driver 410.78
CUDA 10.0

I have a custom extension, written in CUDA/C++, that renders the points of a mesh to an image. I have used this extension before with no performance issues. Recently, I have noticed a very significant delay when accessing any of the tensors it returns.

For example, the forward pass takes only 190 microseconds, but accessing the returned tensor in Python, whether with print or some other operation, takes 5.3 seconds.
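
A minimal sketch of how two timings like these can be taken and compared. This is not the code from the extension: `timed`, `FakeAsyncOp`, and `demo` are hypothetical names, and the background thread is only a stand-in for work that keeps running after a call has returned (CUDA kernel launches return to the host before the kernel finishes). The optional `sync` argument is any zero-argument callable that blocks until pending work is done; with PyTorch on a GPU one would presumably pass `torch.cuda.synchronize`.

```python
import threading
import time


def timed(fn, *args, sync=None):
    """Return (result, elapsed_seconds) for fn(*args).

    If sync is given, it is called before the timer starts and again
    before the timer stops, so pending asynchronous work is included.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    result = fn(*args)
    if sync is not None:
        sync()
    return result, time.perf_counter() - start


class FakeAsyncOp:
    """Hypothetical stand-in for an operation that returns before it finishes."""

    def __init__(self, work_seconds):
        self._work_seconds = work_seconds
        self._thread = None

    def launch(self):
        # Start the work in the background and return immediately.
        self._thread = threading.Thread(target=time.sleep, args=(self._work_seconds,))
        self._thread.start()
        return self

    def wait(self):
        # Block until the background work has finished (no-op if never launched).
        if self._thread is not None:
            self._thread.join()


def demo(work_seconds=0.2):
    # Unsynchronized: the "forward" timing only covers the launch,
    # and the remaining cost shows up on first access.
    op = FakeAsyncOp(work_seconds)
    _, t_launch = timed(op.launch)
    _, t_access = timed(op.wait)

    # Synchronized: the same work is attributed to the call itself.
    op2 = FakeAsyncOp(work_seconds)
    _, t_synced = timed(op2.launch, sync=op2.wait)
    return t_launch, t_access, t_synced


if __name__ == "__main__":
    t_launch, t_access, t_synced = demo()
    print(f"launch only: {t_launch:.4f} s")
    print(f"first access: {t_access:.4f} s")
    print(f"launch + sync: {t_synced:.4f} s")
```

If the synchronized timing of the forward pass turns out to be close to the 5.3 seconds seen on access, the cost would lie in the forward pass itself rather than in reading the tensor; if it stays near 190 microseconds, the delay would have to come from somewhere else.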

I am at a loss as to where to look for the source of this issue. If the delay were observed in the forward pass, I could look at the CUDA code. Since the delay occurs after the tensor is returned, I am wondering whether it is a driver issue. I have tried this on two different machines with similar configurations (see the software versions above).

Does anyone have any suggestions as to where I should begin looking?

UPDATE 2019/04/16

I have created a minimal working example that demonstrates the issue. The code is a modified version of a rasterizer that returns a binary mask indicating which vertices are visible. My code is available from the following repository:

I have included a mesh file to use, installation instructions, and example output highlighting the issue. I also added a C++ version for comparison.