Torch operation right after TensorRT's execute_async_v3

I am relatively new to TensorRT, so I would like to clarify its intrinsic behaviour when combined with PyTorch. My understanding is that execute_async_v3 enqueues the kernels on the GPU and returns control to the Python interpreter immediately, so host and device can work asynchronously. Immediately after executing the engine I want to perform a PyTorch operation, e.g. torch.nn.functional.interpolate, on the output CUDA tensors. Do I need to call torch.cuda.synchronize() to ensure that the result of the engine execution is ready? Or is the PyTorch operation implicitly guaranteed to run after the engine, because it is enqueued on the same CUDA stream? I ran a small experiment with and without torch.cuda.synchronize(), and both results are correct. However, the variant without torch.cuda.synchronize() takes only 2 ms, compared to 9 ms with it.
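Roughly, this is how I timed the two variants (just a sketch; model_trt stands for the TRTModule instance shown further below, and the input shape and interpolate arguments are arbitrary placeholders):

    import time

    import torch

    x = torch.randn(1, 3, 224, 224, device="cuda")  # dummy input, shape is arbitrary

    def run_once(sync: bool) -> float:
        """Host-side wall-clock time (ms) for one forward pass plus interpolate."""
        start = time.perf_counter()
        out = model_trt(x)  # internally calls context.execute_async_v3(...)
        out = torch.nn.functional.interpolate(out, scale_factor=2, mode="bilinear")
        if sync:
            torch.cuda.synchronize()  # host waits for all queued GPU work
        return (time.perf_counter() - start) * 1000.0

    print("with sync:    %.1f ms" % run_once(sync=True))
    print("without sync: %.1f ms" % run_once(sync=False))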

In my example I am using torch2trt/torch2trt/trt_module.py at 4e820ae31b4e35d59685935223b05b2e11d47b03 · NVIDIA-AI-IOT/torch2trt · GitHub. Thanks to this excellent repo, the TensorRT engine writes its results directly into Torch tensors on CUDA. An example of what I would like to do is:

        # execute
        outputs = [None] * len(self.output_names)
        for i, output_name in enumerate(self.output_names):
            dtype = torch_dtype_from_trt(self.engine.get_tensor_dtype(output_name))
            shape = tuple(self.context.get_tensor_shape(output_name))
            device = torch_device_from_trt(self.engine.get_tensor_location(output_name))
            output = torch.empty(size=shape, dtype=dtype, device=device)
            outputs[i] = output
            self.context.set_tensor_address(output_name, output.data_ptr())

        # enqueue the engine on PyTorch's current CUDA stream
        self.context.execute_async_v3(torch.cuda.current_stream().cuda_stream)

        # here: PyTorch op on the engine output, enqueued on the same CUDA stream
        outputs[0] = torch.nn.functional.interpolate(outputs[0], **kwargs)

        if self.output_flattener is not None:
            outputs = self.output_flattener.unflatten(outputs)
        else:
            outputs = tuple(outputs)
            if len(outputs) == 1:
                outputs = outputs[0]

        return outputs

Thank you in advance!

Environment

TensorRT Version: 10.4.0 (docker: dustynv/l4t-pytorch:r36.4.0)
PyTorch Version: 2.4.0
GPU Type: Tegra
System: Jetson AGX Orin