Best place to subscribe to a CUDA callback function in PyTorch

I am analyzing an undocumented CUDA API in PyTorch, and I want a callback function to react to each kernel launch during inference/training workloads. I can retrieve the subscription function via cuGetExportTable in a simple standalone test outside PyTorch:

For example:
// Undocumented UUID identifying the callback export table (value elided)
static const CUuuid callback_funcs_id = …;
const void* subscribeFuncAddr = NULL;
// Get the function pointer from the export table
cuGetExportTable(&subscribeFuncAddr, &callback_funcs_id);
// 'subscribe' is a function pointer declared elsewhere with the same signature as the subscription function
subscribe = (typeof(subscribe))subscribeFuncAddr;
// 'res' equals 0 (CUDA_SUCCESS) if successful
res = subscribe(&my_hndl, f, NULL);

I first tested this code on some elementwise kernels, inserting it just before their launch (e.g., before this line in CUDALoops.cuh). However, when I reran my workload, the subscribe call retrieved from cuGetExportTable failed with CUDA error code 999 (cudaErrorUnknown), which suggests I did not retrieve the function properly. I suspect this is because the CUDA driver has not been initialized yet at that point.
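If uninitialized driver state is indeed the cause, one workaround I am considering is forcing initialization with cuInit before querying the export table. A minimal sketch of that idea (the UUID value is a placeholder and the subscription function's signature is an assumption, since the API is undocumented):

```cpp
#include <cuda.h>
#include <stdio.h>

// Placeholder: the real UUID of the undocumented callback table is elided here.
static const CUuuid callback_funcs_id = {{0}};

int try_subscribe() {
  // Force driver initialization first; cuGetExportTable may fail (e.g. with
  // error 999) if it is called before the driver is initialized.
  CUresult rc = cuInit(0);
  if (rc != CUDA_SUCCESS) {
    fprintf(stderr, "cuInit failed: %d\n", (int)rc);
    return -1;
  }
  const void* subscribeFuncAddr = NULL;
  rc = cuGetExportTable(&subscribeFuncAddr, &callback_funcs_id);
  if (rc != CUDA_SUCCESS || subscribeFuncAddr == NULL) {
    fprintf(stderr, "cuGetExportTable failed: %d\n", (int)rc);
    return -1;
  }
  // ... cast subscribeFuncAddr and call subscribe() as in the snippet above ...
  return 0;
}
```

This only runs on a machine with the CUDA driver installed, so I have not been able to verify it in all the places PyTorch might call it from.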

I then tried to insert my code here in CUDAStream.cpp. There I can retrieve the function callback, but when I profile the workload, subscription fails with CUPTI_ERROR_MULTIPLE_SUBSCRIBERS_NOT_SUPPORTED, presumably because CUPTI allows only one subscriber per process and the profiler already holds that subscription.
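For comparison, the documented way to react to every kernel launch would be a CUPTI callback on the launch CBIDs rather than the undocumented export table, though it collides with any other CUPTI client (such as a profiler) in exactly the same way. A rough sketch of that path, untested here:

```cpp
#include <cupti.h>
#include <stdio.h>

// Invoked by CUPTI around every API call enabled below.
static void CUPTIAPI launch_cb(void* userdata, CUpti_CallbackDomain domain,
                               CUpti_CallbackId cbid,
                               const CUpti_CallbackData* info) {
  if (domain == CUPTI_CB_DOMAIN_RUNTIME_API &&
      cbid == CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernel_v7000 &&
      info->callbackSite == CUPTI_API_ENTER) {
    printf("about to launch: %s\n", info->symbolName ? info->symbolName : "?");
  }
}

int install_launch_callback() {
  CUpti_SubscriberHandle sub;
  // Fails with CUPTI_ERROR_MULTIPLE_SUBSCRIBERS_NOT_SUPPORTED if another
  // client (e.g. a profiler) already holds the single CUPTI subscription.
  if (cuptiSubscribe(&sub, (CUpti_CallbackFunc)launch_cb, NULL) != CUPTI_SUCCESS)
    return -1;
  cuptiEnableCallback(1, sub, CUPTI_CB_DOMAIN_RUNTIME_API,
                      CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernel_v7000);
  return 0;
}
```

Since only one CUPTI subscriber can exist per process, this route is mutually exclusive with profiling the same workload, which is the same limitation I am hitting.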

Do you know the best place in PyTorch to insert this code so that it runs before each kernel launch? I would appreciate any suggestions and comments.