CUDA L2 Cache Set-Aside for Persisting Accesses Does Not Work in a PyTorch CUDA Extension

I’m trying to use the L2 cache set-aside for persisting accesses on my NVIDIA device.
My GPU is an NVIDIA RTX A6000 with 6291456 bytes (6 MB) of L2 cache, running CUDA 11.6.
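
For anyone who wants to reproduce those numbers, here is a minimal query sketch using the standard runtime API (device 0 is assumed to be the A6000):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                 // Device 0 is assumed to be the A6000
        std::printf("L2 cache size:            %d bytes\n", prop.l2CacheSize);
        std::printf("Max persisting set-aside: %d bytes\n", prop.persistingL2CacheMaxSize);
        return 0;
    }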

Here is my code snippet:

    cudaStreamAttrValue stream_attribute;                                         // Stream-level attributes data structure
    stream_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(ptr); // Global memory pointer of the persisting region (ptr defined elsewhere)
    stream_attribute.accessPolicyWindow.num_bytes = 4 * 1024 * 1024;              // Number of bytes for persisting access
    stream_attribute.accessPolicyWindow.hitRatio  = 1.0;                          // Hint for cache hit ratio
    stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // Type of access property on cache hit
    stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // Type of access property on cache miss

    // Set the attributes to a CUDA stream of type cudaStream_t
    cudaStream_t stream;
    cudaStreamCreate(&stream); // Create CUDA stream
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);
    // ...
    myKernel<<<grid, blk, 0, stream>>>();
     

Starting with CUDA 11.0, devices of compute capability 8.0 and above have the capability to influence persistence of data in the L2 cache, potentially providing higher bandwidth and lower latency accesses to global memory.
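
As I understand it from the same guide, the set-aside portion of L2 also has to be reserved at the device level before the access policy window takes effect; a minimal sketch of that step (following the programming guide's example):

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Use up to 3/4 of L2, capped at the device's persisting maximum (std::min from <algorithm>)
    size_t size = std::min(int(prop.l2CacheSize * 0.75), prop.persistingL2CacheMaxSize);
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size); // Reserve the L2 set-aside for persisting accesses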

I’m trying to use the L2 cache to accelerate my kernel.

  1. My code works well when compiled with NVCC standalone, without being integrated into PyTorch: I can see some performance improvement from using the L2 cache.
  2. But when I use PyTorch’s CUDA extension mechanism (torch.utils.cpp_extension) and pybind11 to package it as a Python module, I don’t see any acceleration in kernel performance. The code is the same apart from some tiny changes:
    torch::Tensor myApp(torch::Tensor input) {
        cudaStreamAttrValue stream_attribute;                                         // Stream-level attributes data structure
        stream_attribute.accessPolicyWindow.base_ptr  = input.data_ptr();             // Window over the input tensor's data
        stream_attribute.accessPolicyWindow.num_bytes = 4 * 1024 * 1024;              // Number of bytes for persisting access
        stream_attribute.accessPolicyWindow.hitRatio  = 1.0;                          // Hint for cache hit ratio
        stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // Type of access property on cache hit
        stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // Type of access property on cache miss

        // Set the attributes to a CUDA stream of type cudaStream_t
        cudaStream_t stream;
        cudaStreamCreate(&stream); // Create CUDA stream
        cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);
        myKernel<<<grid, blk, 0, stream>>>();

        return input; // Output handling omitted for brevity
    }
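
One thing I wasn’t sure about: PyTorch launches its own kernels on its current stream, so maybe the window should be attached to that stream instead of a freshly created one? Here is a hedged sketch of that variant (grid, blk, and myKernel are placeholders as in my snippets above; at::cuda::getCurrentCUDAStream() is from ATen):

    #include <ATen/cuda/CUDAContext.h>

    torch::Tensor myApp(torch::Tensor input) {
        // Attach the window to the stream PyTorch is already using, instead of a new one
        cudaStream_t stream = at::cuda::getCurrentCUDAStream();

        cudaStreamAttrValue stream_attribute;
        stream_attribute.accessPolicyWindow.base_ptr  = input.data_ptr();
        stream_attribute.accessPolicyWindow.num_bytes = 4 * 1024 * 1024;
        stream_attribute.accessPolicyWindow.hitRatio  = 1.0;
        stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
        stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
        cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);

        myKernel<<<grid, blk, 0, stream>>>(); // grid/blk defined as in my real code
        return input;                         // Output handling omitted for brevity
    }

Would attaching the policy window to PyTorch’s current stream like this make any difference?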

My PyTorch version is 1.13 with CUDA 11.6.
Is there any conflict between PyTorch and the L2 cache setting?
It would be helpful if you could provide some hints or suggestions.