Rewriting the CUDA caching memory allocator

Hi,
I am trying to rewrite the CUDA caching memory allocator on top of CUDAPluggableAllocator, because I suspect the current CUDA memory allocator is not a good fit for every use case. I would like to confirm whether the current allocator is really appropriate for all hardware configurations.
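
For reference, here is a minimal sketch of how a custom allocator can be registered through CUDAPluggableAllocator. The helper name `install_custom_allocator`, the library path, and the exported symbol names `my_malloc` and `my_free` are assumptions made for illustration, not something taken from my actual code. The shared library is assumed to be compiled separately and to export the two C functions described in the comments.

```python
def install_custom_allocator(so_path, alloc_name="my_malloc", free_name="my_free"):
    """Register a custom CUDA allocator loaded from a shared library.

    Assumed C signatures exported by the library at `so_path` (hypothetical):
        void* my_malloc(ssize_t size, int device, cudaStream_t stream);
        void  my_free(void* ptr, ssize_t size, int device, cudaStream_t stream);

    This has to run before PyTorch allocates any CUDA memory, otherwise
    the allocator can no longer be swapped.
    """
    # Imported lazily so that this helper can be defined without PyTorch loaded.
    import torch

    # Wrap the exported alloc/free symbols in a pluggable allocator object.
    allocator = torch.cuda.memory.CUDAPluggableAllocator(so_path, alloc_name, free_name)

    # Make it the allocator used for all subsequent CUDA allocations.
    torch.cuda.memory.change_current_allocator(allocator)
    return allocator
```

Usage would look like `install_custom_allocator("./alloc.so")` at the very start of the program, where `./alloc.so` is a placeholder path. Any caching policy (block reuse, splitting, and so on) would then have to live inside the library's alloc and free functions.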