CUDA Memory Profiling

This would indicate that the memory is indeed in use and is not just held in the cache.
I would recommend scaling down the use case and checking the memory stats in a simple example, as was discussed in e.g. this topic.
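
As a rough sketch of such a scaled-down check, and assuming you are using PyTorch (the helper name `cuda_memory_mib` and the 256 MiB tensor size are just made up for illustration), you could compare the allocated memory with the memory reserved by the caching allocator around a single allocation:

```python
import torch


def cuda_memory_mib():
    """Return (allocated, reserved) in MiB for the current CUDA device.

    Hypothetical helper for illustration; returns zeros if no GPU is available.
    """
    if not torch.cuda.is_available():
        return 0.0, 0.0
    mib = 1024 ** 2
    # allocated: memory occupied by live tensors
    # reserved: memory held by the caching allocator (live tensors + cache)
    return torch.cuda.memory_allocated() / mib, torch.cuda.memory_reserved() / mib


def main():
    print("start:             ", cuda_memory_mib())
    if not torch.cuda.is_available():
        return

    # Allocate roughly 256 MiB (arbitrary size chosen for this example).
    x = torch.empty(256, 1024, 1024, dtype=torch.uint8, device="cuda")
    print("after allocation:  ", cuda_memory_mib())

    # Deleting the tensor should lower "allocated", while "reserved"
    # usually stays the same since the block is kept in the cache.
    del x
    print("after del:         ", cuda_memory_mib())

    # Releasing the cache should lower "reserved" as well.
    torch.cuda.empty_cache()
    print("after empty_cache: ", cuda_memory_mib())


if __name__ == "__main__":
    main()
```

If the allocated value stays high after you've deleted all tensors you know about, something is most likely still holding a reference to them, so the memory is really used and not only cached.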