Hi PyTorch team,
Reviewing CUDACachingAllocator.cpp, I see it provides a recordStream() facility to insert the correct synchronization when an allocation is used on multiple streams. This ensures the block is not reused until each recorded stream has completed its work.
We have a production workload where we want to run with PYTORCH_NO_CUDA_MEMORY_CACHING enabled. What happens to the record_stream() functionality when PYTORCH_NO_CUDA_MEMORY_CACHING is set? Does PyTorch handle stream-aware allocation differently in that case?