I’ve noticed a lot of CUDA memory frees when profiling my code. From memory profiling, it looks like the output of all-reduce doesn’t get reused by the caching allocator; instead it accumulates and then gets freed.
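For anyone who wants to check for the same symptom, here is a rough sketch of one way to count the frees by diffing two `torch.cuda.memory_stats()` snapshots taken around a few iterations. The helper names (`snapshot`, `stat_delta`, `report_frees`) are made up for illustration, and I'm assuming the `segment.all.freed` and `num_alloc_retries` counters are the relevant ones for this symptom.

```python
def snapshot():
    """Take a snapshot of the caching allocator counters (requires PyTorch with CUDA)."""
    import torch  # imported lazily so the helpers below work without torch installed
    return torch.cuda.memory_stats()


def stat_delta(before, after, key):
    """Difference of one counter between two snapshots; missing keys count as 0."""
    return after.get(key, 0) - before.get(key, 0)


def report_frees(before, after):
    """Summarize how many segments were released and how often allocation was retried."""
    return {
        # Assumption: this counter tracks segments the allocator handed back to CUDA.
        "segments_freed": stat_delta(before, after, "segment.all.freed"),
        # Assumption: retries happen when the allocator flushes its cache and tries again.
        "alloc_retries": stat_delta(before, after, "num_alloc_retries"),
    }
```

Usage would be something like `before = snapshot()`, run a few training steps that include the all-reduce, then `print(report_frees(before, snapshot()))`. A `segments_freed` value that keeps growing across iterations would match what I described above.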
Yes, this env var is the correct fix. I think we want it to be the default, but it is a major change and requires some effort from someone to drive the flip.