Stuck at Stream Synchronization

Some guesses:

  1. The smaller model(s) may use kernels with smaller launch bounds, allowing concurrent execution across multiple streams. I wonder if the larger models have more kernels that can only execute one at a time on the device, exposing potential deadlocks if the streams have complex dependencies. However, this seems unlikely, as we would then expect to see similar behavior with CUDA_LAUNCH_BLOCKING=1. (I assume that is what you meant, since CUDA_LAUNCH_BLOCKING=0 is the default.)
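To rule this out cleanly, note that CUDA_LAUNCH_BLOCKING must be set before the CUDA context is created, i.e. before the first `import torch` in the process; setting it afterwards has no effect. A minimal sketch (the torch import is left commented so the snippet stands alone):

```python
import os

# Must be set before importing torch, since torch initializes the CUDA
# context on first use. With launch blocking enabled, every kernel launch
# synchronizes with the host, so if the hang still reproduces it is
# probably not a kernel-concurrency/deadlock issue.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import only after the variable is set
print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

Alternatively, set the variable in the shell (`CUDA_LAUNCH_BLOCKING=1 python ...`) so there is no ordering concern at all.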

  2. I wonder if synchronizing calls, e.g. to cudaMalloc, occur with the larger models and cause problems: the caching allocator may be running out of free memory, requiring discarded tensors to be recycled and triggering new calls to cudaMalloc. Does setting PYTORCH_NO_CUDA_MEMORY_CACHING=1 in the small-model case also reproduce the hang? If so, I would guess it is related to memory usage and caching-allocator behavior.