Some guesses:
- The smaller model(s) may use kernels with smaller launch bounds, allowing concurrent execution of multiple streams. I wonder if the larger models have more kernels that can only execute one-by-one on the device, exposing potential deadlocks if the streams have complex dependencies. However, this seems unlikely, as we would expect to see similar behavior with `CUDA_LAUNCH_BLOCKING=1`. (I assume this is what you meant, as `CUDA_LAUNCH_BLOCKING=0` should be the default.)
- I wonder if synchronizing calls, e.g. to `cudaMalloc`, occur with the larger models and cause problems, as the caching allocator may be running out of free memory, requiring discarded Tensors to be recycled and triggering new calls to `cudaMalloc`. Does setting `PYTORCH_NO_CUDA_MEMORY_CACHING=1` in the small-model case also cause problems? If so, I would guess it is related to memory usage and caching-allocator behavior.
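If it helps, both experiments above only need the environment variable prefixed to the launch command. A minimal sketch (the inline `python -c` is a stand-in for your actual training command; it just confirms the variable reaches the child process):

```shell
# Experiment 1: rerun the large-model case with kernel launches serialized.
CUDA_LAUNCH_BLOCKING=1 python -c 'import os; print(os.environ["CUDA_LAUNCH_BLOCKING"])'

# Experiment 2: rerun the small-model case with the caching allocator disabled,
# so every allocation/free maps directly to cudaMalloc/cudaFree.
PYTORCH_NO_CUDA_MEMORY_CACHING=1 python -c 'import os; print(os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"])'
```

Note that disabling the caching allocator makes everything much slower, so it is only useful as a debugging signal, not as a workaround.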