Using PYTORCH_CUDA_ALLOC_CONF

I am training a model and am unable to get steady-state GPU memory usage. The GPU memory utilization continuously varies between 10 GB and 19 GB.
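
For reference, this is roughly how one can log allocated vs. reserved memory per step (the helper below is just a sketch, not my actual training code); the caching allocator's reserved pool is roughly what nvidia-smi attributes to the process:

    import torch

    def log_gpu_memory(step):
        # memory_allocated(): bytes currently held by live tensors
        # memory_reserved(): bytes held in the caching allocator's pool,
        # which is roughly what nvidia-smi attributes to the process
        alloc_gib = torch.cuda.memory_allocated() / 2**30
        reserved_gib = torch.cuda.memory_reserved() / 2**30
        print(f"step {step}: allocated={alloc_gib:.2f} GiB, reserved={reserved_gib:.2f} GiB")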

Following the instructions in A guide to PyTorch’s CUDA Caching Allocator | Zach’s Blog, I am running the same code with PYTORCH_CUDA_ALLOC_CONF=roundup_power2_divisions:1 to perform more aggressive rounding, but this does not solve the issue.
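
The equivalent in-process setup looks roughly like the sketch below (I actually set the variable on the command line; the key point is that it must be in the environment before the caching allocator is initialized):

    import os
    # Must be in the environment before PyTorch initializes its CUDA caching
    # allocator; exporting it in the shell before launching the process also works.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "roundup_power2_divisions:1"

    import torch
    # With 1 division, allocation requests should be rounded up to the next
    # power-of-2 boundary (the most aggressive rounding).
    x = torch.randn(1024, 1024, device="cuda")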

I searched my entire installed torch codebase for the term roundup_power2_division and got no results. I am using torch==2.0.1, so I am wondering whether torch is even reading this environment variable. How can I verify that it is indeed using the new caching-allocator configuration?
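
(One sanity check I can think of, sketched below with a made-up option name, is to pass an invalid value and see whether the allocator rejects it; the exact error text may differ across versions.)

    import os
    # Deliberately invalid option: if PyTorch parses PYTORCH_CUDA_ALLOC_CONF,
    # the first CUDA allocation should fail with an "unrecognized option"-style error.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "definitely_not_an_option:1"

    import torch
    torch.zeros(1, device="cuda")  # expected to raise if the variable is being read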

I was able to verify that the flag is being used, because the code throws an error if you set the environment variable to some random value.
However, I’m still not able to get steady-state memory usage with PYTORCH_CUDA_ALLOC_CONF=roundup_power2_divisions:2.
roundup_power2_divisions:1 is supposed to give even more aggressive rounding, so I tried that, but I get the following error:

  File "/home/shreeshail/miniconda3/envs/amphion/lib/python3.9/site-packages/torch/nn/modules/activation.py", line 1205, in forward
    attn_output, attn_output_weights = F.multi_head_attention_forward(
  File "/home/shreeshail/miniconda3/envs/amphion/lib/python3.9/site-packages/torch/nn/functional.py", line 5343, in multi_head_attention_forward
    attn_output = torch.bmm(attn_output_weights, v)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7c88adb9e4d7 in /home/shreeshail/miniconda3/envs/amphion/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7c88adb6836b in /home/shreeshail/miniconda3/envs/amphion/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7c88bebd4fa8 in /home/shreeshail/miniconda3/envs/amphion/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::startedGPUExecutionInternal() const + 0x6c (0x7c884e8bc4dc in /home/shreeshail/miniconda3/envs/amphion/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isStarted() + 0x58 (0x7c884e8bfaf8 in /home/shreeshail/miniconda3/envs/amphion/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x448 (0x7c884e8c12d8 in /home/shreeshail/miniconda3/envs/amphion/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xd3e95 (0x7c88902f0e95 in /home/shreeshail/miniconda3/envs/amphion/bin/../lib/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7c88c5294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x126850 (0x7c88c5326850 in /lib/x86_64-linux-gnu/libc.so.6)

Thinking this was an insufficient-memory issue, I reduced the max_tokens parameter (I’m using dynamic batching), upon which I get a different error:

attn_output, attn_output_weights = F.multi_head_attention_forward(
  File "/home/shreeshail/miniconda3/envs/amphion/lib/python3.9/site-packages/torch/nn/functional.py", line 5346, in multi_head_attention_forward
    attn_output = linear(attn_output, out_proj_weight, out_proj_bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 512 n 4144 k 512 mat1_ld 512 mat2_ld 512 result_ld 512 abcType 0 computeType 77 scaleType 0