I was able to verify that the flag is being used because the code throws an error if you set the env variable to some random value.
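For reference, this is how I’m setting the allocator config before launching training (a sketch of the shell setup only; the actual training command is omitted):

```shell
# PYTORCH_CUDA_ALLOC_CONF is read by PyTorch's CUDA caching allocator.
# An unrecognized key/value here causes PyTorch to raise an error at the
# first CUDA allocation, which is how I confirmed the flag is picked up.
export PYTORCH_CUDA_ALLOC_CONF=roundup_power2_divisions:2
export CUDA_LAUNCH_BLOCKING=1   # synchronous launches, for clearer tracebacks
echo "$PYTORCH_CUDA_ALLOC_CONF"
```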
I’m not able to reach steady-state memory usage with PYTORCH_CUDA_ALLOC_CONF=roundup_power2_divisions:2. Since roundup_power2_divisions:1 is supposed to give more aggressive rounding, I tried that instead, but I get the following error:
File "/home/shreeshail/miniconda3/envs/amphion/lib/python3.9/site-packages/torch/nn/modules/activation.py", line 1205, in forward
attn_output, attn_output_weights = F.multi_head_attention_forward(
File "/home/shreeshail/miniconda3/envs/amphion/lib/python3.9/site-packages/torch/nn/functional.py", line 5343, in multi_head_attention_forward
attn_output = torch.bmm(attn_output_weights, v)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7c88adb9e4d7 in /home/shreeshail/miniconda3/envs/amphion/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7c88adb6836b in /home/shreeshail/miniconda3/envs/amphion/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7c88bebd4fa8 in /home/shreeshail/miniconda3/envs/amphion/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::startedGPUExecutionInternal() const + 0x6c (0x7c884e8bc4dc in /home/shreeshail/miniconda3/envs/amphion/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isStarted() + 0x58 (0x7c884e8bfaf8 in /home/shreeshail/miniconda3/envs/amphion/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x448 (0x7c884e8c12d8 in /home/shreeshail/miniconda3/envs/amphion/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xd3e95 (0x7c88902f0e95 in /home/shreeshail/miniconda3/envs/amphion/bin/../lib/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7c88c5294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x126850 (0x7c88c5326850 in /lib/x86_64-linux-gnu/libc.so.6)
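For context, my understanding of what roundup_power2_divisions does, as a pure-Python sketch of the documented rounding behavior (this is not the actual allocator code, and the real allocator additionally works on 512-byte-aligned block sizes):

```python
import math

def roundup_power2(size: int, divisions: int) -> int:
    """Round `size` up to the next of `divisions` equal steps inside the
    power-of-2 interval containing it (my reading of the docs, simplified)."""
    if size <= 0 or size & (size - 1) == 0:
        return size                       # powers of two are unchanged
    low = 1 << (size.bit_length() - 1)    # largest power of two <= size
    if low < divisions:
        return size                       # tiny sizes: skipped in this sketch
    step = low // divisions               # interval (low, 2*low] split into steps
    return low + math.ceil((size - low) / step) * step

# With divisions=1 every request rounds up to the next power of two,
# which should reduce fragmentation at the cost of more wasted memory.
print(roundup_power2(1300, 1))  # 2048
print(roundup_power2(1300, 2))  # 1536
```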
Thinking this was an insufficient-memory issue, I reduced the max_tokens parameter (I’m using dynamic batching), upon which I get a different error:
attn_output, attn_output_weights = F.multi_head_attention_forward(
File "/home/shreeshail/miniconda3/envs/amphion/lib/python3.9/site-packages/torch/nn/functional.py", line 5346, in multi_head_attention_forward
attn_output = linear(attn_output, out_proj_weight, out_proj_bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 512 n 4144 k 512 mat1_ld 512 mat2_ld 512 result_ld 512 abcType 0 computeType 77 scaleType 0