I am training/fine-tuning a llama-2-7b model on 2 A100 80GB GPUs with a custom RL algorithm. I am using a PyTorch nightly build with CUDA 11.8 and FSDP, following GitHub - facebookresearch/llama-recipes: Examples and recipes for Llama 2 model (a simplified sketch of my FSDP setup is included after the trace below). The model trains for a couple of steps and then crashes with the following stack trace:
[E ProcessGroupNCCL.cpp:915] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x2b6b6826f647 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x2b6b6822b8f9 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x2b6b68139588 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x2b6b156c3b90 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x2b6b156c79b8 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x24b (0x2b6b156de1db in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x2b6b156de4e8 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xb94cf (0x2b6ac83184cf in /cluster/spack/apps/linux-centos7-x86_64/gcc-4.8.5/gcc-6.3.0-sqhtfh32p5gerbkvi5hih7cfvcpmewvj/lib64/libstdc++.so.6)
frame #8: <unknown function> + 0x7ea5 (0x2b6abd860ea5 in /lib64/libpthread.so.0)
frame #9: clone + 0x6d (0x2b6abe27cb0d in /lib64/libc.so.6)
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x2b6b6826f647 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x2b6b6822b8f9 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x2b6b68139588 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x2b6b156c3b90 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x2b6b156c79b8 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x24b (0x2b6b156de1db in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x2b6b156de4e8 in /cluster/project/sachan/kushal/llenv/lib64/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xb94cf (0x2b6ac83184cf in /cluster/spack/apps/linux-centos7-x86_64/gcc-4.8.5/gcc-6.3.0-sqhtfh32p5gerbkvi5hih7cfvcpmewvj/lib64/libstdc++.so.6)
frame #8: <unknown function> + 0x7ea5 (0x2b6abd860ea5 in /lib64/libpthread.so.0)
frame #9: clone + 0x6d (0x2b6abe27cb0d in /lib64/libc.so.6)
Fatal Python error: Aborted
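For context, the model wrapping roughly follows the llama-recipes FSDP example. A simplified sketch of what my setup does (the auto-wrap policy, dtypes, and model name below are assumptions based on that repo, not my exact script):

```python
import os
from functools import partial

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import LlamaForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer


def setup_fsdp_model(model_name: str = "meta-llama/Llama-2-7b-hf"):
    # One process per GPU, launched with torchrun --nproc_per_node=2.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group("nccl")
    torch.cuda.set_device(local_rank)

    model = LlamaForCausalLM.from_pretrained(model_name)

    # Wrap each decoder layer as its own FSDP unit, as llama-recipes does.
    wrap_policy = partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={LlamaDecoderLayer},
    )
    return FSDP(
        model,
        auto_wrap_policy=wrap_policy,
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
            buffer_dtype=torch.bfloat16,
        ),
        device_id=torch.cuda.current_device(),
    )
```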
Things I have tried (based on @ptrblck's suggestions in other similar posts):
- Running the model with CUDA_LAUNCH_BLOCKING=1; the stack trace remains identical.
- Running under compute-sanitizer, which did not help either.
- Setting other debug environment variables such as NCCL_DEBUG=INFO, but the trace remains the same (see the snippet below for how I set these).
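For completeness, this is roughly how the debug flags are set (a simplified illustration; in the actual runs they are exported in the shell before launching the script):

```python
# Simplified illustration of the debug flags; they must be set before
# CUDA / NCCL are initialized, so in practice I export them before launch.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # serialize kernel launches for clearer errors
os.environ["NCCL_DEBUG"] = "INFO"         # verbose NCCL logging

import torch  # noqa: E402  (imported only after setting the env vars)
```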
Is there something else I can do to debug and find the error? I print intermediate information from my code to the terminal and find that both of my processes seem to be running the model independently:
generated samples
generated samples
att_mask device: cuda:1
att_mask device: cuda:0
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0][1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
tensor(0., device='cuda:1', grad_fn=<AddBackward0>)
tensor(0., device='cuda:1', grad_fn=<AddBackward0>)
Calculated RL loss
tensor(6.6880, device='cuda:0', grad_fn=<AddBackward0>)
tensor(6.6880, device='cuda:0', grad_fn=<AddBackward0>)
This is something I did not think FSDP does, since it shards the model across the 2 processes rather than replicating a full copy on each GPU. Am I missing some synchronization code in my script? Any ideas on how to track this down further would be helpful.
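For reference, the relevant part of my training step is roughly the following (heavily simplified; the generation phase is omitted and the RL loss is replaced by a placeholder, so the names here are not my actual code):

```python
import torch


def train_step(model, optimizer, samples, att_mask):
    # One simplified RL update on this rank; `samples` and `att_mask`
    # come from an earlier per-rank generation phase (the "generated
    # samples" prints in the log above).
    print("att_mask device:", att_mask.device)

    # Forward pass through the FSDP-wrapped model.
    logits = model(input_ids=samples, attention_mask=att_mask).logits

    # Placeholder for my custom RL loss over the sampled sequences.
    loss = logits.float().mean()
    print("Calculated RL loss")
    print(loss)

    optimizer.zero_grad()
    loss.backward()  # FSDP runs its gradient reduce-scatter during backward
    optimizer.step()
    return loss.detach()
```

Each rank runs this on its own batch; I assumed FSDP would handle all cross-rank communication inside forward and backward, so there is no explicit synchronization anywhere in my script.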
Thanks!