CUDA graph capture error

I am trying to use CUDA graphs in my PyTorch project, but I get the error shown below. Could you please give me some help? Thanks in advance.

Traceback (most recent call last):
  File "/workspace2/Code/open-catalyst/ocpmodels/trainers/mlperf_forces_trainer.py", line 525, in train
    self._backward(loss)
  File "/workspace2/Code/open-catalyst/ocpmodels/trainers/base_trainer.py", line 611, in _backward
    loss.backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: operation would make the legacy stream depend on a capturing blocking stream
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from block at /workspace/pytorch/c10/cuda/impl/CUDAGuardImpl.h:155 (most recent call first):
frame #0: <unknown function> + 0xb3540 (0x7fb1c5f9e540 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>::operator()() const + 0x50 (0x7fb209d56ac0 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #2: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x44 (0x7fb1c5f9d344 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x7ff6e (0x7fb1c609af6e in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x82473 (0x7fb1c609d473 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x246460d (0x7fb20604960d in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x2468272 (0x7fb20604d272 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x2464b29 (0x7fb206049b29 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x5ffc341 (0x7fb209be1341 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x602589f (0x7fb209c0a89f in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #10: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0xc2b (0x7fb209bd7f03 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x53d (0x7fb209bd4687 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #12: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0xae (0x7fb209bd3fd2 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x73 (0x7fb216d9a44b in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #14: <unknown function> + 0x6015dc1 (0x7fb209bfadc1 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x6015bcb (0x7fb209bfabcb in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x60159fd (0x7fb209bfa9fd in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x60156d7 (0x7fb209bfa6d7 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #18: std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (torch::autograd::Engine::*)(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool), torch::autograd::Engine*, signed char, std::shared_ptr<torch::autograd::ReadyQueue>, bool> > >::_M_run() + 0x20 (0x7fb209bfa546 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #19: <unknown function> + 0xd6de4 (0x7fb219887de4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #20: <unknown function> + 0x8609 (0x7fb24effc609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #21: clone + 0x43 (0x7fb24edbb163 in /usr/lib/x86_64-linux-gnu/libc.so.6)


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 126, in <module>
    Runner()(config)
  File "main.py", line 66, in __call__
    self.task.run()
  File "/workspace2/Code/open-catalyst/ocpmodels/tasks/task.py", line 35, in run
    self.trainer.train(
  File "/workspace2/Code/open-catalyst/ocpmodels/trainers/mlperf_forces_trainer.py", line 525, in train
    self._backward(loss)
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/graphs.py", line 149, in __exit__
    self.cuda_graph.capture_end()
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/graphs.py", line 71, in capture_end
    super(CUDAGraph, self).capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Did you follow the instructions from the docs, e.g. the ones for DDP, in the "CUDA semantics" page of the PyTorch master documentation?
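
For reference, here is a minimal sketch of the whole-network capture pattern those docs describe: warm up on a side stream first, then capture forward, backward, and the optimizer step. Work that touches the default (legacy) stream during capture is one likely cause of the "legacy stream depend on a capturing blocking stream" error. The helper name `capture_train_step`, the `warmup_iters` parameter, and the toy model are made up for illustration and are not taken from the code in the question. For DDP the docs list extra requirements (as far as I recall, constructing the DDP wrapper inside a side-stream context and running more eager warmup iterations before capture), so check that section for your PyTorch version.

```python
import torch


def capture_train_step(model, loss_fn, optimizer, static_input, static_target,
                       warmup_iters=3):
    """Warm up on a side stream, then capture one training step in a CUDA graph.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Warmup runs on a non-default stream so that lazy initialization
    # (cuBLAS/cuDNN handles, autograd state) happens outside the capture.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(warmup_iters):
            optimizer.zero_grad(set_to_none=True)
            loss = loss_fn(model(static_input), static_target)
            loss.backward()
            optimizer.step()
    torch.cuda.current_stream().wait_stream(side)

    # Capture. Gradients are set to None beforehand so that backward()
    # allocates them from the graph's private memory pool.
    graph = torch.cuda.CUDAGraph()
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.graph(graph):
        static_loss = loss_fn(model(static_input), static_target)
        static_loss.backward()
        optimizer.step()
    return graph, static_loss


def demo():
    """Toy usage: refill the static tensors in place, then replay the graph."""
    model = torch.nn.Linear(16, 4).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.MSELoss()
    static_input = torch.randn(8, 16, device="cuda")
    static_target = torch.randn(8, 4, device="cuda")

    graph, static_loss = capture_train_step(
        model, loss_fn, optimizer, static_input, static_target
    )
    for _ in range(5):
        # New data must be copied into the captured tensors, not rebound.
        static_input.copy_(torch.randn(8, 16, device="cuda"))
        static_target.copy_(torch.randn(8, 4, device="cuda"))
        graph.replay()
    torch.cuda.synchronize()
    return static_loss.item()


if __name__ == "__main__" and torch.cuda.is_available():
    print("loss after replay:", demo())
```

Each `graph.replay()` reruns the captured kernels on whatever the static tensors currently hold, so new batches go in through `copy_` rather than by assigning new tensors.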

Thanks for your reply. After I modified my code following the instructions, it works fine!