Hello. I'm training RWKV and have been struggling with a "CUDA: illegal memory access" error, but now a different, strange error message is showing up from somewhere else. Here is the full log:
sh scripts/train/rwkv3b_pretrain.sh
Current working directory: /home/gpuadmin/Desktop/RWKV/MK1
INFO:pytorch_lightning.utilities.rank_zero:########## work in progress ##########
[W1218 22:04:05.285517191 Module.cpp:178] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
[2024-12-18 22:04:05,746] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Using /home/gpuadmin/.cache/torch_extensions/py310_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/gpuadmin/.cache/torch_extensions/py310_cu124/wkv6state/build.ninja...
Building extension module wkv6state...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv6state...
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
True
INFO:pytorch_lightning.strategies.deepspeed:initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/4
[W1218 22:04:19.898821787 Module.cpp:178] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
[W1218 22:04:19.906643190 Module.cpp:178] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
[W1218 22:04:19.111964661 Module.cpp:178] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
[2024-12-18 22:04:19,365] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-18 22:04:19,367] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-18 22:04:19,585] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Using /home/gpuadmin/.cache/torch_extensions/py310_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/gpuadmin/.cache/torch_extensions/py310_cu124/wkv6state/build.ninja...
Building extension module wkv6state...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv6state...
Using /home/gpuadmin/.cache/torch_extensions/py310_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/gpuadmin/.cache/torch_extensions/py310_cu124/wkv6state/build.ninja...
Building extension module wkv6state...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv6state...
Using /home/gpuadmin/.cache/torch_extensions/py310_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/gpuadmin/.cache/torch_extensions/py310_cu124/wkv6state/build.ninja...
Building extension module wkv6state...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv6state...
True
INFO:pytorch_lightning.strategies.deepspeed:initializing deepspeed distributed: GLOBAL_RANK: 3, MEMBER: 4/4
True
INFO:pytorch_lightning.strategies.deepspeed:initializing deepspeed distributed: GLOBAL_RANK: 2, MEMBER: 3/4
True
INFO:pytorch_lightning.strategies.deepspeed:initializing deepspeed distributed: GLOBAL_RANK: 1, MEMBER: 2/4
INFO:pytorch_lightning.utilities.rank_zero:Enabling DeepSpeed BF16.
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Using /home/gpuadmin/.cache/torch_extensions/py310_cu124 as PyTorch extensions root...
Using /home/gpuadmin/.cache/torch_extensions/py310_cu124 as PyTorch extensions root...
Using /home/gpuadmin/.cache/torch_extensions/py310_cu124 as PyTorch extensions root...
INFO:pytorch_lightning.utilities.rank_zero:Name of trainable parameters in optimizers:
INFO:pytorch_lightning.utilities.rank_zero:Number of trainable parameters in optimizers: 344
Using /home/gpuadmin/.cache/torch_extensions/py310_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/gpuadmin/.cache/torch_extensions/py310_cu124/fused_adam/build.ninja...
/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1964: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.027843475341796875 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.1026618480682373 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.1028742790222168 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10142779350280762 seconds
INFO:pytorch_lightning.callbacks.model_summary:
| Name | Type | Params
---------------------------------------------
0 | rwkv | RWKV | 59.7 M
1 | vit | CLIPVisionModel | 303 M
2 | proj | Linear | 327 K
3 | emb_spot | Linear | 131 K
---------------------------------------------
39.2 M Trainable params
324 M Non-trainable params
363 M Total params
1,454.667 Total estimated model params size (MB)
Epoch 0: 0%| | 0/1000 [00:00<?, ?it/s][rank2]:[W1218 22:04:39.800672233 Module.cpp:178] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
[rank0]:[W1218 22:04:40.829596628 Module.cpp:178] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
[rank1]:[W1218 22:04:40.870118621 Module.cpp:178] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
[rank3]:[W1218 22:04:40.879996123 Module.cpp:178] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py:437: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
[rank2]: Traceback (most recent call last):
[rank2]: File "/home/gpuadmin/Desktop/RWKV/MK1/train.py", line 240, in <module>
[rank2]: trainer.fit(model, data_loader)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
[rank2]: call._call_and_handle_interrupt(
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
[rank2]: return trainer_fn(*args, **kwargs)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
[rank2]: self._run(model, ckpt_path=self.ckpt_path)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
[rank2]: results = self._run_stage()
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
[rank2]: self._run_train()
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
[rank2]: self.fit_loop.run()
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
[rank2]: self.advance(*args, **kwargs)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
[rank2]: self._outputs = self.epoch_loop.run(self._data_fetcher)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
[rank2]: self.advance(*args, **kwargs)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 213, in advance
[rank2]: batch_output = self.batch_loop.run(kwargs)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
[rank2]: self.advance(*args, **kwargs)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
[rank2]: outputs = self.optimizer_loop.run(optimizers, kwargs)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
[rank2]: self.advance(*args, **kwargs)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 202, in advance
[rank2]: result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 249, in _run_optimization
[rank2]: self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 370, in _optimizer_step
[rank2]: self.trainer._call_lightning_module_hook(
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1356, in _call_lightning_module_hook
[rank2]: output = fn(*args, **kwargs)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1754, in optimizer_step
[rank2]: optimizer.step(closure=optimizer_closure)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
[rank2]: step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 280, in optimizer_step
[rank2]: optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 234, in optimizer_step
[rank2]: return self.precision_plugin.optimizer_step(
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/deepspeed.py", line 132, in optimizer_step
[rank2]: closure_result = closure()
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 149, in __call__
[rank2]: self._result = self.closure(*args, **kwargs)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 135, in closure
[rank2]: step_output = self._step_fn()
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 419, in _training_step
[rank2]: training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1494, in _call_strategy_hook
[rank2]: output = fn(*args, **kwargs)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 351, in training_step
[rank2]: return self.model(*args, **kwargs)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank2]: ret_val = func(*args, **kwargs)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1909, in forward
[rank2]: loss = self.module(*inputs, **kwargs)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/overrides/base.py", line 98, in forward
[rank2]: output = self._forward_module.training_step(*inputs, **kwargs)
[rank2]: File "/home/gpuadmin/Desktop/RWKV/MK1/src/model_state.py", line 466, in training_step
[rank2]: logits, targets = self(batch)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: File "/home/gpuadmin/Desktop/RWKV/MK1/src/model_state.py", line 424, in forward
[rank2]: logits = self.forward_without_last_image(
[rank2]: File "/home/gpuadmin/Desktop/RWKV/MK1/src/model_state.py", line 812, in forward_without_last_image
[rank2]: logits, state = self.bidirectional_forward(x)
[rank2]: File "/home/gpuadmin/Desktop/RWKV/MK1/src/model_state.py", line 453, in bidirectional_forward
[rank2]: x, state[i] = block(x, state[i])
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: File "/home/gpuadmin/Desktop/RWKV/MK1/src/model_state.py", line 262, in forward
[rank2]: x = self.ln0(x.to(torch.float32)).to(self.args.dtype)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]: return forward_call(*args, **kwargs)
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/modules/normalization.py", line 217, in forward
[rank2]: return F.layer_norm(
[rank2]: File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/functional.py", line 2900, in layer_norm
[rank2]: return torch.layer_norm(
[rank2]: RuntimeError: expected scalar type Float but found BFloat16
[rank2]: Exception raised from check_type at aten/src/ATen/core/TensorMethods.cpp:12 (most recent call first):
[rank2]: C++ CapturedTraceback:
[rank2]: #4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::string> const> (), c10::SetStackTraceFetcher(std::function<std::string ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
[rank2]: #5 c10::Error::Error(c10::SourceLocation, std::string) from ??:0
[rank2]: #6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) from ??:0
[rank2]: #7 at::(anonymous namespace)::check_type(at::TensorBase const&, c10::ScalarType, c10::basic_string_view<char>) from TensorMethods.cpp:0
[rank2]: #8 float const* at::TensorBase::const_data_ptr<float, 0>() const [clone .localalias] from TensorMethods.cpp:0
[rank2]: #9 void at::native::(anonymous namespace)::LayerNormKernelImplInternal<float, float>(at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, float, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
[rank2]: #10 at::native::(anonymous namespace)::LayerNormKernelImpl(at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, double, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
[rank2]: #11 at::native::layer_norm_cuda(at::Tensor const&, c10::ArrayRef<long>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double) from ??:0
[rank2]: #12 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__native_layer_norm(at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double) from RegisterCUDA.cpp:0
[rank2]: #13 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__native_layer_norm>, std::tuple<at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double> >, std::tuple<at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double) from RegisterCUDA.cpp:0
[rank2]: #14 at::_ops::native_layer_norm::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double) from ??:0
[rank2]: #15 torch::autograd::VariableType::(anonymous namespace)::native_layer_norm(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double) from VariableType_1.cpp:0
[rank2]: #16 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor> (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double), &torch::autograd::VariableType::(anonymous namespace)::native_layer_norm>, std::tuple<at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double> >, std::tuple<at::Tensor, at::Tensor, at::Tensor> (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double) from VariableType_1.cpp:0
[rank2]: #17 at::_ops::native_layer_norm::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double) from ??:0
[rank2]: #18 at::native::layer_norm_symint(at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double, bool) from ??:0
[rank2]: #19 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double, bool), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__layer_norm>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double, bool> >, at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double, bool) from RegisterCompositeImplicitAutograd.cpp:0
[rank2]: #20 at::_ops::layer_norm::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double, bool) from ??:0
[rank2]: #21 torch::autograd::THPVariable_layer_norm(_object*, _object*, _object*) from python_torch_functions_2.cpp:0
[rank2]: #22 cfunction_call from /usr/local/src/conda/python-3.10.16/Objects/methodobject.c:543
[rank2]: #23 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
[rank2]: #24 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
[rank2]: #25 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #26 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank2]: #27 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #28 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #29 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #30 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #31 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #32 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
[rank2]: #33 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
[rank2]: #34 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
[rank2]: #35 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
[rank2]: #36 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #37 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #38 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #39 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #40 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #41 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
[rank2]: #42 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
[rank2]: #43 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
[rank2]: #44 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
[rank2]: #45 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #46 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank2]: #47 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #48 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank2]: #49 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #50 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #51 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #52 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #53 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #54 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
[rank2]: #55 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
[rank2]: #56 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
[rank2]: #57 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
[rank2]: #58 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #59 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #60 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #61 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #62 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #63 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #64 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #65 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
[rank2]: #66 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
[rank2]: #67 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
[rank2]: #68 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #69 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #70 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #71 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #72 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #73 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #74 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #75 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #76 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
[rank2]: #77 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
[rank2]: #78 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
[rank2]: #79 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #80 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #81 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #82 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #83 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #84 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #85 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank2]: #86 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank2]: #87 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #88 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #89 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
...truncated
Could anyone help me figure this out? Thanks in advance.
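For reference, here is a minimal snippet that I believe reproduces the same error outside my training code. This is only my guess from the stack trace: the LayerNorm called at src/model_state.py:262 receives a float32 input (because of the explicit `.to(torch.float32)`), while its weight/bias seem to have been converted to bfloat16 by the DeepSpeed BF16 setup, and the CUDA layer_norm kernel rejects the mix. The hidden size 2560 below is just a placeholder, not my actual config.

```python
import torch
import torch.nn as nn

# Hypothetical repro of the dtype mismatch (my assumption, not the actual model code):
# LayerNorm weights in bf16 (as DeepSpeed BF16 would leave them) but a float32 input,
# mirroring `self.ln0(x.to(torch.float32))` from model_state.py:262.
ln = nn.LayerNorm(2560).cuda().to(torch.bfloat16)          # bf16 weight/bias
x = torch.randn(1, 16, 2560, device="cuda", dtype=torch.bfloat16)

out = ln(x.to(torch.float32))  # RuntimeError: expected scalar type Float but found BFloat16
```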