Strange error when training RWKV

Hello. I’m training RWKV and struggling with a “CUDA: illegal memory access” error.
But a strange error message is coming from somewhere else:

sh scripts/train/rwkv3b_pretrain.sh
Current working directory: /home/gpuadmin/Desktop/RWKV/MK1
INFO:pytorch_lightning.utilities.rank_zero:########## work in progress ##########
[W1218 22:04:05.285517191 Module.cpp:178] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

[2024-12-18 22:04:05,746] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)


Using /home/gpuadmin/.cache/torch_extensions/py310_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/gpuadmin/.cache/torch_extensions/py310_cu124/wkv6state/build.ninja...
Building extension module wkv6state...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv6state...
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
True
INFO:pytorch_lightning.strategies.deepspeed:initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/4
[W1218 22:04:19.898821787 Module.cpp:178] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

[W1218 22:04:19.906643190 Module.cpp:178] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

[W1218 22:04:19.111964661 Module.cpp:178] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

[2024-12-18 22:04:19,365] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-18 22:04:19,367] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-12-18 22:04:19,585] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Using /home/gpuadmin/.cache/torch_extensions/py310_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/gpuadmin/.cache/torch_extensions/py310_cu124/wkv6state/build.ninja...
Building extension module wkv6state...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv6state...
Using /home/gpuadmin/.cache/torch_extensions/py310_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/gpuadmin/.cache/torch_extensions/py310_cu124/wkv6state/build.ninja...
Building extension module wkv6state...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv6state...
Using /home/gpuadmin/.cache/torch_extensions/py310_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/gpuadmin/.cache/torch_extensions/py310_cu124/wkv6state/build.ninja...
Building extension module wkv6state...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv6state...
True
INFO:pytorch_lightning.strategies.deepspeed:initializing deepspeed distributed: GLOBAL_RANK: 3, MEMBER: 4/4
True
INFO:pytorch_lightning.strategies.deepspeed:initializing deepspeed distributed: GLOBAL_RANK: 2, MEMBER: 3/4
True
INFO:pytorch_lightning.strategies.deepspeed:initializing deepspeed distributed: GLOBAL_RANK: 1, MEMBER: 2/4
INFO:pytorch_lightning.utilities.rank_zero:Enabling DeepSpeed BF16.
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Using /home/gpuadmin/.cache/torch_extensions/py310_cu124 as PyTorch extensions root...
Using /home/gpuadmin/.cache/torch_extensions/py310_cu124 as PyTorch extensions root...
Using /home/gpuadmin/.cache/torch_extensions/py310_cu124 as PyTorch extensions root...
INFO:pytorch_lightning.utilities.rank_zero:Name of trainable parameters in optimizers: 
INFO:pytorch_lightning.utilities.rank_zero:Number of trainable parameters in optimizers: 344
Using /home/gpuadmin/.cache/torch_extensions/py310_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/gpuadmin/.cache/torch_extensions/py310_cu124/fused_adam/build.ninja...
/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1964: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.027843475341796875 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.1026618480682373 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.1028742790222168 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10142779350280762 seconds
INFO:pytorch_lightning.callbacks.model_summary:
  | Name     | Type            | Params
---------------------------------------------
0 | rwkv     | RWKV            | 59.7 M
1 | vit      | CLIPVisionModel | 303 M 
2 | proj     | Linear          | 327 K 
3 | emb_spot | Linear          | 131 K 
---------------------------------------------
39.2 M    Trainable params
324 M     Non-trainable params
363 M     Total params
1,454.667 Total estimated model params size (MB)
Epoch 0:   0%|                                                                                                                                                  | 0/1000 [00:00<?, ?it/s][rank2]:[W1218 22:04:39.800672233 Module.cpp:178] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

[rank0]:[W1218 22:04:40.829596628 Module.cpp:178] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

[rank1]:[W1218 22:04:40.870118621 Module.cpp:178] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

[rank3]:[W1218 22:04:40.879996123 Module.cpp:178] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py:437: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
  rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/gpuadmin/Desktop/RWKV/MK1/train.py", line 240, in <module>
[rank2]:     trainer.fit(model, data_loader)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
[rank2]:     call._call_and_handle_interrupt(
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
[rank2]:     return trainer_fn(*args, **kwargs)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
[rank2]:     self._run(model, ckpt_path=self.ckpt_path)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
[rank2]:     results = self._run_stage()
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
[rank2]:     self._run_train()
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
[rank2]:     self.fit_loop.run()
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
[rank2]:     self.advance(*args, **kwargs)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
[rank2]:     self._outputs = self.epoch_loop.run(self._data_fetcher)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
[rank2]:     self.advance(*args, **kwargs)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 213, in advance
[rank2]:     batch_output = self.batch_loop.run(kwargs)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
[rank2]:     self.advance(*args, **kwargs)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
[rank2]:     outputs = self.optimizer_loop.run(optimizers, kwargs)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
[rank2]:     self.advance(*args, **kwargs)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 202, in advance
[rank2]:     result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 249, in _run_optimization
[rank2]:     self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 370, in _optimizer_step
[rank2]:     self.trainer._call_lightning_module_hook(
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1356, in _call_lightning_module_hook
[rank2]:     output = fn(*args, **kwargs)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1754, in optimizer_step
[rank2]:     optimizer.step(closure=optimizer_closure)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
[rank2]:     step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 280, in optimizer_step
[rank2]:     optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 234, in optimizer_step
[rank2]:     return self.precision_plugin.optimizer_step(
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/deepspeed.py", line 132, in optimizer_step
[rank2]:     closure_result = closure()
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 149, in __call__
[rank2]:     self._result = self.closure(*args, **kwargs)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 135, in closure
[rank2]:     step_output = self._step_fn()
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 419, in _training_step
[rank2]:     training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1494, in _call_strategy_hook
[rank2]:     output = fn(*args, **kwargs)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 351, in training_step
[rank2]:     return self.model(*args, **kwargs)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]:     return self._call_impl(*args, **kwargs)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]:     return forward_call(*args, **kwargs)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank2]:     ret_val = func(*args, **kwargs)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1909, in forward
[rank2]:     loss = self.module(*inputs, **kwargs)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]:     return self._call_impl(*args, **kwargs)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]:     return forward_call(*args, **kwargs)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/pytorch_lightning/overrides/base.py", line 98, in forward
[rank2]:     output = self._forward_module.training_step(*inputs, **kwargs)
[rank2]:   File "/home/gpuadmin/Desktop/RWKV/MK1/src/model_state.py", line 466, in training_step
[rank2]:     logits, targets = self(batch)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]:     return self._call_impl(*args, **kwargs)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]:     return forward_call(*args, **kwargs)
[rank2]:   File "/home/gpuadmin/Desktop/RWKV/MK1/src/model_state.py", line 424, in forward
[rank2]:     logits = self.forward_without_last_image(
[rank2]:   File "/home/gpuadmin/Desktop/RWKV/MK1/src/model_state.py", line 812, in forward_without_last_image
[rank2]:     logits, state = self.bidirectional_forward(x)
[rank2]:   File "/home/gpuadmin/Desktop/RWKV/MK1/src/model_state.py", line 453, in bidirectional_forward
[rank2]:     x, state[i] = block(x, state[i])
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]:     return self._call_impl(*args, **kwargs)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]:     return forward_call(*args, **kwargs)
[rank2]:   File "/home/gpuadmin/Desktop/RWKV/MK1/src/model_state.py", line 262, in forward
[rank2]:     x = self.ln0(x.to(torch.float32)).to(self.args.dtype)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank2]:     return self._call_impl(*args, **kwargs)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank2]:     return forward_call(*args, **kwargs)
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/modules/normalization.py", line 217, in forward
[rank2]:     return F.layer_norm(
[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/functional.py", line 2900, in layer_norm
[rank2]:     return torch.layer_norm(
[rank2]: RuntimeError: expected scalar type Float but found BFloat16
[rank2]: Exception raised from check_type at aten/src/ATen/core/TensorMethods.cpp:12 (most recent call first):
[rank2]: C++ CapturedTraceback:
[rank2]: #4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::string> const> (), c10::SetStackTraceFetcher(std::function<std::string ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
[rank2]: #5 c10::Error::Error(c10::SourceLocation, std::string) from ??:0
[rank2]: #6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) from ??:0
[rank2]: #7 at::(anonymous namespace)::check_type(at::TensorBase const&, c10::ScalarType, c10::basic_string_view<char>) from TensorMethods.cpp:0
[rank2]: #8 float const* at::TensorBase::const_data_ptr<float, 0>() const [clone .localalias] from TensorMethods.cpp:0
[rank2]: #9 void at::native::(anonymous namespace)::LayerNormKernelImplInternal<float, float>(at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, float, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
[rank2]: #10 at::native::(anonymous namespace)::LayerNormKernelImpl(at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, double, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
[rank2]: #11 at::native::layer_norm_cuda(at::Tensor const&, c10::ArrayRef<long>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double) from ??:0
[rank2]: #12 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__native_layer_norm(at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double) from RegisterCUDA.cpp:0
[rank2]: #13 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__native_layer_norm>, std::tuple<at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double> >, std::tuple<at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double) from RegisterCUDA.cpp:0
[rank2]: #14 at::_ops::native_layer_norm::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double) from ??:0
[rank2]: #15 torch::autograd::VariableType::(anonymous namespace)::native_layer_norm(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double) from VariableType_1.cpp:0
[rank2]: #16 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor> (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double), &torch::autograd::VariableType::(anonymous namespace)::native_layer_norm>, std::tuple<at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double> >, std::tuple<at::Tensor, at::Tensor, at::Tensor> (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double) from VariableType_1.cpp:0
[rank2]: #17 at::_ops::native_layer_norm::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double) from ??:0
[rank2]: #18 at::native::layer_norm_symint(at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double, bool) from ??:0
[rank2]: #19 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double, bool), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__layer_norm>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double, bool> >, at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double, bool) from RegisterCompositeImplicitAutograd.cpp:0
[rank2]: #20 at::_ops::layer_norm::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, double, bool) from ??:0
[rank2]: #21 torch::autograd::THPVariable_layer_norm(_object*, _object*, _object*) from python_torch_functions_2.cpp:0
[rank2]: #22 cfunction_call from /usr/local/src/conda/python-3.10.16/Objects/methodobject.c:543
[rank2]: #23 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
[rank2]: #24 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
[rank2]: #25 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #26 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank2]: #27 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #28 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #29 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #30 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #31 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #32 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
[rank2]: #33 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
[rank2]: #34 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
[rank2]: #35 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
[rank2]: #36 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #37 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #38 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #39 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #40 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #41 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
[rank2]: #42 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
[rank2]: #43 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
[rank2]: #44 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
[rank2]: #45 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #46 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank2]: #47 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #48 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank2]: #49 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #50 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #51 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #52 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #53 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #54 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
[rank2]: #55 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
[rank2]: #56 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
[rank2]: #57 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
[rank2]: #58 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #59 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #60 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #61 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #62 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #63 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #64 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #65 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
[rank2]: #66 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
[rank2]: #67 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
[rank2]: #68 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #69 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #70 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #71 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #72 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #73 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #74 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #75 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #76 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
[rank2]: #77 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
[rank2]: #78 _PyObject_Call from /usr/local/src/conda/python-3.10.16/Objects/call.c:305
[rank2]: #79 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #80 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #81 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #82 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #83 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #84 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #85 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank2]: #86 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank2]: #87 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank2]: #88 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank2]: #89 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
...truncated

Could you guys help me, please?

I don’t see the memory violation in your stacktrace, but a RuntimeError:

[rank2]:   File "/home/gpuadmin/anaconda3/envs/mk1/lib/python3.10/site-packages/torch/nn/functional.py", line 2900, in layer_norm
[rank2]:     return torch.layer_norm(
[rank2]: RuntimeError: expected scalar type Float but found BFloat16

I’m not familiar with your use case, but are you directly transforming input tensors to bfloat16 instead of using torch.autocast?
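
To illustrate the difference, here is a tiny sketch (illustrative names, not your actual code):

    # manual casting: every downstream op now sees bfloat16 inputs,
    # including dtype-sensitive ones such as F.layer_norm
    x = x.to(torch.bfloat16)
    logits = model(x)

    # autocast: PyTorch picks the dtype per op; matmuls run in bfloat16,
    # while ops on the float32 list (layer_norm among them) stay in float32
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(x)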

Umm… yes, I’m just casting with something like .to(args.dtype). Should I have used torch.autocast? Actually, I don’t know whether I’m using AMP at all. Sorry, I’m a newbie with pytorch_lightning…

‘amp_backend’: None, ‘amp_level’: None
I think I’m using the default values, which I assume means AMP is in use.

{'load_model': '', 'model_path': None, 'wandb': '', 'proj_dir': 'out/rwkv3b-v060_pretrain', 'run_name': 'demo_run', 'random_seed': -1, 'data_file': '/home/gpuadmin/Desktop/RWKV/blip_laion_cc_sbu_558k.json', 'data_type': 'json', 'vocab_size': 65536, 'ctx_len': 128, 'epoch_steps': 1000, 'epoch_count': 18, 'epoch_begin': 0, 'epoch_save': 0, 'micro_bsz': 1, 'n_layer': 12, 'n_embd': 320, 'dim_att': 320, 'dim_ffn': 1120, 'pre_ffn': 0, 'head_size_a': 64, 'head_size_divisor': 8, 'lr_init': 0.001, 'lr_final': 1e-05, 'warmup_steps': 0, 'beta1': 0.9, 'beta2': 0.99, 'adam_eps': 1e-08, 'grad_cp': 0, 'dropout': 0, 'weight_decay': 0, 'weight_decay_final': -1, 'ds_bucket_mb': 200, 'vision_tower_name': '/home/gpuadmin/Desktop/RWKV/myclip', 'image_folder': '/home/gpuadmin/Desktop/RWKV/images', 'grid_size': -1, 'detail': 'low', 'freeze_rwkv': 0, 'freeze_proj': 0, 'image_position': 'no', 'print_param_shape': 0, 'max_spots': 10, 'max_new_tokens': 128, 'stage': 1, 'pin_memory': True, 'logger': False, 'enable_checkpointing': False, 'default_root_dir': None, 'gradient_clip_val': 1.0, 'gradient_clip_algorithm': None, 'num_nodes': 1, 'num_processes': None, 'devices': '4', 'gpus': None, 'auto_select_gpus': None, 'tpu_cores': None, 'ipus': None, 'enable_progress_bar': True, 'overfit_batches': 0.0, 'track_grad_norm': -1, 'check_val_every_n_epoch': 100000000000000000000, 'fast_dev_run': False, 'accumulate_grad_batches': 1, 'max_epochs': 18, 'min_epochs': None, 'max_steps': -1, 'min_steps': None, 'max_time': None, 'limit_train_batches': None, 'limit_val_batches': None, 'limit_test_batches': None, 'limit_predict_batches': None, 'val_check_interval': None, 'log_every_n_steps': 100000000000000000000, 'accelerator': 'gpu', 'strategy': 'deepspeed', 'sync_batchnorm': False, 'precision': 'bf16', 'enable_model_summary': True, 'num_sanity_val_steps': 0, 'resume_from_checkpoint': None, 'profiler': None, 'benchmark': None, 'reload_dataloaders_every_n_epochs': 0, 'auto_lr_find': False, 'replace_sampler_ddp': False, 'detect_anomaly': False, 'auto_scale_batch_size': False, 'plugins': None, 'amp_backend': None, 'amp_level': None, 'move_metrics_to_cpu': False, 'multiple_trainloader_mode': 'max_size_cycle', 'inference_mode': True, 'my_timestamp': '2024-12-19-09-29-11', 'betas': (0.9, 0.99), 'real_bsz': 4}
These are my arguments.

If forcing a dtype change is prohibited under AMP, should I not be specifying a dtype at all? Right now I’m forcing the dtype in both the dataset and the model implementation: the dataset casts based on args.dtype, and the model casts based on the dtype of other variables that are supposed to share the same dtype.
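
For example, the model has casts like this one in model_state.py (the same line that shows up in the traceback above):

    x = self.ln0(x.to(torch.float32)).to(self.args.dtype)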

I’ve tried removing the dtype specification, but that doesn’t work either…
It’s possible that my dataset and model are declared incorrectly, so here are the lines of code from the declaration part.

    import torch
    from torch.utils.data import DataLoader
    from pytorch_lightning import Trainer
    from pytorch_lightning.utilities import rank_zero_info

    from src.trainer import train_callback
    from src.dataset import MyDataset
    from src.rwkv_tokenizer import TRIE_TOKENIZER
    from transformers import AutoImageProcessor

    args.tokenizer = TRIE_TOKENIZER("src/rwkv_vocab_v20230424.txt")
    args.image_processor = AutoImageProcessor.from_pretrained(args.vision_tower_name)

    train_data = MyDataset(args)
    args.vocab_size = train_data.vocab_size

    from src.model_state import RWKV_II
    # 256gb cpu memory is not enough for 8 gpus
    # to use 6 gpus on 256gb cpu memory, use .half() to save memory
    model = RWKV_II(args).half()
    if args.model_path:
        msg = model.load_state_dict(torch.load(args.model_path, map_location='cpu'), strict=False)
        rank_zero_info(f"loading visual rwkv model from {args.model_path}: {msg}")
    if args.freeze_rwkv > 0:
        model.freeze_rwkv(args.freeze_rwkv)
    if args.freeze_proj > 0:
        model.freeze_proj()
    model.freeze_emb() # freeze emb all the time

    trainer = Trainer.from_argparse_args(args, callbacks=[train_callback(args)])

    if "deepspeed" in args.strategy:
        trainer.strategy.config["zero_optimization"]["allgather_bucket_size"] = args.ds_bucket_mb * 1000 * 1000
        trainer.strategy.config["zero_optimization"]["reduce_bucket_size"] = args.ds_bucket_mb * 1000 * 1000
    
    data_loader = DataLoader(train_data, shuffle=False, pin_memory=True, batch_size=args.micro_bsz, num_workers=1, 
                             persistent_workers=False, drop_last=True)
    from pytorch_lightning.strategies import DeepSpeedStrategy
    print(isinstance(trainer.strategy, DeepSpeedStrategy))
    trainer.fit(model, data_loader)

I deleted some print lines that I had added for debugging.

I’ll hate F.layer_norm from now on

☹️

Manually casting the inputs and the model might work, but the recommended approach is to use torch.autocast and let PyTorch perform the casts for you.
Take a look at torch.amp for more information and examples, and let me know if it fixes your error.
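
For reference, here is a minimal, self-contained sketch of the pattern (toy model and shapes, not your training code): keep the parameters and inputs in float32 and let autocast downcast per op, so dtype-sensitive ops such as F.layer_norm still run in float32. If Lightning’s precision='bf16' (or DeepSpeed’s BF16 mode) is already managing dtypes for you, manual .half() / .to(args.dtype) casts on top of it can easily conflict.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    device = "cuda"
    # parameters stay in float32; autocast handles the per-op downcasting
    model = nn.Sequential(nn.Linear(320, 320), nn.LayerNorm(320)).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    x = torch.randn(4, 320, device=device)       # inputs stay float32, no manual .to(bfloat16)
    target = torch.randn(4, 320, device=device)

    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        out = model(x)                           # Linear runs in bf16, LayerNorm in fp32
        loss = F.mse_loss(out, target)           # mse_loss is on the float32 list as well

    loss.backward()                              # gradients come out in float32
    optimizer.step()                             # no GradScaler needed for bfloat16
    optimizer.zero_grad()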