PyTorch tensor abnormal, how to debug?

I have a program that produces a tensor. I can print the shape and dtype of this tensor and they look fine, but when I try to print the value of the tensor with print(x), I get this error message:

(Pdb) *** RuntimeError: numel: integer multiplication overflow
(Pdb) *** RuntimeError: numel: integer multiplication overflow
(Pdb) *** RuntimeError: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect

But I do not know when this happened; it only becomes an error when the tensor is accessed.

How to figure this out? Thanks a lot!

Could you post a minimal and executable code snippet reproducing the error as well as the output of python -m torch.utils.collect_env, please?
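In the meantime, since the error message itself notes that CUDA errors can be reported asynchronously, you could rerun with blocking launches so the stack trace points at the real failing op (a generic debugging step, not specific to this issue):

```python
import os

# Must be set before CUDA is initialized, i.e. before the first CUDA call
# (ideally before importing torch). Every kernel launch then synchronizes,
# so the Python stack trace points at the op that actually failed instead
# of a later call that merely surfaced the asynchronous error.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

Note this serializes all kernel launches and slows the program down, so use it only while debugging.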

Thanks for your reply! It requires several libraries to reproduce this; if it's really needed, I will provide one.
Meanwhile, here is some more info about the error:

the program needs to compute q = self.q_norm(q.transpose(1, 2).flatten(-2, -1)), and I inspected the input at this line in pdb:

-> q = self.q_norm(q.transpose(1, 2).flatten(-2, -1))[0].view(B_, N_, H_, D_).transpose(1, 2)
(Pdb) p q.shape
torch.Size([2, 25, 257, 48])
(Pdb) p q.dtype
torch.bfloat16
(Pdb) p q
tensor([[[[-1.2506e-12, -1.5010e-13,  3.3396e-13,  ...,  5.8265e-13,
           -1.3589e-13, -5.2580e-13],
          [-1.2434e-12, -1.5454e-13,  3.3218e-13,  ...,  5.7909e-13,
           -1.4211e-13, -5.3646e-13],
          [-1.2506e-12, -1.4833e-13,  3.3573e-13,  ...,  5.8620e-13,
           -1.4033e-13, -5.3291e-13],

The tensor looks good to me, but if I run self.q_norm I get: (Pdb) *** RuntimeError: CUDA error: misaligned address

Without a code snippet to be able to reproduce it, it might not be possible for me to debug the issue.
Since you are apparently able to reproduce it, would it be possible to check your workload with compute-sanitizer python <args> to see if it detects any issues?

Thanks. I ran it with compute-sanitizer and got:

========= Invalid __global__ read of size 16 bytes
=========     at 0x1e0 in void layer_norm::ln_fwd_kernel<layer_norm::Kernel_traits<__nv_bfloat16, __nv_bfloat16, __nv_bfloat16, __nv_bfloat16, float, unsigned int, (unsigned int)1280, (unsigned int)1, (unsigned int)4, (unsigned int)1, (unsigned int)16, layer_norm::Kernel_traits_base<(unsigned int)1280, __nv_bfloat16, __nv_bfloat16, __nv_bfloat16, __nv_bfloat16, float, unsigned int, (unsigned int)128>>, (bool)0, (bool)0, (bool)0, (bool)0>(layer_norm::FwdParams)
=========     by thread (105,0,0) in block (26,0,0)
=========     Address 0x7f02facf1832 is misaligned
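A quick check on the reported address (just modular arithmetic, nothing kernel-specific) shows it sits 2 bytes past a 16-byte boundary, i.e. exactly one bfloat16 element off:

```python
# Faulting address from the compute-sanitizer report above
addr = 0x7F02FACF1832

print(addr % 16)         # 2 -> 2 bytes past a 16-byte boundary
print((addr % 16) // 2)  # 1 -> off by one bfloat16 element (2 bytes each)
```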

Thank you! Could you post the configs of all layernorm layers and the activation input shape so that I could try to create a code snippet to reproduce it?

Thanks. It requires some packages to reproduce this, so it might be complicated:


PyTorch 1.12 + cu116, or PyTorch 1.13 + cu117 + gcc 10.2


# modified deepspeed
git clone
cd DeepSpeed 
git checkout fmoe_v0.9.0
pip install .

git clone
cd fastmoe
# make sure you have NCCL; if not, export the NCCL path (both 2.10 and 2.14 were tested)
export NCCL_PATH=/mnt/.../nccl_2.10.3-1+cuda11.4_x86_64/

# install on a GPU machine
python setup.py install

# flash attn:
git clone
cd flash-attention
# install on a GPU machine
python setup.py install
# install the layer norm CUDA ext
# these might get stuck, maybe related to the g++ version
cd ../fused_dense_lib
python setup.py install

cd csrc/layer_norm
python setup.py install

and finally, the script:
TorchDistPackage/explore/ at 2439664b41f69d9f228fef5bb2c0b40808009b87 · KimmiShi/TorchDistPackage · GitHub
Run this script with a GPU. If you are not using slurm, you need to replace the call at line +602, setup_distributed_slurm(), with your own version that calls torch.distributed.init_process_group(...).

The reason the repro script requires so many dependencies is that the bug only shows up when using fmoe + deepspeed + flash-attn layernorm together. With only deepspeed + flash-attn layernorm there is no problem.

Hi, I have some new info:

I modified the code to only create the problematic module (which may load a CUDA extension) but not use it for any computation, and the problem persists; when I remove the module completely, the program runs fine.

I will investigate the following two possibilities:

  1. the problematic module's parameters cause the bug
  2. the CUDA extension loaded by the problematic module causes the bug

It seems that an nn.Linear(dim, 1) in a module unrelated to layer_norm causes the problem, and the shape of the nn.Linear matters.
I tried nn.Linear(dim, 2) and nn.Linear(dim, 4); these do not work either.

Only when the output dim is a multiple of 8 does the program work.
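One speculative way the Linear's shape could matter (an assumption on my side, not verified: suppose parameters are packed back-to-back into a single flat bfloat16 buffer, as DeepSpeed's parameter flattening does): any tensor whose element count is not a multiple of 8 shifts every later tensor's byte offset off the 16-byte boundaries that a vectorized kernel like ln_fwd_kernel loads on.

```python
ELEM_BYTES = 2  # bytes per bfloat16 element

def packed_offsets(numels):
    """Byte offset of each tensor when packed back-to-back in one flat buffer."""
    offsets, cursor = [], 0
    for n in numels:
        offsets.append(cursor)
        cursor += n * ELEM_BYTES
    return offsets

# nn.Linear(1200, 1): weight has 1200 elements, bias has 1. A hypothetical
# layernorm gamma of 1200 elements packed after them lands 2 bytes off a
# 16-byte boundary because of the odd-sized bias.
print([o % 16 for o in packed_offsets([1200, 1, 1200])])  # [0, 0, 2]
```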

I guess this is related to memory management, and maybe to DeepSpeed.

No, I don’t think the linear layer is related and believe any CUDA operation could be the victim after the problematic ln_fwd_kernel kernel call as it could create a sticky error.
Were you able to pull out the layernorm settings?

Hi @ptrblck ,
So you mean that ln_fwd_kernel is the real problem?

About this op's settings:

For the Python API: flash-attention/flash_attn/ops/ at main · HazyResearch/flash-attention
This Python module is created with flash_attn.ops.rms_norm.DropoutAddRMSNorm(1200, eps=1e-06, prenorm=True)

and it calls into flash-attention/flash_attn/ops/ at main · HazyResearch/flash-attention · GitHub

zmat, xmat, dmask, mu, rsigma = dropout_layer_norm.dropout_add_ln_fwd(
        x0mat, residualmat, gamma, beta, rowscale, colscale, None, None, dropout_p, epsilon,
        1.0, 0, None, residual_in_fp32, is_rms_norm)

with args:

x0mat.shape: torch.Size([514, 1200])
gamma.shape: torch.Size([1200])
residualmat: None
rowscale: None
colscale: None
dropout_p: 0.0
is_rms_norm: True
residual_in_fp32: False
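For what it's worth, the shape itself should be compatible with 16-byte loads: at 2 bytes per bfloat16 element, each contiguous row of x0mat spans 1200 × 2 = 2400 bytes, a multiple of 16, so every row starts on a 16-byte boundary provided the base pointer does. A quick sanity check:

```python
ELEM_BYTES = 2          # bytes per bfloat16 element
rows, cols = 514, 1200  # x0mat.shape

row_bytes = cols * ELEM_BYTES
print(row_bytes)        # 2400
print(row_bytes % 16)   # 0 -> rows stay aligned iff the base pointer is aligned
```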

Yes, compute-sanitizer shows the root cause in ln_fwd_kernel; every other operation afterwards might only be the victim.

Does it make sense that ln_fwd_kernel requires 16-byte alignment but the input data does not meet this requirement? Is it reasonable for this kernel to have this requirement? Do I need to fix the kernel implementation? Any suggestions?

I doubt the input data is wrong and I guess some internal buffers might be misaligned, but I also didn’t take a deeper look into the code.

It’s reasonable to expect an (internal) alignment, but it’s unreasonable to expect something from the user since users do not explicitly define any memory alignment guarantees in their code.

Assuming you are familiar with CUDA you could build the extension with debug symbols to narrow down which operation fails and why.
I also see you’ve created an issue in their repository so I would also expect the code owners to take a look.

Thanks a lot! I will try to dive deeper into this bug and update this post if I have any new findings.