The CUDA kernel of torch.nn.CrossEntropyLoss fails when the input tensor is too large

I ran into a problem with the CUDA kernel of torch.nn.CrossEntropyLoss: its forward function fails with an illegal memory access. I have posted the issue on GitHub and hope it gets a response quickly.

For now, I work around this for large inputs by splitting the input tensor into multiple segments and calling F.cross_entropy on each segment. Is there a better way to apply cross entropy loss to large tensors?
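
The segment-wise workaround described above could look roughly like the sketch below. It is only an illustration under some assumptions: the logits have shape (N, C) with class-index targets of shape (N,), no class weights, ignore_index, or label smoothing are used, and the names chunked_cross_entropy and chunk_size are made up for this example. Summing per-segment losses with reduction="sum" and dividing by N once should match the default mean reduction under those assumptions.

```python
import torch
import torch.nn.functional as F


def chunked_cross_entropy(logits, target, chunk_size=1 << 20):
    """Mean cross entropy over (N, C) logits, computed segment by segment.

    Hypothetical helper: assumes class-index targets and no class weights,
    ignore_index, or label smoothing, so that mean == sum / N.
    """
    n = logits.shape[0]
    # Accumulate the summed loss in float32 on the same device as the logits.
    total = torch.zeros((), device=logits.device, dtype=torch.float32)
    for start in range(0, n, chunk_size):
        end = start + chunk_size
        # Slices are views, so autograd still flows back to the full tensor.
        total = total + F.cross_entropy(
            logits[start:end], target[start:end], reduction="sum"
        ).float()
    return total / n
```

On a small CPU tensor, the result can be compared against a single F.cross_entropy(logits, target) call to check that the segmented version agrees with the default mean reduction. A chunk_size that stays clear of the failing input size has to be found by experiment; the value above is an arbitrary placeholder.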

Yes, GitHub issues are seen, and yours seems to be a duplicate of "CUDA Illegal memory access on CrossEntropyLoss with large batch size, cu113, torch 1.12.1" (Issue #85005, pytorch/pytorch).