torch.linalg.solve: CUDA error with large matrices

Hi, I am trying to run the following commands on CUDA but get the error shown below. The code runs fine on the CPU, and also on CUDA with a smaller matrix size (e.g. 400 instead of 4096). I am using a V100 (Volta) GPU with PyTorch 1.9 and CUDA 11.1. Can anyone recommend a workaround? Thanks.

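import torch

# A holds the right-hand sides (4096 columns), B holds the batched 3x3 systems
A = torch.randn(2, 3, 4096).cuda()
B = torch.randn(2, 3, 3).cuda()
# Solve B @ X = A for X
X = torch.linalg.solve(B, A)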

RuntimeError Traceback (most recent call last)
/tmp/ipykernel_29550/ in
1 A = torch.randn(2, 3, 4096).cuda()
2 B = torch.randn(2, 3, 3).cuda()
----> 3 X = torch.linalg.solve( B,A)
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Could you create an issue on GitHub so that we can track and fix it, please?

Thanks for the response. I have posted it on GitHub.
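
Until the issue is resolved, one possible workaround is to split the right-hand side into narrower column blocks and solve each block separately. This is only a sketch and rests on an assumption I have not verified on a V100: that the failure is triggered by the large number of right-hand-side columns, which seems plausible since a width of 400 is reported to work. The helper name `solve_in_chunks` and the default `chunk_size=256` are made up for illustration and are not part of PyTorch.

```python
import torch

def solve_in_chunks(B, A, chunk_size=256):
    """Solve B @ X = A by splitting A into column blocks along its last dim.

    Hypothetical helper: chunk_size=256 is an arbitrary value chosen to stay
    below the width (400) that is reported to work on CUDA.
    """
    # torch.split returns views of A with at most chunk_size columns each
    pieces = [torch.linalg.solve(B, a) for a in torch.split(A, chunk_size, dim=-1)]
    # Stitch the partial solutions back together in the original column order
    return torch.cat(pieces, dim=-1)

# Fall back to the CPU so the snippet also runs on machines without a GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
A = torch.randn(2, 3, 4096, device=device)
B = torch.randn(2, 3, 3, device=device)
X = solve_in_chunks(B, A)
print(X.shape)
```

Since the code is reported to work on the CPU, another option is to solve there and move the result back, e.g. `torch.linalg.solve(B.cpu(), A.cpu()).cuda()`, at the cost of the host/device transfers.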