I am training my models on
Google Colab with
batch_size = 128, and after 1 epoch it fails with this problem. I don't know how to fix it with the same batch_size (reducing batch_size to 32 avoids the problem). Here is the Colab spec:
Driver Version: 460.32.03, CUDA Version: 11.2
You can find my notebook here.
Thanks for your help.
It seems that one of your operands is too large to fit in int32 (or negative, but that seems unlikely).
I thought recent PyTorch versions would give a better error (but not work around it):
```python
import torch

LARGE = 2**31 + 1
for i, j, k in [(1, 1, LARGE), (1, LARGE, 1), (LARGE, 1, 1)]:
    inp = torch.randn(i, k, device="cuda", dtype=torch.half)
    weight = torch.randn(j, k, device="cuda", dtype=torch.half)
    try:
        torch.nn.functional.linear(inp, weight)
    except RuntimeError as e:
        print(e)
    del inp
    del weight
```
```
at::cuda::blas::gemm<float> argument k must be non-negative and less than 2147483647 but got 2147483649
at::cuda::blas::gemm<float> argument m must be non-negative and less than 2147483647 but got 2147483649
at::cuda::blas::gemm<float> argument n must be non-negative and less than 2147483647 but got 2147483649
```
But that doesn't work around the problem. (It needs a lot of memory to trigger the bug…)
Maybe you can get a credible backtrace and record the input shapes to the operation that fails.
So what can I do to solve this problem? All I know is to reduce the batch size.
In order of difficulty:
- make batch size smaller,
- make a minimal reproducing example (i.e. just two or three inputs from torch.randn and the call to torch.nn.functional.linear) and file a bug,
- hot-patch torch.nn.functional.linear with a workaround (splitting the operation into multiple linear or matmul calls),
- submit a PR with a fix in PyTorch and discuss whether you can add a test or whether it'd take a prohibitively large amount of GPU memory to run (or hire someone to do so).
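For the hot-patch option, here is a minimal sketch of the splitting workaround (the helper name `chunked_linear` and the chunk size are illustrative, not PyTorch API): splitting the batch dimension keeps each underlying GEMM well below the int32 limit, and concatenating the pieces gives the same result.

```python
import torch

def chunked_linear(inp, weight, bias=None, chunk_size=2**16):
    # Split the input along the batch dimension so each GEMM stays
    # below the int32 limit, then concatenate the partial results.
    outs = [torch.nn.functional.linear(part, weight, bias)
            for part in inp.split(chunk_size, dim=0)]
    return torch.cat(outs, dim=0)

# Small CPU check that the chunked version matches the direct call.
x = torch.randn(10, 4)
w = torch.randn(3, 4)
print(torch.allclose(chunked_linear(x, w, chunk_size=3),
                     torch.nn.functional.linear(x, w)))  # True
```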
Thanks for your help.
For people getting this error and ending up on this post: note that it can also be caused by a mismatch between the dimensions of your input tensor and the dimensions of your nn.Linear module (e.g. x.shape = (a, b) and nn.Linear(c, c, bias=False) with c not matching b).
It is a bit sad that PyTorch doesn't give a more explicit error message.
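For reference, a small CPU repro of that mismatch case (the shapes here are arbitrary): recent PyTorch versions do raise a RuntimeError naming the incompatible shapes on CPU.

```python
import torch

# Input with 8 features fed to a layer that expects 16 input features.
x = torch.randn(4, 8)
layer = torch.nn.Linear(16, 16, bias=False)
try:
    layer(x)
    caught = None
except RuntimeError as e:
    caught = e  # e.g. "mat1 and mat2 shapes cannot be multiplied"
print(type(caught).__name__)  # RuntimeError
```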
@Jeremy_Cochoy This was really helpful. Solved my issue.
@Jeremy_Cochoy Thanks for your comments!
I have added an nn.Linear(512, 10) layer to my model, and the shape of the input that goes into this layer is torch.Size([32, 512, 1, 1]). I have tried reducing the batch size from 128 to 64 and now to 32, but each of these gives me the same error.
Any idea what could be going wrong?
I think you want to rearrange the dimensions of your input tensor before the layer (the Linear — PyTorch 1.9.0 documentation says it expects an N×*×C_in tensor, and you are giving it a 32×…×1 tensor).
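As a concrete sketch of making the shapes line up: since the trailing dimensions here are both size 1, flattening them away (rather than transposing) puts the 512 channels in the last position, which is where nn.Linear applies its weight.

```python
import torch

layer = torch.nn.Linear(512, 10)
x = torch.randn(32, 512, 1, 1)  # e.g. output of a conv + pooling stack

# nn.Linear acts on the last dimension, so collapse (512, 1, 1) to 512:
out = layer(torch.flatten(x, 1))  # (32, 512) -> (32, 10)
print(out.shape)  # torch.Size([32, 10])
```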