RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

I am trying to build a simple Faster R-CNN model. Everything up to the RPN section tests fine, but the forward pass of Final_net fails and I have no idea what is causing the error. Here is the link to the Kaggle notebook; the detailed error is in the third-to-last cell: Final_IMD | Kaggle

Which PyTorch version are you using? If it is an older one, could you update to 1.9.1 or the nightly and rerun your script? We were hitting a missing shape check when calling into cuBLAS via matmul ops, which might cause this downstream error.
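To illustrate what such a shape check looks for, here is a minimal sketch in plain Python. The helper name `matmul_result_shape` is hypothetical (it is not a PyTorch function), and it only covers the plain 2-D case that a single `cublasSgemm` call handles. It assumes you pass in shapes as tuples, e.g. `tuple(x.shape)`.

```python
def matmul_result_shape(a_shape, b_shape):
    """Return the shape of a 2-D product A @ B, or raise if the shapes do not fit.

    Hypothetical helper for illustration only: A is (m, k), B is (k, n),
    and the inner dimensions must agree.
    """
    if len(a_shape) != 2 or len(b_shape) != 2:
        raise ValueError(
            f"expected two 2-D shapes, got {tuple(a_shape)} and {tuple(b_shape)}"
        )
    m, k_a = a_shape
    k_b, n = b_shape
    if k_a != k_b:
        raise ValueError(
            f"inner dimensions differ: {tuple(a_shape)} @ {tuple(b_shape)} "
            f"({k_a} != {k_b})"
        )
    return (m, n)


# Example with made-up sizes: 128 flattened ROI features of length 25088
# multiplied by a (25088, 4096) weight matrix.
print(matmul_result_shape((128, 25088), (25088, 4096)))
```

If a call like this raises for the shapes going into your failing layer, the mismatch is a likely culprit. Running the same forward pass on the CPU may also give a more readable error than the cuBLAS one.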

I updated PyTorch and still get the same error.

Could you post a minimal, executable code snippet to reproduce the issue as well as the output of `python -m torch.utils.collect_env`, please?