Strange CUDA error during backward()

I am getting this error when training a customized language model on top of pretrained BERT:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-15-1cec27866970> in <module>
----> 1 trained_model = train(model, optimizer, dataloader, num_epochs=3)

<ipython-input-10-c6b5f4bdbacf> in train(model, optimizer, dataloader, num_epochs)
     31             loss = outputs[0]
     32             # In training phase, backprop and optimize
---> 33             loss.backward()
     34             optimizer.step()
     35             # Compute running loss/accuracy

~\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\tensor.py in backward(self, gradient, retain_graph, create_graph)
    183                 products. Defaults to ``False``.
    184         """
--> 185         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    186 
    187     def register_hook(self, hook):

~\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\autograd\__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
    125     Variable._execution_engine.run_backward(
    126         tensors, grad_tensors, retain_graph, create_graph,
--> 127         allow_unreachable=True)  # allow_unreachable flag
    128 
    129 

RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
Exception raised from gemm at ..\aten\src\ATen\cuda\CUDABlas.cpp:165 (most recent call first):
00007FFBAF5375A200007FFBAF537540 c10.dll!c10::Error::Error [<unknown file> @ <unknown line number>]
00007FFB3974384600007FFB39742810 torch_cuda.dll!at::native::sparse_mask_cuda [<unknown file> @ <unknown line number>]
00007FFB38C4C89700007FFB38C4B790 torch_cuda.dll!at::native::lerp_cuda_tensor_out [<unknown file> @ <unknown line number>]
00007FFB38C4E2D200007FFB38C4DD60 torch_cuda.dll!at::native::addmm_out_cuda [<unknown file> @ <unknown line number>]
00007FFB38C4F44300007FFB38C4F360 torch_cuda.dll!at::native::mm_cuda [<unknown file> @ <unknown line number>]
00007FFB397B1E6F00007FFB3974E400 torch_cuda.dll!at::native::set_storage_cuda_ [<unknown file> @ <unknown line number>]
00007FFB397A1E8200007FFB3974E400 torch_cuda.dll!at::native::set_storage_cuda_ [<unknown file> @ <unknown line number>]
00007FFB5E0CD94900007FFB5E0C8FA0 torch_cpu.dll!at::bucketize_out [<unknown file> @ <unknown line number>]
00007FFB5E10057700007FFB5E100520 torch_cpu.dll!at::mm [<unknown file> @ <unknown line number>]
00007FFB5F45EC7900007FFB5F36E010 torch_cpu.dll!torch::autograd::GraphRoot::apply [<unknown file> @ <unknown line number>]
00007FFB5DC1715700007FFB5DC16290 torch_cpu.dll!at::indexing::TensorIndex::boolean [<unknown file> @ <unknown line number>]
00007FFB5E0CD94900007FFB5E0C8FA0 torch_cpu.dll!at::bucketize_out [<unknown file> @ <unknown line number>]
00007FFB5E1E210700007FFB5E1E20B0 torch_cpu.dll!at::Tensor::mm [<unknown file> @ <unknown line number>]
00007FFB5F2D1F1600007FFB5F2D1B30 torch_cpu.dll!torch::autograd::generated::MmBackward::apply [<unknown file> @ <unknown line number>]
00007FFB5F2A7E9100007FFB5F2A7B50 torch_cpu.dll!torch::autograd::Node::operator() [<unknown file> @ <unknown line number>]
00007FFB5F80F9BA00007FFB5F80F300 torch_cpu.dll!torch::autograd::Engine::add_thread_pool_task [<unknown file> @ <unknown line number>]
00007FFB5F8103AD00007FFB5F80FFD0 torch_cpu.dll!torch::autograd::Engine::evaluate_function [<unknown file> @ <unknown line number>]
00007FFB5F814FE200007FFB5F814CA0 torch_cpu.dll!torch::autograd::Engine::thread_main [<unknown file> @ <unknown line number>]
00007FFB5F814C4100007FFB5F814BC0 torch_cpu.dll!torch::autograd::Engine::thread_init [<unknown file> @ <unknown line number>]
00007FFB78030A2700007FFB7800A100 torch_python.dll!THPShortStorage_New [<unknown file> @ <unknown line number>]
00007FFB5F80BF1400007FFB5F80B780 torch_cpu.dll!torch::autograd::Engine::get_base_engine [<unknown file> @ <unknown line number>]
00007FFBC8A6E3FE00007FFBC8A6E3A0 ucrtbase.dll!o_strcat_s [<unknown file> @ <unknown line number>]
00007FFBCB05403400007FFBCB054020 KERNEL32.DLL!BaseThreadInitThunk [<unknown file> @ <unknown line number>]
00007FFBCB8A369100007FFBCB8A3670 ntdll.dll!RtlUserThreadStart [<unknown file> @ <unknown line number>]

The error appears after 2 epochs of training with batch size 2; with any larger batch size I get a CUDA out-of-memory error instead.

Thanks in advance!

Just reduce the batch size. A CUBLAS_STATUS_INTERNAL_ERROR raised during backward() is very often a disguised out-of-memory condition: cuBLAS fails to allocate its workspace once the GPU is nearly full. Since you already hit an explicit OOM error at larger batch sizes, you are running right at the memory limit, and the allocator state after two epochs can push you over it. Reducing the batch size further (or shortening the sequence length) should make the error go away. You can also run with the environment variable CUDA_LAUNCH_BLOCKING=1 to get a synchronous, more accurate stack trace if the error persists.
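If a smaller batch hurts convergence, gradient accumulation keeps the effective batch size while lowering per-step memory use: run several small micro-batches, scale each loss, and call optimizer.step() only every few iterations. A minimal sketch below; the Linear model, shapes, and accum_steps are placeholders, not your actual BERT setup:

```python
import torch

# Tiny stand-in for the BERT-based model (placeholder, not the real model).
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

init_params = [p.detach().clone() for p in model.parameters()]

accum_steps = 4  # effective batch size = micro-batch size * accum_steps
optimizer.zero_grad()
for step in range(8):
    x = torch.randn(2, 8)           # micro-batch of 2 samples
    y = torch.randint(0, 2, (2,))
    # Scale the loss so accumulated gradients match one large-batch step.
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()                 # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()            # update once per accum_steps micro-batches
        optimizer.zero_grad()
```

This keeps only a micro-batch's activations in GPU memory at a time, which is usually enough to avoid both the OOM and the cuBLAS failure.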