free(): invalid pointer, Aborted (core dumped) on loss.backward()

I am trying to build a next-word prediction model with a bidirectional LSTM, using AdaptiveLogSoftmaxWithLoss and the Adam optimizer. However, after processing 40 batches of data, I’m getting free(): invalid pointer, Aborted (core dumped). The code runs fine after removing the loss.backward() call, so I concluded the error has something to do with that line. I ran the code under gdb and here’s the trace.

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7de5859 in __GI_abort () at abort.c:79
#2  0x00007ffff7e503ee in __libc_message (action=action@entry=do_abort, 
    fmt=fmt@entry=0x7ffff7f7a285 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007ffff7e5847c in malloc_printerr (
    str=str@entry=0x7ffff7f7c9f8 "malloc(): memory corruption (fast)")
    at malloc.c:5347
#4  0x00007ffff7e5b5bc in _int_malloc (av=av@entry=0x7ffe7c000020, 
    bytes=bytes@entry=32) at malloc.c:3594
#5  0x00007ffff7e5ed15 in __libc_calloc (n=<optimized out>, 
    elem_size=<optimized out>) at malloc.c:3428
#6  0x00007fff338064f6 in cudnnHostCalloc(unsigned long, unsigned long) ()
   from /home/akib/.local/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so
#7  0x00007fff33b77d0e in RNNBackwardData<float, float, float>::init(cudnnContext*, cudnnRNNStruct*, int, PerfOptions) ()
   from /home/akib/.local/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so
#8  0x00007fff32901097 in cudnnStatus_t RNN_DGRAD_LaunchTemplate<float, float, float, true>(cudnnContext*, cudnnRNNStruct*, int, cudnnTensorStruct* const*, void const*, void const*, void const*, void const*, void const*, void const*, cudnnTensorStruct* const*, void*, void*, void*, void const*, void*, void*, PerfOptions, bool) ()
   from /home/akib/.local/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so
#9  0x00007fff33b57b75 in cudnnRNNBackwardData () from /home/akib/.local/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so
#10 0x00007fff32d175a8 in at::native::_cudnn_rnn_backward_input(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, long, long, bool, double, bool, bool, c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, std::array<bool, 3ul>) () from /home/akib/.local/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so
#11 0x00007fff32d1acf0 in at::native::_cudnn_rnn_backward(at::Tensor const&, c10::ArrayRef<at::Tensor>, long, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, long, long, bool, double, bool, bool, c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, std::array<bool, 4ul>) () from /home/akib/.local/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so
#12 0x00007fff9506d854 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, std::vector<at::Tensor, std::allocator<at::Tensor> > > (at::Tensor const&, c10::ArrayRef<at::Tensor>, long, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, long, long, long, long, bool, double, bool, bool, c10::ArrayRef<long>, c10::optional<at::Tensor> const&, at::Tensor const&, std::array<bool, 4ul>), &c10::impl::detail::with_explicit_optional_tensors_<std::tuple<at::Tensor, at::Tensor, at::Tensor, std::vector<at::Tensor, std::allocator<at::Tensor> > > (at::Tensor const&, c10::ArrayRef<at::Tensor>, long, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, long, long, long, long, bool, double, bool, bool, c10::ArrayRef<long>, c10::optional<at::Tensor> const&, at::Tensor const&, std::array<bool, 4ul>), std::tuple<at::Tensor, at::Tensor, at::Tensor, std::vector<at::Tensor, std::allocator<at::Tensor> > > (at::Tensor const&, c10::ArrayRef<at::Tensor>, long, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, long, long, bool, double, bool, bool, c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, std::array<bool, 4ul>), c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, std::vector<at::Tensor, std::allocator<at::Tensor> > > (at::Tensor const&, c10::ArrayRef<at::Tensor>, long, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, long, long, bool, double, bool, bool, c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, std::array<bool, 4ul>), &at::(anonymous namespace)::(anonymous namespace)::wrapper__cudnn_rnn_backward> >::wrapper>, std::tuple<at::Tensor, at::Tensor, at::Tensor, std::vector<at::Tensor, std::allocator<at::Tensor> > >, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<at::Tensor>, long, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, long, long, long, long, bool, double, bool, bool, c10::ArrayRef<long>, c10::optional<at::Tensor> const&, at::Tensor const&, std::array<bool, 4ul> > >, std::tuple<at::Tensor, at::Tensor, at::Tensor, std::vector<at::Tensor, std::allocator<at::Tensor> > > (at::Tensor const&, c10::ArrayRef<at::Tensor>, long, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, long, long, long, long, bool, double, bool, bool, c10::ArrayRef<long>, c10::optional<at::Tensor> const&, at::Tensor const&, std::array<bool, 4ul>)>::call(c10::OperatorKernel*, at::Tensor const&, c10::ArrayRef<at::Tensor>, long, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, long, long, long, long, bool, double, bool, bool, c10::ArrayRef<long>, c10::optional<at::Tensor> 
const&, at::Tensor const&, std::array<bool, 4ul>) () from /home/akib/.local/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cu.so
#13 0x00007fff834b6687 in at::_cudnn_rnn_backward(at::Tensor const&, c10::ArrayRef<at::Tensor>, long, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, long, long, long, long, bool, double, bool, bool, c10::ArrayRef<long>, c10::optional<at::Tensor> const&, at::Tensor const&, std::array<bool, 4ul>) () from /home/akib/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#14 0x00007fff84cc3c59 in torch::autograd::VariableType::(anonymous namespace)::_cudnn_rnn_backward(at::Tensor const&, c10::ArrayRef<at::Tensor>, long, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, long, long, long, long, bool, double, bool, bool, c10::ArrayRef<long>, c10::optional<at::Tensor> const&, at::Tensor const&, std::array<bool, 4ul>) () from /home/akib/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#15 0x00007fff84cc4582 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor, std::vector<at::Tensor, std::allocator<at::Tensor> > > (at::Tensor const&, c10::ArrayRef<at::Tensor>, long, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, long, long, long, long, bool, double, bool, bool, c10::ArrayRef<long>, c10::optional<at::Tensor> const&, at::Tensor const&, std::array<bool, 4ul>), &torch::autograd::VariableType::(anonymous namespace)::_cudnn_rnn_backward>, std::tuple<at::Tensor, at::Tensor, at::Tensor, std::vector<at::Tensor, std::allocator<at::Tensor> > >, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<at::Tensor>, long, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, long, long, long, long, bool, double, bool, bool, c10::ArrayRef<long>, c10::optional<at::Tensor> const&, at::Tensor const&, std::array<bool, 4ul> > >, std::tuple<at::Tensor, at::Tensor, at::Tensor, std::vector<at::Tensor, std::allocator<at::Tensor> > > (at::Tensor const&, c10::ArrayRef<at::Tensor>, long, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, long, long, long, long, bool, double, bool, bool, c10::ArrayRef<long>, c10::optional<at::Tensor> const&, at::Tensor const&, std::array<bool, 4ul>)>::call(c10::OperatorKernel*, at::Tensor const&, c10::ArrayRef<at::Tensor>, long, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, long, long, long, long, bool, double, bool, bool, c10::ArrayRef<long>, c10::optional<at::Tensor> const&, at::Tensor const&, std::array<bool, 4ul>) () from /home/akib/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#16 0x00007fff834b6687 in at::_cudnn_rnn_backward(at::Tensor const&, c10::ArrayRef<at::Tensor>, long, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, long, long, long, long, bool, double, bool, bool, c10::ArrayRef<long>, c10::optional<at::Tensor> const&, at::Tensor const&, std::array<bool, 4ul>) () from /home/akib/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#17 0x00007fff84c08026 in torch::autograd::generated::CudnnRnnBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) () from /home/akib/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#18 0x00007fff85275771 in torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) () from /home/akib/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#19 0x00007fff8527157b in torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) () from /home/akib/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#20 0x00007fff8527219f in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) () from /home/akib/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#21 0x00007fff85269979 in torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) () from /home/akib/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#22 0x00007ffff59ed293 in torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) () from /home/akib/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so
#23 0x00007ffff6a04d84 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#24 0x00007ffff7da6609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#25 0x00007ffff7ee2293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I searched for solutions and tried reinstalling PyTorch and adding gc.collect() before calling backward() (a sketch of that attempt follows the snippet below), but the error persists. I’m adding a snippet of my code for further reference:

class BLSTM(torch.nn.Module):
    def __init__(self, emb_size, hidden_size, num_layers, vocab_size, cutoffs):
        super(BLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.blstm = torch.nn.LSTM(emb_size, hidden_size, num_layers, batch_first=True, bidirectional=True)
        self.adasoft = torch.nn.AdaptiveLogSoftmaxWithLoss(hidden_size*2, vocab_size, cutoffs)
    
    def forward(self, x, targets):
        # Set initial states
        h0 = torch.zeros(self.num_layers*2, x.size(0), self.hidden_size).to(device)
        c0 = torch.zeros(self.num_layers*2, x.size(0), self.hidden_size).to(device)

        # Embed the input tokens (note: a new Embedding layer is instantiated on every forward call here)
        embed = torch.nn.Embedding(vocab_size, emb_size).to(device)

        # Forward propagate the bidirectional LSTM
        out, _ = self.blstm(embed(x), (h0, c0))  # out: tensor of shape (batch_size, seq_length, hidden_size*2)
        
        # Decode the hidden state of the last time step
        out = self.adasoft(out[:, -1, :], targets)
        
        return out

# Training step for one batch
optimizer.zero_grad()
outputs = model(input_seq, target_seq)
outputs.loss.backward()
optimizer.step()
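
For reference, the gc.collect() workaround mentioned above looked roughly like this (the crash still occurred):

import gc

optimizer.zero_grad()
outputs = model(input_seq, target_seq)
gc.collect()  # attempted workaround: force a garbage-collection pass before backward()
outputs.loss.backward()
optimizer.step()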

I am using CUDA 11.2 on an Ubuntu 20.04 machine.

It seems that a cudnnHostCalloc call is failing. How did you install PyTorch and which cudnn version are you using?
Since you mentioned CUDA11.2, did you build PyTorch from source or are you using the binaries?
Also, which GPU are you using?
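
For example, you could print the versions that the PyTorch binaries themselves report (these are the ones actually used, independent of any system-wide install):

import torch

print(torch.__version__)                # e.g. 1.8.1+cu111
print(torch.version.cuda)               # CUDA runtime the binaries were built with
print(torch.backends.cudnn.version())   # bundled cudnn version, e.g. 8005 for 8.0.5
print(torch.cuda.get_device_name(0))    # name of the detected GPU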

Here is the output of nvcc --version on my machine:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

I installed PyTorch through pip using the command generated by pytorch.org:

pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

My GPU is an NVIDIA GeForce GTX 1050 Ti.

However, I’m not sure how to check the cuDNN version. This is what I see after running the installation file (“libcudnn8_8.1.1.33-1+cuda11.2_amd64.deb”):

[screenshot of the installation output, 2021-03-30]

I have never set up a GPU before, so to be honest I’m not sure what I’m doing.

Thanks for the update.
Based on your description you’ve installed the pip wheels, which ship with their CUDA runtime as well as cudnn (8.0.5). Your local CUDA toolkit and cudnn won’t be used in this case.

Since you are using a local CUDA 10.1 installation, I guess your NVIDIA driver might be a bit older?
Could you check it via nvidia-smi and make sure it’s new enough to run the CUDA 11.1 runtime from the pip wheels? (Table 1 gives you an overview of the driver requirements.)

Thanks for the heads-up. Here is the nvidia-smi result. The driver version is 460, which should be able to run CUDA 11.2 according to the Table 1 you mentioned.