Error in `python': double free or corruption (fasttop)

Issue:
I am getting a memory error (double free or corruption, after which the process aborts) when I train an RNN model with num_workers>0 on the DataLoader. The data is loaded from a Kyoto Cabinet file using a custom collate function.

Question:
What can I do to solve this problem?

Trying to isolate the issue:
If I comment out the loss/optimization code (nn.BCELoss, backward, Adam) inside the training loop over mini-batches, the code runs fine for several epochs. If I train with num_workers=0 on the DataLoader, there are no issues either, and I can train for several epochs (although very slowly). With the loss/optimization code enabled and num_workers>0, training usually gets through the first half of the mini-batches and then fails with the memory error in the second half, always during the first epoch.
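Since the crash only shows up with num_workers>0, one thing that may be worth ruling out is a database handle that is opened once in the parent process and then shared with the forked DataLoader workers. Whether that is actually what happens here is not established by the trace. The sketch below only illustrates the pattern of opening the handle lazily, once per process. The names PerProcessHandle, RecordDataset, and open_fake_db are hypothetical, and a plain dict stands in for the Kyoto Cabinet file so the example runs with the standard library alone.

```python
import os


class PerProcessHandle:
    """Open a resource on first use, and reopen it if the process ID changes
    (for example inside a forked worker)."""

    def __init__(self, opener):
        self._opener = opener  # zero-argument callable returning an open handle
        self._handle = None
        self._pid = None

    def get(self):
        pid = os.getpid()
        if self._handle is None or self._pid != pid:
            # First access in this process: open a fresh handle here
            # instead of reusing one inherited from the parent.
            self._handle = self._opener()
            self._pid = pid
        return self._handle

    def __getstate__(self):
        # Never pickle an open handle; a copy reopens on first access.
        return {"_opener": self._opener, "_handle": None, "_pid": None}


class RecordDataset:
    """Map-style dataset shape (__len__ and __getitem__)."""

    def __init__(self, keys, opener):
        self.keys = list(keys)
        self._db = PerProcessHandle(opener)

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        return self._db.get()[self.keys[idx]]


# Stand-in for opening the real database file (assumption: a dict lookup
# replaces the actual Kyoto Cabinet read).
opened = []


def open_fake_db():
    opened.append(os.getpid())
    return {"a": 1, "b": 2}


dataset = RecordDataset(["a", "b"], open_fake_db)
first = dataset[0]   # the handle is opened here, not in __init__
second = dataset[1]  # the same handle is reused within this process
```

An object of this shape can be passed to the DataLoader together with the custom collate function. Each worker would then open its own handle the first time it reads an item.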

Environment:

PyTorch Version: 1.1.0
conda install pytorch torchvision cudatoolkit=10.0 -c pytorch

Anaconda version: Anaconda3-2019.03-Linux-x86_64.sh

$ python --version
Python 3.7.3

$ cat /usr/local/cuda/version.txt
CUDA Version 10.1.168

$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

$ gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
Copyright © 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
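Note that the conda command above installs cudatoolkit=10.0, while /usr/local/cuda reports 10.1.168, and the backtrace below loads libcudart.so.10.0 from the PyTorch package directory. Whether this difference matters for the crash is not established. The sketch below is one way to print the versions PyTorch itself reports, so they can be compared with the system values. The function name describe_runtime is hypothetical, and the import is guarded so the snippet also runs where PyTorch is not installed.

```python
import importlib


def describe_runtime():
    """Collect the versions PyTorch reports for itself, CUDA, and cuDNN."""
    info = {"torch": None}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in this interpreter.
        return info
    info["torch"] = torch.__version__
    info["cuda_runtime"] = torch.version.cuda        # CUDA version the build uses
    info["cudnn"] = torch.backends.cudnn.version()   # None if cuDNN is unavailable
    info["cuda_available"] = torch.cuda.is_available()
    return info


for name, value in describe_runtime().items():
    print(f"{name}: {value}")
```

Running python -m torch.utils.collect_env gives a fuller report, including the driver version.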

Error log:
*** Error in `python': double free or corruption (fasttop): 0x00007f39fc04af40 *** time: 0:12:22
======= Backtrace: =========
/usr/lib64/libc.so.6(+0x81609)[0x7f3b58f82609]
/usr/lib64/libcuda.so.1(+0x2011f2)[0x7f3afb3c71f2]
/usr/lib64/libcuda.so.1(+0x10abc2)[0x7f3afb2d0bc2]
/usr/lib64/libcuda.so.1(+0x10ac39)[0x7f3afb2d0c39]
/usr/lib64/libcuda.so.1(cuStreamCreate+0x5b)[0x7f3afb41635b]
/opt/mapr/tools/python/anaconda3/lib/python3.7/site-packages/torch/lib/…/…/…/…/libcudart.so.10.0(+0xffa2)[0x7f3b430ecfa2]
/opt/mapr/tools/python/anaconda3/lib/python3.7/site-packages/torch/lib/…/…/…/…/libcudart.so.10.0(cudaStreamCreate+0x64)[0x7f3b43126874]
/opt/mapr/tools/python/anaconda3/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so(_ZN17RNNBackwardFilterIfffE4initEP12cudnnContextP14cudnnRNNStructi11PerfOptions+0x3b0)[0x7f3b16a725c0]
/opt/mapr/tools/python/anaconda3/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so(_Z24RNN_WGRAD_LaunchTemplateIfffE13cudnnStatus_tP12cudnnContextP14cudnnRNNStructiPKP17cudnnTensorStructPKvSA_SA_SA_mPvSA_m11PerfOptions+0x80)[0x7f3b16a76270]
/opt/mapr/tools/python/anaconda3/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so(cudnnRNNBackwardWeights+0xf02)[0x7f3b16a71602]
/opt/mapr/tools/python/anaconda3/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so(_ZN2at6native26_cudnn_rnn_backward_weightERKNS_6TensorEN3c108ArrayRefIS1_EElS3_S3_S3_S3_lllbdbbNS5_IlEES3_S3_+0xc0b)[0x7f3b142accbb]
/opt/mapr/tools/python/anaconda3/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so(_ZN2at6native19_cudnn_rnn_backwardERKNS_6TensorEN3c108ArrayRefIS1_EElS3_S3_S3_S3_S3_S3_S3_lllbdbbNS5_IlEES3_S3_St5arrayIbLm4EE+0x2f6)[0x7f3b142b3106]
/opt/mapr/tools/python/anaconda3/lib/python3.7/site-packages/torch/lib/libcaffe2_gpu.so(_ZNK2at8CUDAType19_cudnn_rnn_backwardERKNS_6TensorEN3c108ArrayRefIS1_EElS3_S3_S3_S3_S3_S3_S3_lllbdbbNS5_IlEES3_S3_St5arrayIbLm4EE+0x178)[0x7f3b1438a7f8]
/opt/mapr/tools/python/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so.1(_ZNK5torch8autograd12VariableType19_cudnn_rnn_backwardERKN2at6TensorEN3c108ArrayRefIS3_EElS5_S5_S5_S5_S5_S5_S5_lllbdbbNS7_IlEES5_S5_St5arrayIbLm4EE+0x1101)[0x7f3b12421f81]
/opt/mapr/tools/python/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so.1(_ZN5torch8autograd9generated16CudnnRnnBackward5applyEOSt6vectorINS0_8VariableESaIS4_EE+0x6d4)[0x7f3b121c9f04]
/opt/mapr/tools/python/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so.1(+0x307622)[0x7f3b121ab622]
/opt/mapr/tools/python/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so.1(_ZN5torch8autograd6Engine17evaluate_functionERNS0_12FunctionTaskE+0x385)[0x7f3b121a4745]
/opt/mapr/tools/python/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so.1(_ZN5torch8autograd6Engine11thread_mainEPNS0_9GraphTaskE+0xc0)[0x7f3b121a6740]
/opt/mapr/tools/python/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so.1(_ZN5torch8autograd6Engine11thread_initEi+0x2b0)[0x7f3b121a39e0]
/opt/mapr/tools/python/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so(_ZN5torch8autograd6python12PythonEngine11thread_initEi+0x2a)[0x7f3b3efc228a]
/opt/mapr/tools/python/anaconda3/lib/libstdc++.so.6(+0xb8678)[0x7f3b57e88678]
/usr/lib64/libpthread.so.0(+0x7dd5)[0x7f3b592d5dd5]
/usr/lib64/libc.so.6(clone+0x6d)[0x7f3b58fff02d]
======= Memory map: ========
200000000-200200000 rw-s 00000000 00:05 21694 /dev/nvidiactl
200200000-200600000 ---p 00000000 00:00 0
200600000-200800000 rw-s 00000000 00:05 21694 /dev/nvidiactl
200800000-200a00000 rw-s 00000000 00:05 27538 /dev/nvidia0
200a00000-206200000 rw-s 00000000 00:05 21694 /dev/nvidiactl
206200000-206400000 rw-s 00000000 00:05 27538 /dev/nvidia0
206400000-207400000 ---p 00000000 00:00 0
207400000-207600000 rw-s 00000000 00:05 21694 /dev/nvidiactl
207600000-207800000 rw-s 00000000 00:05 21694 /dev/nvidiactl
207800000-207a00000 rw-s 207800000 00:05 19518 /dev/nvidia-uvm
207a00000-207c00000 ---p 00000000 00:00 0
207c00000-207e00000 rw-s 00000000 00:05 21694 /dev/nvidiactl
207e00000-208000000 rw-s 00000000 00:04 186466 /dev/zero (deleted)
208000000-300200000 ---p 00000000 00:00 0
10000000000-10004000000 ---p 00000000 00:00 0
7f390c000000-7f390c021000 rw-p 00000000 00:00 0
7f390c021000-7f3910000000 ---p 00000000 00:00 0
7f3914000000-7f39f8000000 ---p 00000000 00:00 0
7f39f8000000-7f39f8021000 rw-p 00000000 00:00 0
7f39f8021000-7f39fc000000 ---p 00000000 00:00 0
7f39fc000000-7f39fc107000 rw-p 00000000 00:00 0
7f39fc107000-7f3a00000000 ---p 00000000 00:00 0
7f3a00000000-7f3a30000000 ---p 00000000 00:00 0
7f3a30000000-7f3a30021000 rw-p 00000000 00:00 0
7f3a30021000-7f3a34000000 ---p 00000000 00:00 0
7f3a34000000-7f3a34021000 rw-p 00000000 00:00 0
7f3a34021000-7f3a38000000 ---p 00000000 00:00 0
7f3a38000000-7f3a38021000 rw-p 00000000 00:00 0
7f3a38021000-7f3a3c000000 ---p 00000000 00:00 0
7f3a3c000000-7f3a3c021000 rw-p 00000000 00:00 0
7f3a3c021000-7f3a40000000 ---p 00000000 00:00 0
7f3a40000000-7f3a40021000 rw-p 00000000 00:00 0
7f3a40021000-7f3a44000000 ---p 00000000 00:00 0
7f3a447f9000-7f3a447fa000 ---p 00000000 00:00 0
7f3a447fa000-7f3a44ffa000 rw-p 00000000 00:00 0 [stack:17157]
7f3a44ffa000-7f3a44ffb000 ---p 00000000 00:00 0
7f3a44ffb000-7f3a457fb000 rw-p 00000000 00:00 0 [stack:17156]
7f3a457fb000-7f3a457fc000 ---p 00000000 00:00 0
7f3a457fc000-7f3a45ffc000 rw-p 00000000 00:00 0 [stack:17113]
7f3a45ffc000-7f3a45ffd000 ---p 00000000 00:00 0
7f3a45ffd000-7f3a467fd000 rw-p 00000000 00:00 0 [stack:17112]
7f3a467fd000-7f3a467fe000 ---p 00000000 00:00 0
7f3a467fe000-7f3a46ffe000 rw-p 00000000 00:00 0 [stack:17111]
7f3a46ffe000-7f3a46fff000 ---p 00000000 00:00 0
7f3a46fff000-7f3a477ff000 rw-p 00000000 00:00 0 [stack:17110]
7f3a477ff000-7f3a47800000 ---p 00000000 00:00 0
7f3a47800000-7f3a48000000 rw-p 00000000 00:00 0 [stack:17109]
7f3a48000000-7f3a48021000 rw-p 00000000 00:00 0
7f3a48021000-7f3a4c000000 ---p 00000000 00:00 0
7f3a4c000000-7f3a4c021000 rw-p 00000000 00:00 0
7f3a4c021000-7f3a50000000 ---p 00000000 00:00 0
7f3a50000000-7f3a50021000 rw-p 00000000 00:00 0
7f3a50021000-7f3a54000000 ---p 00000000 00:00 0
7f3a54000000-7f3a54021000 rw-p 00000000 00:00 0
7f3a54021000-7f3a58000000 ---p 00000000 00:00 0
7f3a58000000-7f3a58021000 rw-p 00000000 00:00 0
7f3a58021000-7f3a5c000000 ---p 00000000 00:00 0
7f3a5c000000-7f3a5c021000 rw-p 00000000 00:00 0
7f3a5c021000-7f3a60000000 ---p 00000000 00:00 0
7f3a60000000-7f3a60021000 rw-p 00000000 00:00 0
7f3a60021000-7f3a64000000 ---p 00000000 00:00 0
7f3a64000000-7f3a64021000 rw-p 00000000 00:00 0
7f3a64021000-7f3a68000000 ---p 00000000 00:00 0
7f3a68000000-7f3a68021000 rw-p 00000000 00:00 0
7f3a68021000-7f3a6c000000 ---p 00000000 00:00 0
7f3a6c000000-7f3a6c021000 rw-p 00000000 00:00 0
7f3a6c021000-7f3a70000000 ---p 00000000 00:00 0
7f3a70000000-7f3a70021000 rw-p 00000000 00:00 0
7f3a70021000-7f3a74000000 ---p 00000000 00:00 0
7f3a74000000-7f3a74021000 rw-p 00000000 00:00 0
7f3a74021000-7f3a78000000 ---p 00000000 00:00 0
7f3a78000000-7f3a78021000 rw-p 00000000 00:00 0
7f3a78021000-7f3a7c000000 ---p 00000000 00:00 0
7f3a7c000000-7f3a7c021000 rw-p 00000000 00:00 0
7f3a7c021000-7f3a80000000 ---p 00000000 00:00 0
7f3a80000000-7f3a80021000 rw-p 00000000 00:00 0
7f3a80021000-7f3a84000000 ---p 00000000 00:00 0
7f3a847fd000-7f3a847fe000 ---p 00000000 00:00 0
7f3a847fe000-7f3a84ffe000 rw-p 00000000 00:00 0 [stack:17108]
7f3a84ffe000-7f3a84fff000 ---p 00000000 00:00 0
7f3a84fff000-7f3a857ff000 rw-p 00000000 00:00 0 [stack:17107]
7f3a857ff000-7f3a85800000 ---p 00000000 00:00 0
7f3a85800000-7f3a86000000 rw-p 00000000 00:00 0 [stack:17106]
7f3a86000000-7f3aa0000000 ---p 00000000 00:00 0
7f3aa03ca000-7f3aa040a000 rw-p 00000000 00:00 0
7f3aa040a000-7f3aa040b000 ---p 00000000 00:00 0
7f3aa040b000-7f3aa0c0b000 rw-p 00000000 00:00 0 [stack:17105]
7f3aa0c0b000-7f3aa0c0c000 ---p 00000000 00:00 0
7f3aa0c0c000-7f3aa3a06000 rw-p 00000000 00:00 0 [stack:17104]
7f3aa3ffc000-7f3aa3ffd000 ---p 00000000 00:00 0
7f3aa3ffd000-7f3aa47fd000 rw-p 00000000 00:00 0 [stack:17103]
7f3aa47fd000-7f3aa47fe000 ---p 00000000 00:00 0
7f3aa47fe000-7f3aa4ffe000 rw-p 00000000 00:00 0 [stack:17102]
7f3aa4ffe000-7f3aa4fff000 ---p 00000000 00:00 0
7f3aa4fff000-7f3aa57ff000 rw-p 00000000 00:00 0 [stack:17101]
7f3aa57ff000-7f3aa5800000 ---p 00000000 00:00 0
7f3aa5800000-7f3aa6000000 rw-p 00000000 00:00 0 [stack:17100]
7f3aa6000000-7f3ab1c00000 ---p 00000000 00:00 0
7f3ab1c00000-7f3ab1e00000 rw-s 00000000 00:04 189555 /dev/zero (deleted)
7f3ab1e00000-7f3ab9000000 ---p 00000000 00:00 0
7f3ab9000000-7f3ab9200000 rw-s 00000000 00:04 186469 /dev/zero (deleted)
7f3ab9200000-7f3ace400000 ---p 00000000 00:00 0
7f3ace400000-7f3ace600000 rw-s 00000000 00:04 186464 /dev/zero (Aborted

Did you get a solution?