PyTorch 1.6 - CPU Training free(): invalid next size (normal) error

Hi All,

I have a weird bug occurring. On torch 1.5.1, training my network on the CPU works perfectly fine. However, after upgrading to torch 1.6.0, CPU training fails. Below is the gdb stack trace:

free(): invalid next size (normal)

Thread 1 "python" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff7805801 in __GI_abort () at abort.c:79
#2  0x00007ffff784e897 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff797bb9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007ffff785590a in malloc_printerr (str=str@entry=0x7ffff797d8b8 "free(): invalid next size (normal)") at malloc.c:5350
#4  0x00007ffff785d0ad in _int_free (have_lock=0, p=0x555586de8db0, av=0x7ffff7bb0c40 <main_arena>) at malloc.c:4286
#5  __GI___libc_free (mem=0x555586de8dc0) at malloc.c:3124
#6  0x00007fff33ac9203 in _ZNSt17_Function_handlerIFvPvEUlS0_E0_E9_M_invokeERKSt9_Any_dataOS0_ () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#7  0x00007fff33ab7333 in std::_Sp_counted_deleter<void*, std::function<void (void*)>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
   from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#8  0x00007fff33ad18da in at::native::IntrusivePtrTargetWrapper<ideep::tensor>::~IntrusivePtrTargetWrapper() () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#9  0x00007fff3347a1b9 in c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_() () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#10 0x00007fff33abfd6b in at::native::mkldnn_convolution_backward_weights(c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#11 0x00007fff33cc263e in at::TypeDefault::mkldnn_convolution_backward_weights(c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#12 0x00007fff33cf0266 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<at::Tensor, at::Tensor> (*)(c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool), std::tuple<at::Tensor, at::Tensor>, c10::guts::typelist::typelist<c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool> >, std::tuple<at::Tensor, at::Tensor> (c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool)>::call(c10::OperatorKernel*, c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#13 0x00007fff33c2aeaf in at::mkldnn_convolution_backward_weights(c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool)
    () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#14 0x00007fff33abf8fb in at::native::mkldnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul>) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#15 0x00007fff33cc271a in at::TypeDefault::mkldnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul>) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#16 0x00007fff33cf02bc in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<at::Tensor, at::Tensor, at::Tensor> (*)(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul>), std::tuple<at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul> > >, std::tuple<at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul>)>::call(c10::OperatorKernel*, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul>) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#17 0x00007fff33c2e4bb in at::mkldnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul>) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#18 0x00007fff3587c632 in torch::autograd::VariableType::mkldnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul>) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#19 0x00007fff33cf02bc in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<at::Tensor, at::Tensor, at::Tensor> (*)(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul>), std::tuple<at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul> > >, std::tuple<at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul>)>::call(c10::OperatorKernel*, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul>) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#20 0x00007fff33c2e4bb in at::mkldnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul>) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#21 0x00007fff35782f93 in torch::autograd::generated::MkldnnConvolutionBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) ()
   from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#22 0x00007fff35da1017 in torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#23 0x00007fff35d9c860 in torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#24 0x00007fff35d9d401 in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#25 0x00007fff35d9ab1c in torch::autograd::Engine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>) ()
   from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#26 0x00007fff39cbadcc in torch::autograd::python::PythonEngine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>) ()
   from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#27 0x00007fff35d99e53 in torch::autograd::Engine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#28 0x00007fff39cbabbe in torch::autograd::python::PythonEngine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#29 0x00007fff39cbb889 in THPEngine_run_backward(THPEngine*, _object*, _object*) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#30 0x00005555556b87e6 in _PyMethodDef_RawFastCallKeywords () at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:694
#31 0x00005555556b8861 in _PyCFunction_FastCallKeywords (func=0x7ffe2c19b2d0, args=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>)
    at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:734
#32 0x00005555557247cc in call_function (kwnames=0x7fffedb0ead0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:4568
#33 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3139
#34 0x0000555555668539 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3930
#35 0x00005555556b7ef5 in _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:433
#36 0x0000555555723f29 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:4616
#37 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3093
#38 0x0000555555668539 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3930
#39 0x00005555556b7ef5 in _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:433
#40 0x000055555571fa93 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:4616
#41 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3110
#42 0x00005555556b7ccb in function_code_fastcall (globals=<optimized out>, nargs=1, args=<optimized out>, co=<optimized out>) at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:283
#43 _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:408
#44 0x000055555571fa93 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:4616
#45 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3110
#46 0x00005555556b7ccb in function_code_fastcall (globals=<optimized out>, nargs=3, args=<optimized out>, co=<optimized out>) at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:283
#47 _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:408
#48 0x000055555571f806 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:4616
#49 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3124
#50 0x0000555555668539 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3930
#51 0x00005555556b7ef5 in _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:433
#52 0x000055555571f806 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:4616
#53 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3124
#54 0x0000555555668539 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3930
#55 0x0000555555669424 in PyEval_EvalCodeEx () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3959
#56 0x000055555566944c in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:524
#57 0x000055555577eb74 in run_mod () at /tmp/build/80754af9/python_1565725737370/work/Python/pythonrun.c:1035
#58 0x0000555555788eb1 in PyRun_FileExFlags () at /tmp/build/80754af9/python_1565725737370/work/Python/pythonrun.c:988
#59 0x00005555557890a3 in PyRun_SimpleFileExFlags () at /tmp/build/80754af9/python_1565725737370/work/Python/pythonrun.c:429
#60 0x000055555578a195 in pymain_run_file (p_cf=0x7fffffffe110, filename=0x5555558c0900 L"train.py", fp=0x5555559090e0) at /tmp/build/80754af9/python_1565725737370/work/Modules/main.c:433
#61 pymain_run_filename (cf=0x7fffffffe110, pymain=0x7fffffffe220) at /tmp/build/80754af9/python_1565725737370/work/Modules/main.c:1612
#62 pymain_run_python (pymain=0x7fffffffe220) at /tmp/build/80754af9/python_1565725737370/work/Modules/main.c:2873
#63 pymain_main () at /tmp/build/80754af9/python_1565725737370/work/Modules/main.c:3413
#64 0x000055555578a2bc in _Py_UnixMain () at /tmp/build/80754af9/python_1565725737370/work/Modules/main.c:3448
#65 0x00007ffff77e6b97 in __libc_start_main (main=0x5555556496c0 <main>, argc=6, argv=0x7fffffffe378, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe368)
    at ../csu/libc-start.c:310
#66 0x000055555572f062 in _start () at ../sysdeps/x86_64/elf/start.S:103

Running my training script in another gdb session yields a corrupted double-linked list error, though the stack trace points to the same spot:

corrupted double-linked list

Thread 1 "python" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff7805801 in __GI_abort () at abort.c:79
#2  0x00007ffff784e897 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff797bb9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007ffff785590a in malloc_printerr (str=str@entry=0x7ffff7979cba "corrupted double-linked list") at malloc.c:5350
#4  0x00007ffff7855ac4 in malloc_consolidate (av=av@entry=0x7ffff7bb0c40 <main_arena>) at malloc.c:4456
#5  0x00007ffff785d03b in _int_free (have_lock=0, p=<optimized out>, av=0x7ffff7bb0c40 <main_arena>) at malloc.c:4362
#6  __GI___libc_free (mem=0x5555c0ecf480) at malloc.c:3124
#7  0x00007fff33ac9203 in _ZNSt17_Function_handlerIFvPvEUlS0_E0_E9_M_invokeERKSt9_Any_dataOS0_ () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#8  0x00007fff33ab7333 in std::_Sp_counted_deleter<void*, std::function<void (void*)>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
   from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#9  0x00007fff33ab81fa in ideep::tensor::~tensor() () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#10 0x00007fff33ac4045 in void ideep::convolution_backward_weights::compute_impl<true>(ideep::tensor const&, ideep::tensor const&, std::vector<long, std::allocator<long> > const&, ideep::tensor&, ideep::tensor&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, int, dnnl::memory::data_type, dnnl::algorithm, ideep::engine const&) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#11 0x00007fff33abfc6f in at::native::mkldnn_convolution_backward_weights(c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#12 0x00007fff33cc263e in at::TypeDefault::mkldnn_convolution_backward_weights(c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#13 0x00007fff33cf0266 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<at::Tensor, at::Tensor> (*)(c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool), std::tuple<at::Tensor, at::Tensor>, c10::guts::typelist::typelist<c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool> >, std::tuple<at::Tensor, at::Tensor> (c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool)>::call(c10::OperatorKernel*, c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#14 0x00007fff33c2aeaf in at::mkldnn_convolution_backward_weights(c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool)
    () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#15 0x00007fff33abf8fb in at::native::mkldnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul>) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#16 0x00007fff33cc271a in at::TypeDefault::mkldnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul>) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#17 0x00007fff33cf02bc in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<at::Tensor, at::Tensor, at::Tensor> (*)(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul>), std::tuple<at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul> > >, std::tuple<at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul>)>::call(c10::OperatorKernel*, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul>) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#18 0x00007fff33c2e4bb in at::mkldnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul>) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#19 0x00007fff3587c632 in torch::autograd::VariableType::mkldnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul>) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#20 0x00007fff33cf02bc in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<at::Tensor, at::Tensor, at::Tensor> (*)(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul>), std::tuple<at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul> > >, std::tuple<at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul>)>::call(c10::OperatorKernel*, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul>) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#21 0x00007fff33c2e4bb in at::mkldnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, std::array<bool, 3ul>) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#22 0x00007fff35782f93 in torch::autograd::generated::MkldnnConvolutionBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) ()
   from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#23 0x00007fff35da1017 in torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#24 0x00007fff35d9c860 in torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#25 0x00007fff35d9d401 in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#26 0x00007fff35d9ab1c in torch::autograd::Engine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>) ()
   from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#27 0x00007fff39cbadcc in torch::autograd::python::PythonEngine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>) ()
   from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#28 0x00007fff35d99e53 in torch::autograd::Engine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#29 0x00007fff39cbabbe in torch::autograd::python::PythonEngine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#30 0x00007fff39cbb889 in THPEngine_run_backward(THPEngine*, _object*, _object*) () from /home/kevin/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#31 0x00005555556b87e6 in _PyMethodDef_RawFastCallKeywords () at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:694
#32 0x00005555556b8861 in _PyCFunction_FastCallKeywords (func=0x7ffe2c1827d0, args=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>)
    at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:734
#33 0x00005555557247cc in call_function (kwnames=0x7fffeb61cbd0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:4568
#34 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3139
#35 0x0000555555668539 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3930
#36 0x00005555556b7ef5 in _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:433
#37 0x0000555555723f29 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:4616
#38 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3093
#39 0x0000555555668539 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3930
#40 0x00005555556b7ef5 in _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:433
#41 0x000055555571fa93 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:4616
#42 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3110
#43 0x00005555556b7ccb in function_code_fastcall (globals=<optimized out>, nargs=1, args=<optimized out>, co=<optimized out>) at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:283
#44 _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:408
#45 0x000055555571fa93 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:4616
#46 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3110
#47 0x00005555556b7ccb in function_code_fastcall (globals=<optimized out>, nargs=3, args=<optimized out>, co=<optimized out>) at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:283
#48 _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:408
#49 0x000055555571f806 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:4616
#50 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3124
#51 0x0000555555668539 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3930
#52 0x00005555556b7ef5 in _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:433
#53 0x000055555571f806 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:4616
#54 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3124
#55 0x0000555555668539 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3930
#56 0x0000555555669424 in PyEval_EvalCodeEx () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3959
#57 0x000055555566944c in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:524
#58 0x000055555577eb74 in run_mod () at /tmp/build/80754af9/python_1565725737370/work/Python/pythonrun.c:1035
#59 0x0000555555788eb1 in PyRun_FileExFlags () at /tmp/build/80754af9/python_1565725737370/work/Python/pythonrun.c:988
#60 0x00005555557890a3 in PyRun_SimpleFileExFlags () at /tmp/build/80754af9/python_1565725737370/work/Python/pythonrun.c:429
#61 0x000055555578a195 in pymain_run_file (p_cf=0x7fffffffe110, filename=0x5555558c0900 L"train.py", fp=0x5555559090e0) at /tmp/build/80754af9/python_1565725737370/work/Modules/main.c:433
#62 pymain_run_filename (cf=0x7fffffffe110, pymain=0x7fffffffe220) at /tmp/build/80754af9/python_1565725737370/work/Modules/main.c:1612
#63 pymain_run_python (pymain=0x7fffffffe220) at /tmp/build/80754af9/python_1565725737370/work/Modules/main.c:2873
#64 pymain_main () at /tmp/build/80754af9/python_1565725737370/work/Modules/main.c:3413
#65 0x000055555578a2bc in _Py_UnixMain () at /tmp/build/80754af9/python_1565725737370/work/Modules/main.c:3448
#66 0x00007ffff77e6b97 in __libc_start_main (main=0x5555556496c0 <main>, argc=6, argv=0x7fffffffe378, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe368)
    at ../csu/libc-start.c:310
#67 0x000055555572f062 in _start () at ../sysdeps/x86_64/elf/start.S:103

Let me know if there is any other information I can provide that would be helpful, thanks!

Which CPU are you using and how did you install PyTorch?
Did you use the binaries/wheels or did you build from source?

Also, could you post a minimal code snippet to reproduce this issue, please?

Hi Peter,

I installed PyTorch via pip; here is the output of collect_env:

PyTorch version: 1.6.0
Is debug build: No
CUDA used to build PyTorch: 10.2

OS: Ubuntu 18.04.4 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce RTX 2080 Ti
Nvidia driver version: 440.100
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] msgpack-numpy==0.4.4.3
[pip] numpy==1.18.4
[pip] numpydoc==0.9.1
[pip] pytorch-memlab==0.1.0
[pip] torch==1.6.0
[pip] torchvision==0.7.0
[conda] _pytorch_select           0.1                       cpu_0  
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               10.2.89              hfd86e86_1  
[conda] mkl                       2019.4                      243  
[conda] mkl-service               2.3.0            py37he904b0f_0  
[conda] mkl_fft                   1.0.15           py37ha843d7b_0  
[conda] mkl_random                1.1.0            py37hd6b4f25_0  
[conda] msgpack-numpy             0.4.4.3                    py_0  
[conda] numpy                     1.18.4                   pypi_0    pypi
[conda] numpy-base                1.18.1           py37hde5b4d6_1  
[conda] numpydoc                  0.9.1                      py_0  
[conda] pytorch-memlab            0.1.0                    pypi_0    pypi
[conda] torch                     1.6.0                    pypi_0    pypi
[conda] torchvision               0.7.0                    pypi_0    pypi

The output of torch.__config__.show() is below; I am using an Intel® Core™ i7-9800X CPU @ 3.80GHz:

PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2019.0.5 Product Build 20190808 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.5.0 (Git Hash e2ac1fac44c5078ca927cb9b90e1b3066a0b2ed0)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
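
For reference, the information above was collected with PyTorch's built-in utilities, roughly like this (a sketch, not my exact session):

# The environment report above comes from the collect_env utility
# (run from a shell as: python -m torch.utils.collect_env).
import torch

print(torch.__version__)                     # 1.6.0
print(torch.__config__.show())               # the "PyTorch built with:" block above
print(torch.backends.mkldnn.is_available())  # expected True here, since MKL-DNN kernels are being hit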

Here is a code snippet you can use to reproduce the error:

import torch
import torch.nn as nn
import torch.nn.functional as F


class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.embedding_size = 16
        self.filter_num = 512
        self.padding_length = 25
        self.convolutions = nn.ModuleList([nn.Conv1d(1, self.filter_num // 8, kernel_size=(K, self.embedding_size), stride=1) for K in range(1, 9)])

    def forward(self):
        X = torch.randn([300, 1, self.padding_length, self.embedding_size])
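        # The convolutions below dispatch to the MKL-DNN (oneDNN) backend; the
        # crash happens in their backward pass (see the gdb traces above).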
        X = [torch.tanh(convolution(X).squeeze(3)) for convolution in self.convolutions]
        X = [F.max_pool1d(x, x.size(2)).squeeze(2) for x in X]
        X = torch.cat(X, dim=1)
        return X

if __name__ == "__main__":
    model = Model()
    output = model()
    output.mean().backward()

In the above code snippet, I have figured out that if I swap out the convolution in forward() (i.e. replace the 1D convolution with just a random tensor of the same shape), the example runs without any segmentation fault. This is further supported by the gdb stack traces above, which show the error occurring during the convolution backward pass. Additionally, if I turn off the MKLDNN backend via torch.backends.mkldnn.enabled = False, the example executes without error (a minimal sketch of this workaround is below).
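
A minimal sketch of that workaround applied to the snippet above (assuming the Model class defined earlier is in scope):

import torch

# Workaround: disable the MKL-DNN (oneDNN) backend so the convolution
# falls back to the default CPU kernels. With this flag set, the repro
# above runs to completion on my machine.
torch.backends.mkldnn.enabled = False

model = Model()            # Model as defined in the snippet above
output = model()
output.mean().backward()   # no longer aborts in mkldnn_convolution_backward_weights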

EDIT: When self.embedding_size is 14 or higher, the example produces a segfault; any lower value runs successfully.
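
To pin that threshold down, I swept embedding sizes, each in a separate process so one crash would not kill the whole sweep (a hypothetical helper, not part of my original test script; it inlines the same construction as the Model above):

import subprocess
import sys

# Run the repro for a given embedding size in a child process and report
# whether it exited cleanly; a SIGABRT/SIGSEGV shows up as a negative returncode.
PROG = """
import torch, torch.nn as nn, torch.nn.functional as F
EMB = {emb}
convs = nn.ModuleList([nn.Conv1d(1, 64, kernel_size=(K, EMB)) for K in range(1, 9)])
X = torch.randn([300, 1, 25, EMB])
X = [torch.tanh(c(X).squeeze(3)) for c in convs]
X = [F.max_pool1d(x, x.size(2)).squeeze(2) for x in X]
torch.cat(X, dim=1).mean().backward()
"""

for emb in range(8, 20):
    ret = subprocess.run([sys.executable, "-c", PROG.format(emb=emb)]).returncode
    print(emb, "ok" if ret == 0 else f"crashed (returncode {ret})")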

When running the sample with MKLDNN_VERBOSE=1, this is the output:

dnnl_verbose,info,oneDNN v1.5.0 (commit e2ac1fac44c5078ca927cb9b90e1b3066a0b2ed0)
dnnl_verbose,info,cpu,runtime:OpenMP
dnnl_verbose,info,cpu,isa:Intel AVX-512 with AVX512BW, AVX512VL, and AVX512DQ extensions
dnnl_verbose,info,gpu,runtime:none
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:Acdb16a:f0,,,64x1x1x16,0.000976562
dnnl_verbose,exec,cpu,convolution,jit:avx512_common,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:Acdb16a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:aBcd16b:f0,scratchpad_mode:user;,alg:convolution_direct,mb6400_ic1oc64_ih25oh25kh1sh1dh0ph0_iw16ow1kw16sw1dw0pw0,13.3792
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:aBcd16b:f0 dst_f32::blocked:abcd:f0,,,6400x64x25x1,6.97119
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:Acdb16a:f0,,,64x1x2x16,0.0100098
dnnl_verbose,exec,cpu,convolution,jit:avx512_common,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:Acdb16a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:aBcd16b:f0,scratchpad_mode:user;,alg:convolution_direct,mb6400_ic1oc64_ih25oh24kh2sh1dh0ph0_iw16ow1kw16sw1dw0pw0,7.95996
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:aBcd16b:f0 dst_f32::blocked:abcd:f0,,,6400x64x24x1,3.48096
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:Acdb16a:f0,,,64x1x3x16,0.0090332
dnnl_verbose,exec,cpu,convolution,jit:avx512_common,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:Acdb16a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:aBcd16b:f0,scratchpad_mode:user;,alg:convolution_direct,mb6400_ic1oc64_ih25oh23kh3sh1dh0ph0_iw16ow1kw16sw1dw0pw0,6.08813
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:aBcd16b:f0 dst_f32::blocked:abcd:f0,,,6400x64x23x1,2.54492
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:Acdb16a:f0,,,64x1x4x16,0.0090332
dnnl_verbose,exec,cpu,convolution,jit:avx512_common,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:Acdb16a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:aBcd16b:f0,scratchpad_mode:user;,alg:convolution_direct,mb6400_ic1oc64_ih25oh22kh4sh1dh0ph0_iw16ow1kw16sw1dw0pw0,5.96704
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:aBcd16b:f0 dst_f32::blocked:abcd:f0,,,6400x64x22x1,2.3728
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:Acdb16a:f0,,,64x1x5x16,0.0100098
dnnl_verbose,exec,cpu,convolution,jit:avx512_common,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:Acdb16a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:aBcd16b:f0,scratchpad_mode:user;,alg:convolution_direct,mb6400_ic1oc64_ih25oh21kh5sh1dh0ph0_iw16ow1kw16sw1dw0pw0,6.96484
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:aBcd16b:f0 dst_f32::blocked:abcd:f0,,,6400x64x21x1,2.271
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:Acdb16a:f0,,,64x1x6x16,0.0100098
dnnl_verbose,exec,cpu,convolution,jit:avx512_common,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:Acdb16a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:aBcd16b:f0,scratchpad_mode:user;,alg:convolution_direct,mb6400_ic1oc64_ih25oh20kh6sh1dh0ph0_iw16ow1kw16sw1dw0pw0,7.69409
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:aBcd16b:f0 dst_f32::blocked:abcd:f0,,,6400x64x20x1,2.16504
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:Acdb16a:f0,,,64x1x7x16,0.0100098
dnnl_verbose,exec,cpu,convolution,jit:avx512_common,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:Acdb16a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:aBcd16b:f0,scratchpad_mode:user;,alg:convolution_direct,mb6400_ic1oc64_ih25oh19kh7sh1dh0ph0_iw16ow1kw16sw1dw0pw0,8.2019
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:aBcd16b:f0 dst_f32::blocked:abcd:f0,,,6400x64x19x1,2.01904
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:Acdb16a:f0,,,64x1x8x16,0.00585938
dnnl_verbose,exec,cpu,convolution,jit:avx512_common,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:Acdb16a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:aBcd16b:f0,scratchpad_mode:user;,alg:convolution_direct,mb6400_ic1oc64_ih25oh18kh8sh1dh0ph0_iw16ow1kw16sw1dw0pw0,7.66797
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:aBcd16b:f0 dst_f32::blocked:abcd:f0,,,6400x64x18x1,1.54517
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:aBcd16b:f0,,,6400x64x18x1,1.90308
dnnl_verbose,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abcd:f0 dst_f32:p:blocked:ABcd16a16b:f0,,,64x1x8x16,0.00805664
dnnl_verbose,exec,cpu,convolution,jit:avx512_common,backward_data,src_f32:p:blocked:aBcd16b:f0 wei_f32:p:blocked:ABcd16a16b:f0 bia_undef::undef::f0 dst_f32::blocked:aBcd16b:f0,,alg:convolution_direct,mb6400_ic1oc64_ih25oh18kh8sh1dh0ph0_iw16ow1kw16sw1dw0pw0,64.8398
dnnl_verbose,exec,cpu,reorder,simple:any,undef,src_f32:p:blocked:aBcd16b:f0 dst_f32::blocked:abcd:f0,,,6400x1x25x16,3.02197
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:aBcd16b:f0,,,6400x64x18x1,1.54883
dnnl_verbose,exec,cpu,convolution,jit:avx512_common,backward_weights,src_f32::blocked:abcd:f0 wei_f32::blocked:Acdb16a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:aBcd16b:f0,,alg:convolution_direct,mb6400_ic1oc64_ih25oh18kh8sh1dh0ph0_iw16ow1kw16sw1dw0pw0,2.87183
Segmentation fault (core dumped)

Thanks for the code snippet.
I cannot reproduce the issue using the 1.6.0 conda binaries:

Collecting environment information...
PyTorch version: 1.6.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.3 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2
Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.2.86
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: TITAN V

Nvidia driver version: 450.51.06
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] numpydoc==1.1.0
[pip3] torch==1.6.0
[pip3] torchvision==0.7.0
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               10.2.89              hfd86e86_1
[conda] mkl                       2020.2                      256
[conda] mkl-service               2.3.0            py37he904b0f_0
[conda] mkl_fft                   1.1.0            py37h23d657b_0
[conda] mkl_random                1.1.1            py37h0573a6f_0
[conda] numpy                     1.19.1           py37hbc911f0_0
[conda] numpy-base                1.19.1           py37hfa32c7d_0
[conda] numpydoc                  1.1.0                      py_0
[conda] pytorch                   1.6.0           py3.7_cuda10.2.89_cudnn7.6.5_0    pytorch
[conda] torchvision               0.7.0                py37_cu102    pytorch

and using an Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz.

However, based on the env outputs, it seems I’m using a newer MKL version than you are.
Could you try to install the conda binaries (I’m unsure if the pip wheels ship with an older mkl version) and rerun the code?
If you are still facing this error, could you please create an issue on GitHub so that we could track and fix it?

I tried using the conda binaries and am still able to reproduce the issue.

Have opened a GitHub issue here: https://github.com/pytorch/pytorch/issues/45746

Thanks for all the help!