Segmentation fault when using the deepspeech_pytorch model
Hi! I'm using the deepspeech.pytorch model (GitHub - SeanNaren/deepspeech.pytorch: Speech Recognition using DeepSpeech2) for some speech recognition tasks, and a segmentation fault occurs during the training phase.
The training-phase code:
```python
adversary_model = AdversaryModel( ... )
adv_trainset = WavDataset(adv_trainset_wav_path_list, model.labels, adv_train_spect_parser)
adv_trainset_dataloader = AudioAdvTrainDataLoader(dataset=adv_trainset, batch_size=FLAGS.batch_size, pin_memory=True, shuffle=True)

model.train()
# My task is adversarial-example style, so the only parameter I train is
# "delta", which is declared separately as an nn.Parameter() -- not the
# parameters inside the neural network itself.
opt = torch.optim.Adam([adversary_model.delta], lr=FLAGS.adv_lr)
adversary_model.to(FLAGS.device)
adversary_model.model.to(FLAGS.device)

for ep in range(FLAGS.adv_train_epochs):
    print(f'\nTrain epoch = { ep }\n', flush=True)
    adversary_model.train()
    for (batch_idx, batch) in enumerate(adv_trainset_dataloader):
        opt.zero_grad()
        # A lot of code to feed the data batch into the model and compute the loss ...
        loss *= adversary_model.factor_loss
        loss.backward()
        opt.step()
```
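For context, here is a minimal sketch of what my `AdversaryModel` looks like (simplified; `wav_len` and the constructor signature are placeholders, my real class does more): `delta` is the only `nn.Parameter`, and the pretrained network's own parameters are frozen.

```python
import torch
import torch.nn as nn

class AdversaryModel(nn.Module):
    """Minimal sketch: wraps a pretrained model and trains only `delta`."""
    def __init__(self, model, wav_len, factor_loss=1.0):
        super().__init__()
        self.model = model
        self.factor_loss = factor_loss
        # The additive adversarial perturbation -- the only trainable tensor.
        self.delta = nn.Parameter(torch.zeros(wav_len))
        # Freeze the pretrained network itself.
        for p in self.model.parameters():
            p.requires_grad_(False)

    def forward(self, wav):
        # Perturb the input, then run the (frozen) recognizer.
        return self.model(wav + self.delta)
```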
And the segmentation fault looks like this:
(venv) [hxt@cosec-workstation:~/ASR_adversarial_examples_new]$ python main.py
Seed set to 42
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/hxt/ASR_adversarial_examples_new/deepspeech.pytorch/deepspeech_pytorch/loader/data_loader.py:92: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
y = torch.cuda.FloatTensor(y)
/home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/functional.py:650: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:863.)
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
Fatal Python error: Segmentation fault
Current thread 0x00007fd5af1c73c0 (most recent call first):
File "/home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 456 in _conv_forward
File "/home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 460 in forward
File "/home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl
File "/home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
File "/home/hxt/ASR_adversarial_examples_new/deepspeech.pytorch/deepspeech_pytorch/model.py", line 60 in forward
File "/home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl
File "/home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
File "/home/hxt/ASR_adversarial_examples_new/deepspeech.pytorch/deepspeech_pytorch/model.py", line 217 in forward
File "/home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl
File "/home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
File "/home/hxt/ASR_adversarial_examples_new/pl_models.py", line 71 in forward
File "/home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl
File "/home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
File "main.py", line 559 in main
File "/home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/absl/app.py", line 254 in _run_main
File "/home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/absl/app.py", line 308 in run
File "main.py", line 652 in <module>
In the deepspeech_pytorch code, the convolution layers are initialized like this:
```python
self.conv = MaskConv(nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(20, 5)),
    nn.BatchNorm2d(32),
    nn.Hardtanh(0, 20, inplace=True),
    nn.Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5)),
    nn.BatchNorm2d(32),
    nn.Hardtanh(0, 20, inplace=True)
))
```
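As a sanity check (my own test, not from the repo), the same layer stack runs fine on CPU with a dummy spectrogram batch, so the shapes themselves don't seem to be the problem:

```python
import torch
import torch.nn as nn

# Same layer stack as above, without the MaskConv wrapper, on CPU.
conv = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(20, 5)),
    nn.BatchNorm2d(32),
    nn.Hardtanh(0, 20, inplace=True),
    nn.Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5)),
    nn.BatchNorm2d(32),
    nn.Hardtanh(0, 20, inplace=True),
)
x = torch.randn(2, 1, 161, 100)  # dummy BxCxDxT spectrogram batch
y = conv(x)
print(y.shape)  # torch.Size([2, 32, 41, 50])
```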
And the error seems to occur during the forward pass (lines 53-69 of deepspeech_pytorch/model.py):
```python
def forward(self, x, lengths):  # forward function of MaskConv(nn.Module)
    """
    :param x: The input of size BxCxDxT
    :param lengths: The actual length of each sequence in the batch
    :return: Masked output from the module
    """
    for module in self.seq_module:
        x = module(x)  # Line 60: the segmentation fault occurs here
        mask = torch.BoolTensor(x.size()).fill_(0)
        if x.is_cuda:
            mask = mask.cuda()
        for i, length in enumerate(lengths):
            length = length.item()
            if (mask[i].size(2) - length) > 0:
                mask[i].narrow(2, length, mask[i].size(2) - length).fill_(1)
        x = x.masked_fill(mask, 0)
    return x, lengths
```
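For reference, the masking loop just zeroes every position past a sequence's true length along the time axis. A toy version of the same logic (my own illustration, with a tiny tensor instead of a conv output):

```python
import torch

# Toy version of the masking in MaskConv.forward: batch of 2, T = 5.
x = torch.ones(2, 1, 1, 5)        # BxCxDxT
lengths = torch.tensor([3, 5])    # true length of each sequence
mask = torch.zeros(x.size(), dtype=torch.bool)
for i, length in enumerate(lengths):
    length = length.item()
    if mask[i].size(2) - length > 0:
        # Mark the padded tail (dim 2 of mask[i] is the time axis here).
        mask[i].narrow(2, length, mask[i].size(2) - length).fill_(True)
x = x.masked_fill(mask, 0)
print(x[0, 0, 0])  # tensor([1., 1., 1., 0., 0.])
```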
I've tried debugging it with gdb, and got this backtrace:
(gdb) backtrace
#0 0x00007fffb7ac76d0 in () at /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1 0x00007fffb773b6dc in () at /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007fffb791e3be in () at /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007ffd513de57b in () at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#4 0x00007ffd51439d96 in () at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#5 0x00007ffd4fcdbd40 in cudnn::cnn::precompute_indices(cudnnContext*, cudnnTensorStruct const*, cudnnFilterStruct const*, cudnnConvolutionStruct const*, void*, unsigned long) ()
at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#6 0x00007ffd4f98232e in cudnn::cnn::PrecomputedGemmEngine<cudnn::cnn::convolve_launch_pg_pf<float, float, float, float>, 5, 2, 2>::execute_intermediate(cudnnContext*, cudnn::cnn::ExecutionIntermediate&) const () at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#7 0x00007ffd4f98501f in cudnn::cnn::PrecomputedGemmEngine<cudnn::cnn::convolve_launch_pg_pf<float, float, float, float>, 5, 2, 2>::execute_internal_impl(cudnn::backend::VariantPack const&, CUstream_st*) () at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#8 0x00007ffd4f550f25 in cudnn::cnn::EngineInterface::execute(cudnn::backend::VariantPack const&, CUstream_st*) ()
at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#9 0x00007ffd4f5833b0 in cudnn::cnn::EngineContainer<(cudnnBackendEngineName_t)1>::execute_internal_impl(cudnn::backend::VariantPack const&, CUstream_st*) ()
at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#10 0x00007ffd4f550f25 in cudnn::cnn::EngineInterface::execute(cudnn::backend::VariantPack const&, CUstream_st*) ()
at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#11 0x00007ffd4f6b806e in cudnn::cnn::AutoTransformationExecutor::execute_pipeline(cudnn::cnn::EngineInterface&, cudnn::backend::VariantPack const&, CUstream_st*) const ()
at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#12 0x00007ffd4f6b81b7 in cudnn::cnn::BatchPartitionExecutor::operator()(cudnn::cnn::EngineInterface&, cudnn::cnn::EngineInterface*, cudnn::backend::VariantPack const&, CUstream_st*) const ()
at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#13 0x00007ffd4f6d5c9b in cudnn::cnn::GeneralizedConvolutionEngine<cudnn::cnn::EngineContainer<(cudnnBackendEngineName_t)1> >::execute_internal_impl(cudnn::backend::VariantPack const&, CUstream_st*) () at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#14 0x00007ffd4f550f25 in cudnn::cnn::EngineInterface::execute(cudnn::backend::VariantPack const&, CUstream_st*) ()
at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#15 0x00007ffd4f563b3c in cudnn::backend::execute(cudnnContext*, cudnn::backend::ExecutionPlan&, cudnn::backend::VariantPack&) ()
at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#16 0x00007ffd4f563f3d in cudnnBackendExecute () at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
#17 0x00007fff74ad65dd in at::native::run_conv_plan(cudnnContext*, at::Tensor const&, at::Tensor const&, at::Tensor const&, cudnn_frontend::ExecutionPlan_v8 const&) ()
at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so
#18 0x00007fff74adef20 in at::native::try_configs(std::vector<std::shared_ptr<cudnn_frontend::OpaqueBackendPointer>, std::allocator<std::shared_ptr<cudnn_frontend::OpaqueBackendPointer> > >&, std::string const&, at::native::(anonymous namespace)::CacheKeyWrapper const&, cudnnContext*, at::Tensor const&, at::Tensor const&, at::Tensor const&) ()
at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so
#19 0x00007fff74adf5e7 in at::native::run_single_conv(cudnnBackendDescriptorType_t, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) () at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so
#20 0x00007fff74ae03ab in at::native::raw_cudnn_convolution_forward_out(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) () at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so
#21 0x00007fff74ac4bbb in at::native::cudnn_convolution_forward(char const*, at::TensorArg const&, at::TensorArg const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) () at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so
#22 0x00007fff74ac5166 in at::native::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) ()
at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so
#23 0x00007fff769cd823 in at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) () at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so
#24 0x00007fff769cd8e0 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__cudnn_convolution>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool> >, at::Tensor (at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) () at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so
#25 0x00007fffa1cc7190 in at::_ops::cudnn_convolution::call(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) ()
at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#26 0x00007fffa0fb90d5 in at::native::_convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long, bool, bool, bool, bool) () at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#27 0x00007fffa2071876 in at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd___convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, bool, c10::ArrayRef<c10::SymInt>, long, bool, bool, bool, bool) ()
at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#28 0x00007fffa20718f7 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, bool, c10::ArrayRef<c10::SymInt>, long, bool, bool, bool, bool), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd___convolution>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, bool, c10::ArrayRef<c10::SymInt>, long, bool, bool, bool, bool> >, at::Tensor (at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, bool, c10::ArrayRef<c10::SymInt>, long, bool, bool, bool, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, bool, c10::ArrayRef<c10::SymInt>, long, bool, bool, bool, bool) () at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#29 0x00007fffa18168db in at::_ops::_convolution::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, bool, c10::ArrayRef<c10::SymInt>, long, bool, bool, bool, bool) () at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#30 0x00007fffa0fad5ad in at::native::convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long) () at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#31 0x00007fffa2071455 in at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, bool, c10::ArrayRef<c10::SymInt>, long) ()
at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#32 0x00007fffa20714bf in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, bool, c10::ArrayRef<c10::SymInt>, long), &at::(anonymous namespace)::(anonymous namespace)
--Type <RET> for more, q to quit, c to continue without paging--
::wrapper_CompositeExplicitAutograd__convolution>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, bool, c10::ArrayRef<c10::SymInt>, long> >, at::Tensor (at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, bool, c10::ArrayRef<c10::SymInt>, long)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, bool, c10::ArrayRef<c10::SymInt>, long) ()
at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#33 0x00007fffa17dd0cf in at::_ops::convolution::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, bool, c10::ArrayRef<c10::SymInt>, long) () at /home/hxt/ASR_adversarial_examples_new/venv/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#34 0x00007fffa35642c3 in torch::autograd::VariableType::(anonymous namespace)::convolution(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, bool, c10::ArrayRef<c10::SymInt>, long) ()
...
Though I have the backtrace above, I can't figure out why this error occurs or how to fix it.
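Since the backtrace dies inside libcudnn_cnn_infer.so.8, one thing I plan to try is forcing PyTorch off the cuDNN convolution path before training. If the crash then disappears, that would point at a cuDNN/driver incompatibility rather than my own code (this is only a diagnostic, not a fix):

```python
import torch

# Diagnostic only: fall back to PyTorch's native CUDA convolution kernels
# instead of cuDNN, to see whether the segfault is cuDNN-specific.
torch.backends.cudnn.enabled = False

# Separately, setting the environment variable CUDA_LAUNCH_BLOCKING=1
# before launching Python makes CUDA errors surface at the failing call.
```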
My environment information:
- OS: Ubuntu 18.04
- Python: 3.8.13 (venv)
- torch: 2.1.2
- torchaudio: 2.1.2
- pytorch-lightning: 2.1.3
- CUDA: 12.1
- cuDNN: 8.9.7
Could anyone help me solve this error? Thank you so much!