Segmentation fault (core dumped) during Torch finetuning on new setup

Hello,

I’m encountering a problem while running a finetuning script for my FlanT5(-large) model. The script crashes with the message Segmentation fault (core dumped) after a few (thousand) steps. The crash seems to occur at random points: sometimes within a few minutes, sometimes only after around 40 minutes.

Previously, I had run the same script on my laptop with a smaller model and dataset. However, I needed to increase the size of both, so I switched to a more powerful computer that was available to me. Unfortunately, I’m now hitting this segmentation fault on the new setup.

Here are the details of my new setup:

I monitored the usage of VRAM, RAM, and CPU, but none of them exceeded their memory limits.
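For reference, a simplified sketch of the kind of memory check I ran alongside training (it assumes psutil is installed; the exact logging is simplified here):

import psutil
import torch

def log_memory(device="cuda"):
    # Current system RAM and CPU usage reported by the OS.
    ram = psutil.virtual_memory()
    line = f"RAM used: {ram.used / 1024**3:.2f} GiB ({ram.percent}%) | CPU: {psutil.cpu_percent()}%"
    if torch.cuda.is_available():
        # Peak GPU memory allocated by PyTorch since the last reset.
        line += f" | peak VRAM: {torch.cuda.max_memory_allocated(device) / 1024**3:.2f} GiB"
        torch.cuda.reset_peak_memory_stats(device)
    print(line)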

Here are the imports used in my finetune script:

import glob
import json
import logging
import os
import random
import re
from collections import OrderedDict
from pathlib import Path

import numpy as np
import torch
import torch.nn as nn
from rouge import Rouge
from tensorboardX import SummaryWriter
from torch.optim import AdamW
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, get_linear_schedule_with_warmup
from transformers.optimization import Adafactor

I also tried to get more information using GDB, as suggested in this post (How to debug a Python segmentation fault? - Stack Overflow), but I was unable to extract any useful debug information.
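One more thing I can try is Python’s built-in faulthandler module, which dumps the Python-level stack of every thread when a SIGSEGV arrives; a minimal sketch of how I would enable it at the top of the script (the crash.log file name is just an example):

import faulthandler

# Install handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL and write
# the Python tracebacks of all threads to a log file when one of them fires.
crash_log = open("crash.log", "w")
faulthandler.enable(file=crash_log, all_threads=True)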

I’m currently at a loss for what to try next. Please let me know if you need any more information. Any help would be greatly appreciated. Thank you in advance.

It’s strange to hear that gdb wasn’t able to give you the stack trace after running into a segfault.
Could you post the output of the segfault and what the backtrace returns (even if it fails to return the full backtrace)?

Thank you for your message, @ptrblck. The following backtrace is returned:

loss=3.63][New Thread 0x7fff7a1ff640 (LWP 21242)]
 14%|██████████████▋                                                                                            | 1766/12886 [07:08<45:33,  4.07it/s, epoch=1, loss=3.17]
Thread 79 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff0e1d3640 (LWP 21240)]
0x00007fffab4f98f0 in c10::intrusive_ptr<c10::detail::ListImpl, c10::detail::intrusive_target_default_null_type<c10::detail::ListImpl> >::reset_() () from /home/ruben/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so

Thanks for the output!
I can already see the failing thread pointing to reset_(), so I’m unsure why you cannot get any useful information out of gdb. Calling bt will print the backtrace, i.e. the chain of calls to this function that caused the segfault.

Calling bt results in the following output:

#0  0x00007fffab4f98f0 in c10::intrusive_ptr<c10::detail::ListImpl, c10::detail::intrusive_target_default_null_type<c10::detail::ListImpl> >::reset_() ()
   from /home/ruben/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#1  0x0000002000004002 in ?? ()
#2  0x00007fff0e1d1e10 in ?? ()
#3  0x0000000000000000 in ?? ()

Looks like something is going wrong…? Thanks again!

If you require any further information regarding the training loop, please do not hesitate to let me know, @ptrblck.

No, I don’t think more information from the stack trace is needed, as it doesn’t seem to show any more details. If you can update to the latest nightly release, you could check whether this error is still visible there, as it could be related to this one, which doesn’t fail anymore.
If it’s still failing, a minimal and executable code snippet would be needed to debug the issue further.

Thanks @ptrblck ,

After upgrading PyTorch to the latest nightly release with CUDA 11.7, the training loop still fails when re-run, but now returns a more extensive stack trace.

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
__GI__dl_find_object (pc1=0x7ffff6706058 <_Unwind_RaiseException+72>, result=0x7fffffffb8c8) at ./elf/dl-find_object.c:442
442     ./elf/dl-find_object.c: No such file or directory.
(gdb) bt
#0  __GI__dl_find_object (pc1=0x7ffff6706058 <_Unwind_RaiseException+72>, result=0x7fffffffb8c8) at ./elf/dl-find_object.c:442
#1  0x00007ffff67080f6 in _Unwind_Find_FDE () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#2  0x00007ffff6704833 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#3  0x00007ffff6705ad0 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#4  0x00007ffff6706059 in _Unwind_RaiseException () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#5  0x00007fffd88ae50b in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007fffd88a538a in std::__throw_logic_error(char const*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007fffd88f66b1 in char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag) ()
   from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007fffd88f6b14 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) ()
   from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007fffbde17b03 in torch::Library::_parseNameForLib(char const*) const () from /home/ruben/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#10 0x00007fffbde1df85 in torch::Library::_impl(char const*, torch::CppFunction&&, torch::_RegisterOrVerify) & ()
   from /home/ruben/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
--Type <RET> for more, q to quit, c to continue without paging--
#11 0x00007fffbdb6c3e1 in at::native::TORCH_LIBRARY_IMPL_init_aten_Conjugate_3(torch::Library&) ()
   from /home/ruben/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#12 0x0000555614ecd120 in ?? ()
#13 0x0000000000000000 in ?? ()

Some snippets from my training loop:

        epoch += 1

        ### Training
        model.train()

        with torch.enable_grad(), tqdm(total=num_train) as progress_bar:
            for batch_num, batch in enumerate(train_loader):
                real_batch_size = len(batch["source_ids"])

                loss, logits = forward(model, device, batch)

                # Scale the loss so the accumulated gradients match a full batch
                loss = loss / k_gradient_accumulation_steps
                loss = loss.mean()

                loss.backward()
                loss_val = loss.item() * k_gradient_accumulation_steps  # get the item since loss is a tensor

                # Optimizer step every k_gradient_accumulation_steps batches
                # (and on the last batch of the epoch)
                if ((batch_num + 1) % k_gradient_accumulation_steps == 0) or (batch_num + 1 == len(train_loader)):
                    nn.utils.clip_grad_norm_(model.parameters(), k_max_grad_norm)
                    optimizer.step()
                    scheduler.step()

                    optimizer.zero_grad()
                    (...)

With the following forward function:

def forward(model, device, batch):
    src_ids = batch["source_ids"].to(device, dtype=torch.long)
    src_mask = batch["source_mask"].to(device, dtype=torch.long)
    tgt_ids = batch["target_ids"].to(device, dtype=torch.long)

    # Replace padding token ids (0) with -100 so they are ignored by the loss
    tgt_ids[tgt_ids[:, :] == 0] = -100
    label_ids = tgt_ids.to(device)

    out_dict = model(src_ids, attention_mask=src_mask, labels=label_ids, return_dict=True)
    loss, logits = out_dict['loss'], out_dict['logits']
    return loss, logits
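For context on the -100 above: to my understanding this is the ignore index that nn.CrossEntropyLoss (and the seq2seq models from transformers, internally) uses by default, so padded target positions simply don’t contribute to the loss. A tiny standalone illustration with toy shapes, unrelated to my real data:

import torch
import torch.nn as nn

# Toy logits: batch of 1, sequence length 3, vocabulary size 5.
logits = torch.randn(1, 3, 5)
labels = torch.tensor([[2, 4, 0]])

# Mask the padding position (token id 0), mirroring the replacement in forward().
labels[labels == 0] = -100

# ignore_index=-100 is the default, so the masked position is skipped by the loss.
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
print(loss_fn(logits.view(-1, 5), labels.view(-1)))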

Is the information provided sufficient, or should I create a basic example application? Thank you in advance.
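In case a basic example application would help, this is roughly the shape it would take; everything below (the small checkpoint, random data, batch sizes and learning rate) is a hypothetical stand-in for my actual setup:

import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, get_linear_schedule_with_warmup

# Hypothetical, scaled-down settings; the real run uses flan-t5-large and my own dataset.
model_name = "google/flan-t5-small"
k_gradient_accumulation_steps = 4
k_max_grad_norm = 1.0
num_steps = 1000

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

# Random token ids as stand-in data.
vocab_size = len(tokenizer)
src_ids = torch.randint(1, vocab_size, (256, 64))
src_mask = torch.ones_like(src_ids)
tgt_ids = torch.randint(1, vocab_size, (256, 16))
loader = DataLoader(TensorDataset(src_ids, src_mask, tgt_ids), batch_size=8, shuffle=True)

optimizer = AdamW(model.parameters(), lr=1e-4)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=num_steps)

model.train()
step = 0
while step < num_steps:
    for batch_num, (ids, mask, labels) in enumerate(loader):
        ids, mask, labels = ids.to(device), mask.to(device), labels.to(device)
        labels[labels == tokenizer.pad_token_id] = -100  # mask padding, as in forward()

        out = model(ids, attention_mask=mask, labels=labels, return_dict=True)
        loss = out["loss"] / k_gradient_accumulation_steps
        loss.backward()

        # Optimizer step every k_gradient_accumulation_steps batches.
        if (batch_num + 1) % k_gradient_accumulation_steps == 0:
            nn.utils.clip_grad_norm_(model.parameters(), k_max_grad_norm)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

        step += 1
        if step >= num_steps:
            break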

I cannot explain why or at which point the code is failing, as it seems a std::string construction fails in:

#8  0x00007fffd88f6b14 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) ()
   from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007fffbde17b03 in torch::Library::_parseNameForLib(char const*) const () from /home/ruben/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so

Since you can reproduce the issue in the latest nightly, I would recommend creating a new issue on GitHub so that we can track it and try to reproduce it.

After rechecking the hardware to rule out potential issues, I made some adjustments. Despite these tweaks, however, a (new) error occurred:

   from /home/ruben/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
(gdb) bt
#0  0x00007fffa81d8902 in at::TensorIteratorBase::compute_types(at::TensorIteratorConfig const&) ()
   from /home/ruben/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#1  0x00007fffa81da8ff in at::TensorIteratorBase::build(at::TensorIteratorConfig&) ()
   from /home/ruben/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#2  0x00007fffa81dbee2 in at::TensorIteratorBase::build_borrowing_binary_op(at::TensorBase const&, at::TensorBase const&, at::TensorBase const&) ()
   from /home/ruben/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#3  0x00007fff833a417d in at::(anonymous namespace)::wrapper_CUDA_mul_Tensor(at::Tensor const&, at::Tensor const&) ()
   from /home/ruben/.local/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so
#4  0x00007fff833a4240 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&), &at::(anonymous namespace)::wrapper_CUDA_mul_Tensor>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) ()
--Type <RET> for more, q to quit, c to continue without paging--
o
#5  0x00007fffa8ccef4e in at::_ops::mul_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) () from /home/ruben/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#6  0x00007fffaa81e15d in torch::autograd::VariableType::(anonymous namespace)::mul_Tensor(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) ()
   from /home/ruben/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#7  0x00007fffaa81ebe3 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::mul_Tensor>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) ()
   from /home/ruben/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#8  0x00007fffa8d2c581 in at::_ops::mul_Tensor::call(at::Tensor const&, at::Tensor const&) () from /home/ruben/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#9  0x00007fffabc191d4 in at::Tensor torch::autograd::generated::details::mul_tensor_backward<at::Tensor>(at::Tensor, at::Tensor, c10::ScalarType) ()
   from /home/ruben/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#10 0x00007fffaa72bb04 in torch::autograd::generated::MulBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) ()
   from /home/ruben/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#11 0x00007fffab2c020b in torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) () from /home/ruben/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#12 0x00007fffab2b961d in torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) () from /home/ruben/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#13 0x00007fffab2ba990 in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) () from /home/ruben/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#14 0x00007fffab2b16cb in torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) ()
   from /home/ruben/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#15 0x00007fffbfee065f in torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) ()
   from /home/ruben/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so
#16 0x00007fffc1d81de4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#17 0x00007ffff7da5609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#18 0x00007ffff7edf133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Is it possible to use this information to pinpoint the source of the issue?
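In the meantime, one thing I can still try is autograd’s anomaly detection, although I’m not sure it will catch a hard segfault in the C++ backward; a minimal sketch of the flag I would enable before the training loop:

import torch

# Record the forward-pass origin of every backward node; if the backward pass
# raises a Python-level error, the traceback of the offending forward op is printed.
torch.autograd.set_detect_anomaly(True)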

I don’t know which adjustments were made that caused the change in the stack trace, but I would still recommend creating a GitHub issue so that we can try to reproduce and debug it.