I’m encountering a problem while running a fine-tuning script for my Flan-T5(-large) model. The script crashes with `Segmentation fault (core dumped)` after anywhere from a few to a few thousand steps. The crash seems to occur at random points: sometimes within a few minutes, sometimes only after around 40 minutes.
I had previously run the same script on my laptop with a smaller model and dataset. I then needed to scale both up, so I moved to a more powerful machine that was available to me, and that is where the segmentation faults started.
Here are the details of my new setup:
- Ubuntu 22.04
- CUDA 11.7
- PyTorch 1.13.1 (installed via `pip3 install torch torchvision torchaudio`)
- I followed the instructions in this post (20.04 - Is there a way to know what really caused a specific segfault? - Ask Ubuntu) and enabled XMP, but it didn’t solve the problem.
- 2x RTX 3090 (I tried both a single GPU and both GPUs with data parallelism; see the quick environment check after this list)
- Around 200 GB of RAM
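For completeness, here is a quick sanity check of what PyTorch actually sees on this machine (the expected values in the comments follow from the setup above):

```python
import torch

print(torch.__version__)          # expected: 1.13.1
print(torch.version.cuda)         # CUDA version this PyTorch build was compiled against
print(torch.cuda.is_available())  # expected: True
print(torch.cuda.device_count())  # expected: 2 for the dual RTX 3090 setup
for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_name(i))
```

Note that the pip wheels ship their own CUDA runtime, so `torch.version.cuda` can differ from the system-wide CUDA 11.7.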
I monitored VRAM, RAM, and CPU usage throughout training, and none of them came close to their limits.
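For reference, this is roughly the kind of in-process monitoring I mean (a sketch; `psutil` is an extra dependency, and `log_usage` is just an illustrative name):

```python
import psutil
import torch

def log_usage(tag: str) -> None:
    """Print a one-line snapshot of RAM, VRAM, and CPU usage."""
    rss_gib = psutil.Process().memory_info().rss / 2**30  # resident RAM of this process
    vram_gib = torch.cuda.memory_allocated(0) / 2**30     # VRAM PyTorch has allocated on GPU 0
    cpu_pct = psutil.cpu_percent(interval=None)           # CPU utilisation since the last call
    print(f"[{tag}] RAM {rss_gib:.1f} GiB | VRAM {vram_gib:.1f} GiB | CPU {cpu_pct:.0f}%")

# e.g. call log_usage(f"step {step}") every few hundred steps
```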
Here are the imports used in my fine-tuning script:
```python
import glob
import json
import logging
import os
import random
import re
from collections import OrderedDict
from pathlib import Path

import numpy as np
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, get_linear_schedule_with_warmup
from transformers.optimization import Adafactor
from tensorboardX import SummaryWriter
from rouge import Rouge
```
I also tried to get more information using GDB, as suggested in this post (How to debug a Python segmentation fault? - Stack Overflow), but I was unable to extract any useful debugging information.
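One thing I plan to try next is the standard-library `faulthandler` module (also mentioned in that Stack Overflow thread), which should at least dump the Python traceback when the segfault happens. A minimal sketch of how I would wire it in (`run_training()` is a placeholder for my actual loop):

```python
import faulthandler
import sys

# Dump the Python traceback of all threads when the process
# receives a fatal signal such as SIGSEGV.
faulthandler.enable(file=sys.stderr, all_threads=True)

# The same effect without modifying the script:
#   python -X faulthandler finetune.py

def run_training():
    ...  # placeholder for the actual fine-tuning loop

if __name__ == "__main__":
    run_training()
```

Even if the fault originates in native code (CUDA kernels, tokenizers, a DataLoader worker), this should at least narrow down which Python line was executing when the crash occurred.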
I’m currently at a loss for what to try next. Please let me know if you need any more information. Any help would be greatly appreciated. Thank you in advance.