Hello,
I’m encountering a problem while running a fine-tuning script for my FlanT5(-large) model. The script crashes with the message Segmentation fault (core dumped)
after a few thousand steps. The crash seems to happen at random points: sometimes within a few minutes, sometimes only after around 40 minutes.
Previously, I had run the same script on my laptop with a smaller model and dataset. I then needed to increase the size of both, so I moved to a more powerful computer that was available to me. Unfortunately, the segmentation fault started on this new setup.
Here are the details of my new setup:
- Ubuntu 22.04
- CUDA 11.7
- PyTorch 1.13.1, installed via pip3 install torch torchvision torchaudio
- I followed the instructions in this post (20.04 - Is there a way to know what really caused a specific segfault? - Ask Ubuntu) and enabled XMP, but it didn’t solve the problem.
- 2x RTX 3090 (I tried both a single GPU and both GPUs with data parallelism)
- Around 200GB of RAM
I monitored VRAM, RAM, and CPU usage, and none of them ever came close to its limit.
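For reference, the kind of in-process check I used looks roughly like this (log_memory is an illustrative helper name, not verbatim from my script; the resource module is Unix-only, which is fine on Ubuntu):

```python
import resource

try:
    import torch  # only used for the GPU-side numbers, if available
except ImportError:
    torch = None

def log_memory(step):
    """Return a one-line summary of peak RAM and current VRAM usage."""
    # On Linux, ru_maxrss is the peak resident set size in kilobytes.
    peak_rss_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    line = f"step {step}: peak RSS {peak_rss_mb:.0f} MiB"
    if torch is not None and torch.cuda.is_available():
        for gpu in range(torch.cuda.device_count()):
            # Bytes currently held by tensors on this GPU.
            allocated_mb = torch.cuda.memory_allocated(gpu) / 1024**2
            line += f", cuda:{gpu} allocated {allocated_mb:.0f} MiB"
    return line

print(log_memory(0))
```

Calling this every few hundred steps from the training loop is how I confirmed neither RSS nor allocated VRAM was growing toward a limit before the crash.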
Here are the imports used in my fine-tuning script:
import torch
import torch.nn as nn
from tqdm import tqdm
from pathlib import Path
from collections import OrderedDict
from torch.utils.data import DataLoader, Dataset
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from tensorboardX import SummaryWriter
from transformers.optimization import Adafactor
from rouge import Rouge
import re
import os
import glob
import random
import logging
import numpy as np
import json
I also tried to get more information using GDB, as suggested in this post (How to debug a Python segmentation fault? - Stack Overflow), but was unable to extract any useful debug information.
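Another thing on my list is the stdlib faulthandler module, which at least prints the Python-level traceback when the interpreter receives a fatal signal such as SIGSEGV (a minimal sketch; the script name below is a placeholder):

```python
import sys
import faulthandler

# Print the Python traceback of every thread to stderr if the
# interpreter receives SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL.
faulthandler.enable(file=sys.stderr, all_threads=True)
```

The same effect is available without editing the script via python3 -X faulthandler <script>. This only shows which Python frame triggered the crash; the native backtrace still needs GDB.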
I’m currently at a loss for what to try next. Please let me know if you need any more information. Any help would be greatly appreciated. Thank you in advance.