Hello everyone!
I’m stuck on my remote server, trying to train the Hugging Face EleutherAI/gpt-j-6B model.
Minimal code example (no training, just loading).
Command:
python -m torch.distributed.launch --nproc_per_node=8 trial.py
Minimal runnable code, trial.py:
from transformers import AutoModelForCausalLM
import torch
import argparse

parser = argparse.ArgumentParser(description='Minimal distributed model-loading test.')
parser.add_argument('--local_rank', type=int, default=-1, help='rank of this process, set by the launcher')
args = parser.parse_args()

torch.distributed.init_process_group(backend='nccl')
# device = torch.device("cuda", args.local_rank)
# torch.cuda.set_device(device)
model = AutoModelForCausalLM.from_pretrained('EleutherAI/gpt-neo-1.3B')
# print(f"local rank {args.local_rank} model loaded")
# model = model.to(device)
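To see how much CPU RAM each rank actually uses, I also added a peak-RSS printout after the load (stdlib only, Linux/macOS; this is just my own instrumentation, not part of the repro):

```python
import resource
import sys

def peak_rss_gb():
    """Peak resident set size of this process, in GB."""
    raw = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux, but in bytes on macOS.
    if sys.platform == "darwin":
        return raw / 1024 ** 3
    return raw / 1024 ** 2

print(f"peak RSS so far: {peak_rss_gb():.2f} GB")
```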
My remote server specs:
CPU -> Intel(R) Xeon(R) CPU E5-2695 v4, 72 logical CPUs.
I have 252 GB of memory according to /proc/meminfo (which I don’t understand; the vendor information claims it has 1.5 TB of system memory!):
grep MemFree /proc/meminfo
MemFree: 260498324 kB
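To rule out a parsing mistake on my side, here is the small helper I use to read /proc/meminfo from Python (plain stdlib, Linux only; the field names are the standard kernel ones):

```python
import os

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: size in GB}."""
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0]) / 1024 ** 2  # kB -> GB
    return info

if os.path.exists("/proc/meminfo"):  # Linux only
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    print(f"MemTotal: {mem['MemTotal']:.1f} GB, MemFree: {mem['MemFree']:.1f} GB")
```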
GPU -> Titan Xp, 8 GPUs (12 GB memory each).
I know I can’t fit the model (22.9 GB) on a single GPU, but I’m planning to use DeepSpeed.
The problem is that I can’t load the model even on CPU.
I can load 5 models, but not 8 (python -m torch.distributed.launch --nproc_per_node=5 trial.py works). Why??
I should have enough CPU memory to hold more than 10 copies of the model.
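My rough math, assuming (and this is just my guess) that from_pretrained temporarily holds both the loaded state dict and the instantiated model in RAM, i.e. roughly 2x the checkpoint size per process:

```python
# Back-of-envelope estimate; the 2x peak factor is my assumption, not measured.
checkpoint_gb = 22.9   # checkpoint size mentioned above
peak_factor = 2.0      # state dict + instantiated model in RAM at once (assumed)
total_ram_gb = 252.0   # what /proc/meminfo reports

for n_procs in (5, 8):
    peak = n_procs * checkpoint_gb * peak_factor
    verdict = "fits" if peak < total_ram_gb else "does not fit"
    print(f"{n_procs} processes -> ~{peak:.0f} GB peak ({verdict})")
```

If that peak factor is anywhere near right, it would explain why 5 processes work and 8 don’t, but I’d appreciate confirmation.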
Desperately waiting for help ;-(