Hello everyone!
I’m stuck on my remote server, trying to train the Hugging Face EleutherAI/gpt-j-6B model.
Minimal code example (no training, just loading).
Command:
python -m torch.distributed.launch --nproc_per_node=8 trial.py
Minimal runnable code, trial.py:
from transformers import AutoModelForCausalLM
import torch
import argparse

parser = argparse.ArgumentParser(description='Minimal distributed model-loading test.')
parser.add_argument('--local_rank', type=int, default=-1, help='rank of this process, set by the launcher')
args = parser.parse_args()

torch.distributed.init_process_group(backend='nccl')
# device = torch.device("cuda", args.local_rank)
# torch.cuda.set_device(device)
model = AutoModelForCausalLM.from_pretrained('EleutherAI/gpt-neo-1.3B')
# print(f"local rank {args.local_rank} model loaded")
# model = model.to(device)
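To see how much CPU RAM each rank actually uses, I also added a peak-RSS printout after the load (stdlib only, Linux/macOS; this is just my own instrumentation, not part of the repro):

```python
import resource
import sys

def peak_rss_gb():
    """Peak resident set size of this process, in GB."""
    raw = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux, but in bytes on macOS.
    if sys.platform == "darwin":
        return raw / 1024 ** 3
    return raw / 1024 ** 2

print(f"peak RSS so far: {peak_rss_gb():.2f} GB")
```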
My remote server specs:
CPU -> Intel(R) Xeon(R) CPU E5-2695 v4, 72 logical CPUs.
I have 252 GB of memory according to /proc/meminfo (which I don’t understand; the vendor information claims it has 1.5 TB of system memory!):
grep MemFree /proc/meminfo
MemFree: 260498324 kB
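To rule out a parsing mistake on my side, here is the small helper I use to read /proc/meminfo from Python (plain stdlib, Linux only; the field names are the standard kernel ones):

```python
import os

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: size in GB}."""
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0]) / 1024 ** 2  # kB -> GB
    return info

if os.path.exists("/proc/meminfo"):  # Linux only
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    print(f"MemTotal: {mem['MemTotal']:.1f} GB, MemFree: {mem['MemFree']:.1f} GB")
```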
GPU -> Titan Xp, 8 GPUs (12 GB memory each).
I know I can’t fit the model (22.9 GB) on a single GPU, but I’m planning to use DeepSpeed.
The problem is that I can’t load the model even on CPU.
I can load 5 models, but not 8 (python -m torch.distributed.launch --nproc_per_node=5 trial.py works). Why??
I should have enough CPU memory to hold more than 10 copies of the model.
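My rough math, assuming (and this is just my guess) that from_pretrained temporarily holds both the loaded state dict and the instantiated model in RAM, i.e. roughly 2x the checkpoint size per process:

```python
# Back-of-envelope estimate; the 2x peak factor is my assumption, not measured.
checkpoint_gb = 22.9   # checkpoint size mentioned above
peak_factor = 2.0      # state dict + instantiated model in RAM at once (assumed)
total_ram_gb = 252.0   # what /proc/meminfo reports

for n_procs in (5, 8):
    peak = n_procs * checkpoint_gb * peak_factor
    verdict = "fits" if peak < total_ram_gb else "does not fit"
    print(f"{n_procs} processes -> ~{peak:.0f} GB peak ({verdict})")
```

If that peak factor is anywhere near right, it would explain why 5 processes work and 8 don’t, but I’d appreciate confirmation.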
Desperately waiting for help ;-(