(Out of CPU memory?) Error when I try to train the 6B GPT-J model with 8 Titan XP GPUs

Hello everyone :smiley:
I'm stuck on my remote server, trying to train the Hugging Face EleutherAI/gpt-j-6B model.

Minimal code example (no training, just loading):

command:
python -m torch.distributed.launch --nproc_per_node=8 trial.py

Minimal runnable code, trial.py:

import argparse

import torch
from transformers import AutoModelForCausalLM

# torch.distributed.launch passes --local_rank to every spawned process
parser = argparse.ArgumentParser(description='Minimal distributed model-loading example.')
parser.add_argument('--local_rank', type=int, default=-1, help='the rank of this process')
args = parser.parse_args()

torch.distributed.init_process_group(backend='nccl')
# device = torch.device("cuda", args.local_rank)
# torch.cuda.set_device(device)

# every process loads its own full copy of the checkpoint into host RAM
model = AutoModelForCausalLM.from_pretrained('EleutherAI/gpt-j-6B')
# print(f"local rank {args.local_rank} model loaded")
# model = model.to(device)

My remote server specs:
CPU -> Intel(R) Xeon(R) CPU E5-2695 v4, 72 logical CPUs.
I have 252 GB of memory according to /proc/meminfo (which I don't understand; the vendor information claims the machine has 1.5 TB of system memory…!)

grep MemFree /proc/meminfo
MemFree:        260498324 kB

GPU -> Titan XP, 8 GPUs [12 GB of memory each].

I know I can't fit the model (22.9 GB) on a single GPU, but I'm planning to use DeepSpeed.
The problem is, I can't even load the model in CPU memory.
I can load 5 copies of the model, but not 8 (python -m torch.distributed.launch --nproc_per_node=5 trial.py works). Why?
I should have enough CPU memory to hold more than 10 copies of the model.
Desperately waiting for help ;-(

Each process will try to load the model into host RAM, and with ~10 * 22.9 GB you might be too close to the maximum available memory, since the OS, Python, etc. also need to live in RAM. You can certainly observe how much RAM is needed and check whether you are indeed running out of host RAM.
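One way to observe it (a minimal sketch; psutil is not used anywhere in this thread, so treat it as an extra, optional dependency) is to print the resident memory of each process and the system-wide available memory right before and after the checkpoint is loaded:

import os

import psutil
from transformers import AutoModelForCausalLM

proc = psutil.Process(os.getpid())

def report(tag):
    # resident set size of this process and available memory of the whole machine, in GiB
    rss = proc.memory_info().rss / 1024**3
    avail = psutil.virtual_memory().available / 1024**3
    print(f"{tag}: {rss:.1f} GiB resident, {avail:.1f} GiB available system-wide")

report("before load")
model = AutoModelForCausalLM.from_pretrained('EleutherAI/gpt-j-6B')
report("after load")

If the system-wide available memory heads toward zero while the 8 workers are loading, that is a strong hint that host RAM, not the GPUs, is the bottleneck.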

You should definitely check where the rest of the 1.5 TB of host RAM went, and you could also check the meta device, which might avoid the OOM issue, as described in this blog post.
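As a rough sketch of the meta-device idea (this is not taken from the linked blog post; it uses Accelerate's init_empty_weights, which builds the model with meta tensors so that no real weights are materialized in host RAM):

from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained('EleutherAI/gpt-j-6B')

# Parameters created inside this context live on the meta device: they have
# shapes and dtypes but no storage, so the model skeleton costs almost no host RAM.
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

print(next(model.parameters()).device)  # meta

The real weights still have to be loaded and placed somewhere afterwards (for example sharded across the GPUs by DeepSpeed or Accelerate); a meta model cannot be used for computation by itself.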


Hello, thank you very much for the answer! I was thinking that the child processes would be executed on different CPUs by load balancing or something.
I tried the top command on Linux, but %MEM never reached a dangerous level; the processes just ended.
How can I observe whether I am running out of host RAM? If you don't mind, could you suggest some ways? Thank you again!

Yes, different CPU cores can execute different processes, but I don’t understand how this is related to the OOM.

In that case, try to narrow down why the process just ended and at which part of your script that happened.
Running it with gdb might be a good way to check if and where it crashes.
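A lighter-weight alternative to gdb (a sketch, not part of the original suggestion) is Python's built-in faulthandler, which prints the Python traceback if the process dies from a fatal signal:

import faulthandler
faulthandler.enable()  # dump the traceback on SIGSEGV, SIGABRT, SIGBUS, SIGFPE, SIGILL

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('EleutherAI/gpt-j-6B')

Note that neither gdb nor faulthandler can report anything if the kernel OOM killer terminates the process, because it sends SIGKILL; in that case the kernel log (dmesg) will contain an "Out of memory: Killed process ..." line.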


Yes, different CPU cores can execute different processes, but I don’t understand how this is related to the OOM.

I found my mistake. MemFree in /proc/meminfo shows the free memory of the whole system, right? Then using multiple cores will not change anything; you're right!
At first, I thought MemFree in /proc/meminfo was per core: if each CPU core had 252 GB (if not 1.5 TB) of free RAM, and if all processes were executed on different cores, then each model would be loaded by a different core and there would be no reason to hit OOM, since a single model is only 22.9 GB… Such a wrong idea! :joy:

I still can't understand why the system memory is less than the spec on the Intel website… MemTotal is much lower than 1.5 TB, too.
cat /proc/meminfo

MemTotal:       264021392 kB
MemFree:        260542184 kB
MemAvailable:   260915024 kB
Buffers:          110824 kB
Cached:          1608480 kB
SwapCached:        27380 kB
Active:          1592252 kB
Inactive:         330768 kB
Active(anon):     169248 kB
Inactive(anon):    30812 kB
Active(file):    1423004 kB
Inactive(file):   299956 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       8388604 kB
SwapFree:        8241244 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:        197492 kB
Mapped:           114064 kB
Shmem:               472 kB
Slab:             807944 kB
SReclaimable:     290396 kB
SUnreclaim:       517548 kB
KernelStack:       17104 kB
PageTables:        11620 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    140399300 kB
Committed_AS:    2538324 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:     2792812 kB
DirectMap2M:    99852288 kB
DirectMap1G:    167772160 kB

lscpu

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              72
On-line CPU(s) list: 0-71
Thread(s) per core:  2
Core(s) per socket:  18
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
Stepping:            1
CPU MHz:             1202.173
CPU max MHz:         3300.0000
CPU min MHz:         1200.0000
BogoMIPS:            4199.86
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            46080K
NUMA node0 CPU(s):   0-17,36-53
NUMA node1 CPU(s):   18-35,54-71

And thank you! I'm asking question after question, but you have helped me a lot!! I will debug the processes with gdb as you recommended and find where they crash (although I'm a little worried about multi-process debugging; the pdb library was not useful for that…),
figure out why the system memory is 252 GB instead of 1.5 TB,
and try meta tensors if I need them (they look like DeepSpeed's LayerSpec! The idea is really cool).
Will update the results here :smiley:

You should be able to run sudo dmidecode -t 17 to see information about your RAM slots and how they are populated, to figure out how much memory is actually built into the system.

I used sudo dmidecode -t 17 and found out that I only had 4 32 GB RAM modules installed (so 256 GB was indeed the actual system memory).

I gave up on training the 6B model on 8 Titan XP GPUs, because 96 GB (8 x 12 GB) is exactly the size of the model states, so there was no room left for activations (see the rough arithmetic below).
After changing to the 1.3B model, I didn't suffer any CPU or GPU OOM.
Thank you for showing me how to see the actual memory usage instead of relying on the manufacturer spec. I'll definitely try gdb when I get stuck again in distributed training!
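A rough back-of-the-envelope version of that arithmetic (assuming the usual mixed-precision Adam accounting of about 16 bytes of model state per parameter, which is my assumption here, not something stated earlier in the thread):

# model states for mixed-precision Adam, roughly 16 bytes per parameter:
#   fp16 weights (2) + fp16 grads (2) + fp32 master weights (4)
#   + fp32 Adam momentum (4) + fp32 Adam variance (4)
params = 6e9                          # GPT-J-6B
model_states_gb = params * 16 / 1e9   # = 96 GB
total_gpu_gb = 8 * 12                 # 8 Titan XP GPUs with 12 GB each = 96 GB

print(f"model states:     {model_states_gb:.0f} GB")
print(f"total GPU memory: {total_gpu_gb} GB")
# -> the model states alone already fill all 8 GPUs, leaving almost nothing
#    for activations, temporary buffers, and fragmentation.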
