from transformers import AutoModelForCausalLM
import torch
import argparse
parser = argparse.ArgumentParser(description='Distributed training arguments')
parser.add_argument('--local_rank', type=int, default=-1, help='the rank of this process')
args = parser.parse_args()
torch.distributed.init_process_group(backend='nccl')
# device = torch.device("cuda", args.local_rank)
# torch.cuda.set_device(device)
model = AutoModelForCausalLM.from_pretrained('EleutherAI/gpt-neo-1.3B')
# print(f"local rank {args.local_rank} model loaded")
# model = model.to(device)
My remote server spec:
CPU-> Intel(R) Xeon(R) CPU E5-2695 v4, 72 logical CPUs.
I have 252 GB of memory according to /proc/meminfo (which I don't understand — the vendor information claims it has 1.5 TB of system memory!):
grep MemFree /proc/meminfo
MemFree: 260498324 kB
GPU-> Titan Xp, 8 GPUs (12 GB memory each).
I know I can't fit the model (22.9 GB) on a single GPU, but I'm planning to use deepspeed.
The problem is, I can't even load the model on the CPU.
I can load 5 models, but not 8 (python -m torch.distributed.launch --nproc_per_node=5 trial.py works). Why?
I should have enough CPU memory to hold more than 10 copies of the model.
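As a back-of-the-envelope check of that claim (a sketch using the 22.9 GB on-disk size from above; note that from_pretrained can transiently need more than one copy per process while the checkpoint is deserialized):

```python
# Rough budget: how many resident copies of the model fit in free host RAM?
mem_free_gb = 252   # from /proc/meminfo
model_gb = 22.9     # reported model size

copies = mem_free_gb / model_gb
print(f"{copies:.1f} copies in theory")  # ≈ 11.0
```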
I badly need help here ;-(
Each process would try to load the model into host RAM, and with 8 × ~22.9 GB ≈ 183 GB you might be too close to the maximum available memory, since the OS, Python, etc. also need to be loaded into RAM. You can certainly observe how much RAM is needed and check whether you are indeed running out of host RAM.
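One minimal, Linux-only way to watch this from inside each rank (a sketch; MemAvailable is usually a better gauge than MemFree because it accounts for reclaimable page cache):

```python
# Linux-only sketch: read available host RAM from /proc/meminfo.
# You could print this in each rank before and after from_pretrained.
def mem_available_gb(path='/proc/meminfo'):
    with open(path) as f:
        for line in f:
            if line.startswith('MemAvailable:'):
                return int(line.split()[1]) / 1024**2  # kB -> GB
    return None

try:
    print(f"available: {mem_available_gb():.1f} GB")
except (FileNotFoundError, TypeError):
    pass  # not running on Linux
```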
You should definitely check where the rest of the 1.5 TB of host RAM disappeared to, and could also check the meta device, which might avoid OOM issues as described in this blog post.
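A minimal sketch of the meta-device idea mentioned above (using a plain torch.nn.Linear rather than the full checkpoint, and assuming torch >= 2.0 for the device context manager):

```python
import torch

# Tensors created on the "meta" device carry only shape and dtype
# and allocate no host RAM.
with torch.device('meta'):
    layer = torch.nn.Linear(4096, 4096)  # no 64 MB weight buffer is allocated

print(layer.weight.device)  # meta
```

Frameworks like accelerate and deepspeed build on this: the model skeleton is created on the meta device, then the real weights are materialized shard by shard.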
Hello, thank you very much for the answer! I was thinking that the child processes would be executed on different CPUs by load balancing or something similar.
I tried the top command on Linux, but %MEM never got dangerously high. The processes just ended.
How could I observe whether I am running out of host RAM? If you don't mind, could you suggest some ways? Thank you again!
Yes, different CPU cores can execute different processes, but I don’t understand how this is related to the OOM.
In that case, try to narrow down why the process just ended and at which part of your script.
Running it under gdb might be a good way to check if and where it crashes.
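Besides gdb, Python's built-in faulthandler module is a lighter-weight first step for seeing where a rank dies (a sketch — add it near the top of trial.py):

```python
import faulthandler

# Print a Python traceback for every thread if the process dies on a
# fatal signal (SIGSEGV, SIGFPE, SIGABRT, ...).
faulthandler.enable()

# Optional: periodically dump tracebacks so you can see where each rank
# is stuck even if it never crashes outright.
# faulthandler.dump_traceback_later(60, repeat=True)
```

Note that the kernel's OOM killer sends SIGKILL, which no user-space handler can catch; if a rank vanishes silently, `dmesg | grep -i oom` will show whether it was OOM-killed.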
I found my mistake. MemFree in /proc/meminfo displays the total free memory in the system, right? Then using multiple cores will not change anything; you're right!
At first, I thought MemFree in /proc/meminfo was per core. => If each CPU core had 252 GB (if not 1.5 TB) of free RAM, and all processes were executed on different cores, the models would be loaded into different memories → there would be no reason to hit OOM, since the model is only 22.9 GB… Such a wrong idea!
I still can't understand why the system memory is less than the spec on the Intel website… MemTotal is much lower than 1.5 TB, too.
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 72
On-line CPU(s) list: 0-71
Thread(s) per core: 2
Core(s) per socket: 18
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
Stepping: 1
CPU MHz: 1202.173
CPU max MHz: 3300.0000
CPU min MHz: 1200.0000
BogoMIPS: 4199.86
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-17,36-53
NUMA node1 CPU(s): 18-35,54-71
And thank you! I'm asking question after question, but you've helped me a lot!! I will debug the processes with gdb as you recommended and find where they crash (although I'm a little worried about multi-process debugging; the pdb library was not useful in that case…) &
figure out why the system memory is 252 GB instead of 1.5 TB &
try meta tensors if I need them! (It looks like DeepSpeed's LayerSpec; the idea is really cool.)
I will update the results here.
You should be able to run sudo dmidecode -t 17 to see information about your RAM slots and how they are populated, to figure out how much memory is really installed in the system.
I used sudo dmidecode -t 17 and found out that I only had 4 32 GB RAM modules installed (so 256 GB really was the actual system memory).
I gave up on training the 6B model on 8 Titan Xp GPUs, because the model states alone would take exactly 96 GB — the GPUs' combined 8 × 12 GB — leaving no room for activations.
After switching to the 1.3B model, I didn't suffer any CPU or GPU OOM.
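That 96 GB matches the usual ZeRO back-of-the-envelope accounting (a sketch, assuming mixed-precision Adam, i.e. 16 bytes of model states per parameter):

```python
# Model states per parameter under mixed-precision Adam (ZeRO accounting):
# fp16 weights (2 B) + fp16 grads (2 B) + fp32 copy, momentum, variance (12 B)
bytes_per_param = 2 + 2 + 12

for name, params in [("6B", 6.0e9), ("1.3B", 1.3e9)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name}: {gb:.0f} GB of model states")  # 6B -> 96 GB, 1.3B -> ~21 GB
```

96 GB exactly equals the 8 × 12 GB of total Titan Xp memory, hence no headroom for activations; the 1.3B model's ~21 GB of states leaves plenty.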
Thank you for showing me how to check the actual installed memory instead of trusting the manufacturer spec. I'll definitely try gdb when I get stuck again in distributed training!