When not specifying CUDA_VISIBLE_DEVICES, I get an OOM error

With a command like this, training runs fine:

CUDA_VISIBLE_DEVICES=0 PYTHONPATH='.' python run_train.py \
--max_seq_length 512 \
--image_size 224 \
--max_seq_length_decoder 16 \
--max_steps 800 \
--label_names "labels" \
--do_train true \
--do_eval true \

But if I remove CUDA_VISIBLE_DEVICES=0, or change it to CUDA_VISIBLE_DEVICES=0,1 or CUDA_VISIBLE_DEVICES=0,1,2, then it fails with the message

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate

I’d really like to run this with DP (not DDP) because I have very large batch sizes and I’d like to make them larger.

Please let me know how to resolve this. Thank you.

Could you describe your setup and code a bit more? Are you using different device IDs in your code and do all GPUs have the same memory? Are you able to use each one separately?

Yes. I’m working on a computer with 3 identical GPUs.

As you can see, the device IDs are the same. Yes, I can use each GPU separately and it works fine.

I’m using the HuggingFace Trainer like this:

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset if training_args.do_train else None,
        eval_dataset=eval_dataset if training_args.do_eval else None,
        # tokenizer=tokenizer,
        data_collator=simple_collate_fn,
        compute_metrics=compute_metrics_f1,
        preprocess_logits_for_metrics=preprocess_logits_for_metrics_wayne,
    )
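
For context, preprocess_logits_for_metrics is the Trainer hook that lets you shrink what gets accumulated for metric computation. A minimal hypothetical version (illustrative only; this is not the actual preprocess_logits_for_metrics_wayne, whose body isn't shown here) could keep just the predicted class IDs instead of the full logits:

    import torch

    def preprocess_logits_for_metrics_example(logits, labels):
        """Hypothetical stand-in: return only the argmax so the full logits
        are not accumulated during evaluation."""
        if isinstance(logits, tuple):  # some models return (logits, ...) tuples
            logits = logits[0]
        return logits.argmax(dim=-1)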

Please let me know if you would like any more info. Thank you.

Could you post the full error message showing how much memory was being used, as well as the memory usage on each device, when the OOM error is raised?

Here is the error message, along with a screenshot of nvidia-smi taken just before the OOM occurs:

Traceback (most recent call last):
  File "run_train.py", line 557, in <module>
    main()
  File "run_train.py", line 499, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/wayne/dev/pop-repos/model-train/wve/lib/python3.8/site-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/home/wayne/dev/pop-repos/model-train/wve/lib/python3.8/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/wayne/dev/pop-repos/model-train/wve/lib/python3.8/site-packages/transformers/trainer.py", line 2751, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/wayne/dev/pop-repos/model-train/wve/lib/python3.8/site-packages/transformers/trainer.py", line 2780, in compute_loss
    outputs = model(**inputs)
  File "/home/wayne/dev/pop-repos/model-train/wve/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wayne/dev/pop-repos/model-train/wve/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 172, in forward
    return self.gather(outputs, self.output_device)
  File "/home/wayne/dev/pop-repos/model-train/wve/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 184, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/wayne/dev/pop-repos/model-train/wve/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 86, in gather
    res = gather_map(outputs)
  File "/home/wayne/dev/pop-repos/model-train/wve/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 77, in gather_map
    return type(out)((k, gather_map([d[k] for d in outputs]))
  File "<string>", line 12, in __init__
  File "/home/wayne/dev/pop-repos/model-train/wve/lib/python3.8/site-packages/transformers/utils/generic.py", line 277, in __post_init__
    for idx, element in enumerate(iterator):
  File "/home/wayne/dev/pop-repos/model-train/wve/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 77, in <genexpr>
    return type(out)((k, gather_map([d[k] for d in outputs]))
  File "/home/wayne/dev/pop-repos/model-train/wve/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 81, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/wayne/dev/pop-repos/model-train/wve/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 71, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/home/wayne/dev/pop-repos/model-train/wve/lib/python3.8/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/wayne/dev/pop-repos/model-train/wve/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 75, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/wayne/dev/pop-repos/model-train/wve/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 235, in gather
    return torch._C._gather(tensors, dim, destination)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 92.00 MiB (GPU 0; 23.69 GiB total capacity; 20.63 GiB already allocated; 92.31 MiB free; 22.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Also, here is the output of pip freeze, in case that's helpful:

absl-py==1.4.0
accelerate==0.23.0
aiohttp==3.8.5
aiosignal==1.3.1
async-timeout==4.0.3
attrs==23.1.0
cachetools==5.3.1
catalyst==22.4
certifi==2023.7.22
charset-normalizer==3.2.0
cmake==3.27.4.1
CoLT5-attention==0.10.15
datasets==2.14.5
dill==0.3.7
einops==0.6.1
filelock==3.12.4
frozenlist==1.4.0
fsspec==2023.6.0
google-auth==2.23.0
google-auth-oauthlib==1.0.0
grpcio==1.58.0
huggingface-hub==0.17.1
hydra-slayer==0.4.1
idna==3.4
importlib-metadata==6.8.0
Jinja2==3.1.2
joblib==1.3.2
jsonlines==4.0.0
lightning-utilities==0.9.0
lit==16.0.6
local-attention==1.8.6
Markdown==3.4.4
MarkupSafe==2.1.3
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.15
networkx==3.1
numpy==1.24.4
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
oauthlib==3.2.2
packaging==23.1
pandas==2.0.3
pdf2image==1.16.0
Pillow==10.0.0
product-key-memory==0.2.10
protobuf==3.20.1
psutil==5.9.5
pyarrow==13.0.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pyDeprecate==0.3.2
python-dateutil==2.8.2
pytorch-lightning==1.6.5
pytz==2023.3.post1
PyYAML==6.0.1
regex==2023.8.8
requests==2.31.0
requests-oauthlib==1.3.1
rsa==4.9
safetensors==0.3.3
scikit-learn==1.3.0
scipy==1.10.1
sentencepiece==0.1.99
seqeval==1.2.2
six==1.16.0
sympy==1.12
tensorboard==2.14.0
tensorboard-data-server==0.7.1
tensorboardX==2.6.2.2
threadpoolctl==3.2.0
timm==0.4.12
tokenizers==0.13.3
torch==2.0.1
torchmetrics==1.1.2
torchvision==0.15.2
tqdm==4.66.1
transformers==4.30.0
triton==2.0.0
typing_extensions==4.7.1
tzdata==2023.3
urllib3==1.26.16
Werkzeug==2.3.7
xxhash==3.3.0
yarl==1.9.2
zipp==3.16.2

Based on the stacktrace, it seems nn.DataParallel is being used, which is known to create imbalanced GPU memory usage, as can also be seen in the nvidia-smi output (~24 GB on the default GPU 0, while GPU 1 and GPU 2 use ~12 GB). Use DistributedDataParallel instead to avoid this imbalance and to allow each GPU to use the same amount of memory.

Thank you.

My understanding is that DDP will parallelize the data, and we can see that each batch is > 24 GB, so I think that will probably still cause OOM. No?

DDP will create a model copy on each device and shard the data across them, similar to what nn.DataParallel is already doing, but DDP avoids its overhead and imbalanced memory usage.
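
To illustrate the "shard the data" part in plain PyTorch terms: DDP is usually paired with a DistributedSampler so each process sees a disjoint slice of the dataset (the HF Trainer sets this up for you when launched in distributed mode). A minimal sketch, assuming the process group has already been initialized:

    import torch
    from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

    # Toy dataset purely for illustration.
    dataset = TensorDataset(torch.randn(1000, 128))

    # Each rank draws from a different, non-overlapping subset of the indices.
    sampler = DistributedSampler(dataset, shuffle=True)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for epoch in range(3):
        sampler.set_epoch(epoch)  # reshuffle consistently across ranks each epoch
        for (batch,) in loader:
            ...  # forward/backward on this rank's shard only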

No. We can see that the default device uses ~2x the memory of the other devices and is the one causing the OOM: nn.DataParallel creates imbalanced memory usage because it scatters the data from the default device and gathers all the outputs back onto it as well.
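
For intuition, nn.DataParallel's forward pass is roughly equivalent to the following (a heavily simplified sketch, not the actual implementation): the batch is scattered from the default device and, crucially, every replica's outputs are gathered back onto it, which is why GPU 0 ends up holding the extra memory and is the device that runs out.

    from torch.nn.parallel import replicate, scatter, parallel_apply, gather

    def dp_forward_sketch(module, inputs, device_ids, output_device=0):
        replicas = replicate(module, device_ids)        # one model copy per GPU
        chunks = scatter(inputs, device_ids)            # batch split from the default GPU
        outputs = parallel_apply(replicas[:len(chunks)], chunks)
        return gather(outputs, output_device)           # all outputs collected on GPU 0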

OK, thank you for the detailed explanation. I got an error when trying DDP:

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 981117) of binary: /home/wayne/dev/pop-repos/model-train/wve/bin/python
Traceback (most recent call last):
  File "/home/wayne/.pyenv/versions/3.8.10/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/wayne/.pyenv/versions/3.8.10/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/wayne/dev/pop-repos/model-train/wve/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/wayne/dev/pop-repos/model-train/wve/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/wayne/dev/pop-repos/model-train/wve/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/wayne/dev/pop-repos/model-train/wve/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/wayne/dev/pop-repos/model-train/wve/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wayne/dev/pop-repos/model-train/wve/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-09-15_14:47:07
  host      : host-0
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 981118)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-09-15_14:47:07
  host      : host-0
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 981119)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-15_14:47:07
  host      : host-0
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 981117)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

The stacktrace doesn't show the actual error from the worker processes (only that they exited with code 1), so make sure you are able to run the DDP tutorial first.
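
If it helps, a minimal sanity check along the lines of the official DDP tutorial (hypothetical file name minimal_ddp_check.py, launched with torchrun --nproc_per_node 3 minimal_ddp_check.py) could look like this:

    # minimal_ddp_check.py
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        dist.init_process_group(backend="nccl")   # torchrun sets RANK/WORLD_SIZE/MASTER_ADDR/...
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(128, 4).cuda(local_rank)
        ddp_model = DDP(model, device_ids=[local_rank])
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

        x = torch.randn(32, 128, device=f"cuda:{local_rank}")
        loss = ddp_model(x).sum()
        loss.backward()                           # gradients are all-reduced across ranks here
        optimizer.step()

        print(f"rank {dist.get_rank()}/{dist.get_world_size()} OK")
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

If that runs cleanly on all three GPUs, the DDP setup itself is fine and the problem is in the training script; the per-rank error is usually printed further up in the console output, above the launcher traceback.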

Sorry, I'm confused. This stacktrace appears to me to show a number of errors. No?

For DDP, I used

python -m torch.distributed.launch --nproc_per_node 2 run_train.py

as advised here:
Efficient Training on Multiple GPUs.

Or is there a better tutorial? In the past, I have directly used dist.init_process_group(backend='nccl', world_size=args.num_gpus, rank=gpu) etc. to do DDP, but it seems HF recommends the approach above. Or should I just use PyTorch Lightning?

Thank you.