Hi, I’m training a large LM on 8 A100-80GB GPUs with FSDP through HuggingFace’s Trainer. I specified the FSDP parameters as follows:
fsdp: full_shard auto_wrap
fsdp_config:
  fsdp_transformer_layer_cls_to_wrap:
    - LlamaDecoderLayer
But when saving a model checkpoint, FSDP emits the following warning caused by a CUDA out-of-memory error:
/home/ubuntu/miniconda3/envs/finetune-clm/lib/python3.10/site-
packages/torch/distributed/fsdp/_state_dict_utils.py:312: UserWarning: Failed to clone() tensor with
name _fsdp_wrapped_module.model.layers.59.mlp.gate_proj.weight on rank 6. This may mean that
this state_dict entry could point to invalid memory regions after returning from state_dict() call if this
parameter is managed by FSDP. Please check clone implementation of
_fsdp_wrapped_module.model.layers.59.mlp.gate_proj.weight. Error: CUDA out of memory. Tried to
allocate 228.00 MiB (GPU 6; 79.20 GiB total capacity; 75.32 GiB already allocated; 75.25 MiB free;
77.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting
max_split_size_mb to avoid fragmentation. See documentation for Memory Management and
PYTORCH_CUDA_ALLOC_CONF
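For reference, here is how the allocator setting suggested by the warning can be applied before launching the job. The value 128 is just an arbitrary starting point, and the launch command is a placeholder for my actual one:

```shell
# Suggested by the warning above: cap the allocator's block split size
# to reduce fragmentation. 128 MiB is an arbitrary value to start from.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# Then launch as usual, e.g. (train.py is a placeholder for my script):
# torchrun --nproc_per_node=8 train.py
```

I have not yet confirmed whether this setting alone avoids the failed clone() during checkpoint saving.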
However, there is plenty of memory left on every GPU during training:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:05:00.0 Off | 0 |
| N/A 50C P0 250W / 400W | 45141MiB / 81920MiB | 99% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:06:00.0 Off | 0 |
| N/A 51C P0 220W / 400W | 47039MiB / 81920MiB | 99% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:07:00.0 Off | 0 |
| N/A 48C P0 328W / 400W | 52993MiB / 81920MiB | 99% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... On | 00000000:08:00.0 Off | 0 |
| N/A 52C P0 413W / 400W | 54209MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... On | 00000000:09:00.0 Off | 0 |
| N/A 48C P0 94W / 400W | 47175MiB / 81920MiB | 99% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM... On | 00000000:0A:00.0 Off | 0 |
| N/A 43C P0 379W / 400W | 53687MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM... On | 00000000:0B:00.0 Off | 0 |
| N/A 43C P0 158W / 400W | 49709MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM... On | 00000000:0C:00.0 Off | 0 |
| N/A 59C P0 421W / 400W | 54341MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
It is only a warning and training continues, but the saved model checkpoints are corrupted and cannot be loaded, which means I cannot resume from an intermediate checkpoint.
When I tested the training job with a smaller model on a smaller GPU, FSDP saved checkpoints without any problem, even though GPU memory was tighter (less than 1 GB free during training). But on the 80 GB A100s it reports CUDA OOM even though almost 30 GB are free during training. Is there anything I can do to get rid of this problem? Or did I miss something?
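For context, here is a workaround I’m considering but haven’t verified: bypass the default saving path and gather the full state dict with CPU offload on rank 0, so the clone() happens in host memory rather than on the GPU. `save_full_checkpoint` is my own helper name, not a library function:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType, FullStateDictConfig

def save_full_checkpoint(model, path, rank):
    # Gather the full (unsharded) parameters on rank 0 only, offloaded
    # to CPU, so state_dict() does not need extra GPU memory for clone().
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        state = model.state_dict()
    if rank == 0:
        torch.save(state, path)
```

I’m unsure whether this plays nicely with Trainer’s own checkpoint logic, so any pointers on the right way to hook it in would be appreciated.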
My training env:
- torch: 2.0.0+cu118
- cuda: 11.8
- transformers: 4.48.1
Thanks in advance!