I have an issue with FSDP:
I have two machines, device1 and device2, each with 8 GPUs. I am using PyTorch 2.0 and FSDP to train a model.
Training a 3.5B GPT-2 model on device1 or device2 alone works.
Training a 3.5B GPT-2 model on device1 and device2 together also works.
Training a 4B GPT-2 model on device1 or device2 alone does not work.
Training a 4B GPT-2 model on device1 and device2 together still does not work. The output is: torch.cuda.OutOfMemoryError: CUDA out of memory.
I want to train the 4B model on device1 and device2 with FSDP. How should I solve this problem? Thanks.
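In case it helps, this is roughly how I launch training on each machine (assuming the config below is a Hugging Face Accelerate config file; the script name, IP, and port here are placeholders, not my real values):

# on device2 (the rank-0 machine); train_gpt2.py and the IP/port are placeholders
accelerate launch --config_file fsdp_config.yaml \
  --num_machines 2 --machine_rank 0 \
  --main_process_ip 10.0.0.1 --main_process_port 29500 \
  --num_processes 16 \
  train_gpt2.py

# on device1 (the rank-1 machine): the same command with --machine_rank 1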
This is my FSDP config file:
fsdp_config:
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch_policy: NO_PREFETCH
fsdp_offload_params: true
fsdp_sharding_strategy: 1
fsdp_state_dict_type: FULL_STATE_DICT
fsdp_transformer_layer_cls_to_wrap: GPT2Block
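If I understand the Accelerate options correctly, this config corresponds roughly to the following plain-PyTorch FSDP wrapping (just a sketch; the model sizes and variable names are placeholders, not my exact training script):

import functools
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    CPUOffload,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import GPT2Config, GPT2LMHeadModel
from transformers.models.gpt2.modeling_gpt2 import GPT2Block

# One process per GPU; LOCAL_RANK is set by the launcher.
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Placeholder sizes for a roughly 4B-parameter GPT-2 (not my exact architecture).
config = GPT2Config(n_embd=3072, n_layer=36, n_head=24, n_positions=2048)
model = GPT2LMHeadModel(config)

# TRANSFORMER_BASED_WRAP with fsdp_transformer_layer_cls_to_wrap: GPT2Block
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={GPT2Block},
)

model = FSDP(
    model,
    auto_wrap_policy=auto_wrap_policy,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # fsdp_sharding_strategy: 1
    cpu_offload=CPUOffload(offload_params=True),    # fsdp_offload_params: true
    backward_prefetch=None,                         # NO_PREFETCH
    device_id=torch.cuda.current_device(),
)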
This is the NCCL output on device1:
ubuntu-SYS-4028GR-TR:30007:30101 [0] NCCL INFO comm 0x7b44ec0 rank 8 nranks 16 cudaDev 0 busId 4000 - Init COMPLETE
ubuntu-SYS-4028GR-TR:30008:30100 [1] NCCL INFO comm 0x844ccf0 rank 9 nranks 16 cudaDev 1 busId 5000 - Init COMPLETE
ubuntu-SYS-4028GR-TR:30011:30102 [4] NCCL INFO comm 0x6b29170 rank 12 nranks 16 cudaDev 4 busId 82000 - Init COMPLETE
ubuntu-SYS-4028GR-TR:30012:30095 [5] NCCL INFO comm 0x72e6670 rank 13 nranks 16 cudaDev 5 busId 85000 - Init COMPLETE
ubuntu-SYS-4028GR-TR:30013:30097 [6] NCCL INFO comm 0x67e6340 rank 14 nranks 16 cudaDev 6 busId 86000 - Init COMPLETE
ubuntu-SYS-4028GR-TR:30009:30099 [2] NCCL INFO comm 0x83e3070 rank 10 nranks 16 cudaDev 2 busId 8000 - Init COMPLETE
ubuntu-SYS-4028GR-TR:30010:30098 [3] NCCL INFO comm 0x7dce1b0 rank 11 nranks 16 cudaDev 3 busId 9000 - Init COMPLETE
ubuntu-SYS-4028GR-TR:30015:30096 [7] NCCL INFO comm 0x8d140c0 rank 15 nranks 16 cudaDev 7 busId 8a000 - Init COMPLETE
This is the NCCL output on device2:
ubuntu1-NF5588M4S:7039:7128 [0] NCCL INFO comm 0x7dd2bb0 rank 0 nranks 16 cudaDev 0 busId 4000 - Init COMPLETE
ubuntu1-NF5588M4S:7041:7135 [2] NCCL INFO comm 0x82337c0 rank 2 nranks 16 cudaDev 2 busId 7000 - Init COMPLETE
ubuntu1-NF5588M4S:7043:7133 [4] NCCL INFO comm 0x73f0d40 rank 4 nranks 16 cudaDev 4 busId c000 - Init COMPLETE
ubuntu1-NF5588M4S:7045:7132 [6] NCCL INFO comm 0x7e93c70 rank 6 nranks 16 cudaDev 6 busId e000 - Init COMPLETE
ubuntu1-NF5588M4S:7047:7130 [7] NCCL INFO comm 0x8532370 rank 7 nranks 16 cudaDev 7 busId f000 - Init COMPLETE
ubuntu1-NF5588M4S:7042:7131 [3] NCCL INFO comm 0x80c3180 rank 3 nranks 16 cudaDev 3 busId 8000 - Init COMPLETE
ubuntu1-NF5588M4S:7044:7129 [5] NCCL INFO comm 0x8403930 rank 5 nranks 16 cudaDev 5 busId d000 - Init COMPLETE
ubuntu1-NF5588M4S:7040:7134 [1] NCCL INFO comm 0x8da3d00 rank 1 nranks 16 cudaDev 1 busId 6000 - Init COMPLETE