I have seen that when training an LLM on multiple GPUs, the model gets loaded onto one GPU first and the data is then spread across the others. In this case GPU:0 has most of its memory utilised while the other GPUs are not used to the same extent. So is there a right way to distribute the model across all GPUs and train it? I know there is DDP and FSDP, but how can I wrap this into the Trainer API and train the model? For simple models I could do it, but not with the Trainer API.
Which Trainer API? The Hugging Face Trainer? And which distributed training technique are you using?
I am trying to use FSDP with the Hugging Face Trainer API, but there isn't much information on how to do it.
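In case it helps, here is a minimal sketch of enabling FSDP through `TrainingArguments`, since the Trainer picks FSDP settings up from there. This assumes a small causal LM (gpt2) and a toy wikitext split just to make it self-contained; the model name, dataset, and the layer class in `fsdp_config` are placeholders you would swap for your own, and the exact `fsdp_config` key names can differ between transformers versions.

```python
# fsdp_trainer_sketch.py -- a minimal sketch, not a drop-in script.
# Launch with: torchrun --nproc_per_node=<num_gpus> fsdp_trainer_sketch.py
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

model_name = "gpt2"  # assumption: any causal LM checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy dataset just to make the sketch runnable; replace with your own data.
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")

def tokenize(batch):
    out = tokenizer(batch["text"], truncation=True, max_length=128, padding="max_length")
    out["labels"] = out["input_ids"].copy()  # causal LM: labels mirror inputs
    return out

train_ds = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

args = TrainingArguments(
    output_dir="fsdp-out",
    per_device_train_batch_size=1,
    num_train_epochs=1,
    # FSDP is enabled directly through TrainingArguments:
    fsdp="full_shard auto_wrap",
    # fsdp_config takes a dict (or a path to a JSON file). The wrap class
    # below is GPT-2's block; use your model's transformer block class.
    # Older transformers versions name this key fsdp_transformer_layer_cls_to_wrap.
    fsdp_config={"transformer_layer_cls_to_wrap": ["GPT2Block"]},
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()
```

Alternatively, you can run `accelerate config`, answer the FSDP questions there, and then start the same script with `accelerate launch`; in that case the FSDP settings come from the accelerate config instead of `TrainingArguments`.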