I’m running PyTorch Lightning DDP training with batch size = 16 on 8 (GPUs per node) * 2 (nodes) = 16 GPUs in total. However, I get the following
error, which happens in the ModelCheckpoint callback. There seems to be a problem during synchronization between nodes when saving the model checkpoint. When I decrease the batch size to 4, the error disappears. Can anyone help me?
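For context, the trainer setup looks roughly like the sketch below (simplified; the real code goes through our custom xflow runner, so the names here are only illustrative of the multi-node DDP arguments):

```python
# Minimal sketch of the multi-node DDP setup described above.
# Assumption: plain lightning.Trainer arguments; the actual project wraps
# the Trainer in a custom runner, so this is illustrative only.
import lightning as L

trainer = L.Trainer(
    accelerator="gpu",
    devices=8,       # 8 GPUs per node
    num_nodes=2,     # 2 nodes -> 16 GPUs total
    strategy="ddp",  # DistributedDataParallel
)
```

The callback configuration is: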
callbacks:
- type: ModelCheckpoint
every_n_train_steps: 2000
save_top_k: 30
monitor: "step"
filename: "checkpoint_{epoch}-{step}"
Stack trace:
[rank2]: Traceback (most recent call last):
[rank2]: File "/workspace/weiyh2@xiaopeng.com/xpilot_vision/ai_foundation/projects/e2e_aeb/main.py", line 130, in <module>
[rank2]: main()
[rank2]: File "/workspace/weiyh2@xiaopeng.com/xpilot_vision/ai_foundation/projects/e2e_aeb/main.py", line 121, in main
[rank2]: runner.train(resume_from=ckpt_path)
[rank2]: File "/workspace/weiyh2@xiaopeng.com/xpilot_vision/ai_foundation/projects/e2e_aeb/flow/runner/xflow_runner.py", line 38, in train
[rank2]: self.trainer.fit(
[rank2]: File "/workspace/weiyh2@xiaopeng.com/xpilot_vision/ai_foundation/xflow/xflow/lightning/trainer/xflow_trainer.py", line 356, in fit
[rank2]: super().fit(
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit
[rank2]: call._call_and_handle_interrupt(
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank2]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank2]: return function(*args, **kwargs)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 579, in _fit_impl
[rank2]: self._run(model, ckpt_path=ckpt_path)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 986, in _run
[rank2]: results = self._run_stage()
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 1030, in _run_stage
[rank2]: self.fit_loop.run()
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 206, in run
[rank2]: self.on_advance_end()
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 378, in on_advance_end
[rank2]: call._call_callback_hooks(trainer, "on_train_epoch_end", monitoring_callbacks=True)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 210, in _call_callback_hooks
[rank2]: fn(trainer, trainer.lightning_module, *args, **kwargs)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 323, in on_train_epoch_end
[rank2]: self._save_topk_checkpoint(trainer, monitor_candidates)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 383, in _save_topk_checkpoint
[rank2]: self._save_monitor_checkpoint(trainer, monitor_candidates)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 703, in _save_monitor_checkpoint
[rank2]: self._update_best_and_save(current, trainer, monitor_candidates)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 732, in _update_best_and_save
[rank2]: filepath = self._get_metric_interpolated_filepath_name(monitor_candidates, trainer, del_filepath)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 661, in _get_metric_interpolated_filepath_name
[rank2]: while self.file_exists(filepath, trainer) and filepath != del_filepath:
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 774, in file_exists
[rank2]: return trainer.strategy.broadcast(exists)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/ddp.py", line 307, in broadcast
[rank2]: torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank2]: return func(*args, **kwargs)
[rank2]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2636, in broadcast_object_list
[rank2]: object_tensor = torch.empty( # type: ignore[call-overload]
[rank2]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate more than 1EB memory.