PyTorch Lightning causes SLURM nodes to drain

Hello! When I train with the DDP strategy, any kind of crash, such as an Out Of Memory (OOM) error or an `scancel` of the SLURM job, causes the SLURM nodes to drain with the reason "Kill task failed". This means the PyTorch Lightning processes running on those nodes failed to clean up after termination. How can I fix this?

This is a multi-node setup with 8 GPUs per node. I tried 2, 3, 4, and 5 nodes, and every case reproduces the issue. It feels like the more nodes we use, the higher the probability that one or more of them ends up in the drain state.
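For reference, the drain reason can be inspected like this (a minimal sketch only, assuming the standard SLURM client tools are on PATH; it is not part of my training code):

```python
# Minimal sketch: inspect drained nodes after a crash.
import subprocess

# Show drained/down nodes and the reason SLURM recorded; this is where the
# "Kill task failed" reason appears.
print(subprocess.run(["sinfo", "-R"], capture_output=True, text=True).stdout)

# After confirming no orphaned training processes remain on the node, it can
# be returned to service manually, e.g.:
#   scontrol update NodeName=<nodename> State=RESUME
```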

```python
# lr_monitor, checkpoint_callback, swa_ensemble, PrintCallback, DataModule,
# LitModelModule, _cfg and slurm_job_id are defined elsewhere in our code.
import pytorch_lightning as pl
from pytorch_lightning import callbacks, loggers
from pytorch_lightning.strategies import DDPStrategy

callbacks_list = [
    lr_monitor,
    checkpoint_callback,
    swa_ensemble,
    PrintCallback(),
    callbacks.ModelSummary(),
    callbacks.DeviceStatsMonitor(cpu_stats=True),
]

logger = [
    loggers.tensorboard.TensorBoardLogger(save_dir="./logs", version=slurm_job_id),
]

trainer = pl.Trainer(
    gpus=_cfg.slurm_job.gpus_per_node,
    num_nodes=_cfg.slurm_job.number_of_nodes,
    accelerator="gpu",
    strategy=DDPStrategy(find_unused_parameters=False),
    plugins=pl.plugins.environments.SLURMEnvironment(auto_requeue=False),
    max_epochs=_cfg.training.fit.epochs,
    callbacks=callbacks_list,
    logger=logger,
    accumulate_grad_batches=_cfg.training.fit.accumulation_steps,
    profiler="simple",
)

data_module = DataModule(_cfg)
model_module = LitModelModule(_cfg)

trainer.fit(
    model=model_module,
    datamodule=data_module,
    ckpt_path=None,
)
```
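Is something along these lines the expected way to make the ranks shut down cleanly when the job is killed? This is only a rough sketch I am considering, not something from the Lightning API (the `CleanupOnSignal` name is my own), so I may be approaching it wrong:

```python
import signal
import sys

import torch.distributed as dist
from pytorch_lightning.callbacks import Callback


class CleanupOnSignal(Callback):
    """Rough sketch: tear down the DDP process group when SLURM sends SIGTERM,
    so the step can be killed cleanly instead of draining the node."""

    def setup(self, trainer, pl_module, stage=None):
        # Register a SIGTERM handler in every rank once training is set up.
        signal.signal(signal.SIGTERM, self._handle_sigterm)

    def _handle_sigterm(self, signum, frame):
        # Tear down process-group state, then exit so SLURM's kill succeeds.
        if dist.is_available() and dist.is_initialized():
            dist.destroy_process_group()
        sys.exit(0)

    def on_exception(self, trainer, pl_module, exception):
        # Also try to clean up on in-process failures such as CUDA OOM.
        if dist.is_available() and dist.is_initialized():
            dist.destroy_process_group()
```

The idea would be to append `CleanupOnSignal()` to `callbacks_list` above.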

Environment:

- PyTorch Lightning version: 1.7.7
- PyTorch version: 1.12.0
- Python version: 3.9.12
- OS: Linux (RHEL 7.4)
- CUDA/cuDNN version: 11.7
- GPU models and configuration: RTX 5000s