My multi-GPU training OAR job keeps being killed

Hello,

I have run into a strange problem that keeps happening.

I started a training job on a Linux server with the following command:

oarsub -l "host=1/gpuid=4,walltime=480:0:0" \
"/home/username/.env/py37/bin/python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --coco_path /data/coco --output_dir /home/username/code/output --resume /home/username/code/output/checkpoint.pth"

After a few hours, the training was killed, and this happened every time I restarted it. Our system admin could not figure out what was wrong.

The error output (contents of OAR.<jobID>.stderr) is the following:

Traceback (most recent call last):
  File "/home/username/.local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/username/.local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/username/.env/py37/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/username/.env/py37/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/username/.env/py37/bin/python', '-u', 'main.py', '--coco_path', '/data/coco', '--output_dir', '/home/username/code/output', '--resume', '/home/username/code/output/checkpoint.pth']' died with <Signals.SIGKILL: 9>.

In the standard output file OAR.<jobID>.stdout, the last lines are the following:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


This message appears only at the end of OAR.<jobID>.stdout, right when the crash happened, so maybe it has something to do with the crash.

Could you please help? Thank you very much in advance!


This is printed immediately after you run launch.py. See the code below:
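
(Roughly the relevant snippet from the launcher's main() in torch/distributed/launch.py; the exact lines depend on your PyTorch version, so treat this as a sketch. os, args and current_env are defined by the launcher itself:)

if "OMP_NUM_THREADS" not in os.environ and args.nproc_per_node > 1:
    # The launcher pins each worker process to one OpenMP thread by default
    # and prints the warning you are seeing in OAR.<jobID>.stdout.
    current_env["OMP_NUM_THREADS"] = str(1)
    print("Setting OMP_NUM_THREADS environment variable for each process "
          "to be 1 in default, to avoid your system being overloaded, "
          "please further tune the variable for optimal performance in "
          "your application as needed.")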

During the few hours when the job was running, did the DistributedDataParallel training make progress as expected? You can, e.g., print some logs in every iteration to check this.
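
A minimal sketch of such per-iteration logging (epoch, it and loss stand for whatever your training loop already has):

import torch.distributed as dist

# Print a short progress line from every rank each iteration; if one rank
# stops logging while the others keep going, that rank is the one that
# crashed or hung. flush=True makes the line hit the log file immediately.
rank = dist.get_rank() if dist.is_initialized() else 0
print(f"[rank {rank}] epoch {epoch} iter {it} loss {loss.item():.4f}",
      flush=True)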

And since the command line contains a checkpoint path, I assume the training job writes to that checkpoint file periodically? Did the job successfully generate any checkpoints?

Thanks for your reply. The training made progress (and saved checkpoints) for a few hours before being killed.

Here’s the beginning of the log file (head -n 20 OAR.<jobID>.stdout):

| distributed init (rank 2): env://
| distributed init (rank 3): env://
| distributed init (rank 0): env://
| distributed init (rank 1): env://
git:
sha: ae03a2d6e52a9ec1b67f85437d0a275c5abbe9ac, status: has uncommited changes, branch: master

Namespace(aux_loss=True, backbone='resnet50', batch_size=2, bbox_loss_coef=5, clip_max_norm=0.1, coco_panoptic_path=None, coco_path='/data/coco', dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_backend='nccl', dist_url='env://', distributed=True, dropout=0.1, enc_layers=6, eos_coef=0.1, epochs=300, eval=False, frozen_weights=None, giou_loss_coef=2, gpu=0, hidden_dim=256, lr=0.0001, lr_backbone=1e-05, lr_drop=200, mask_loss_coef=1, masks=False, nheads=8, num_queries=100, num_workers=2, output_dir='/home/username/code/output', position_embedding='sine', pre_norm=False, rank=0, remove_difficult=False, resume='/home/username/code/output/checkpoint.pth', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=4)
number of params: 41302368
loading annotations into memory...
Done (t=22.53s)
creating index...
index created!
loading annotations into memory...
Done (t=0.75s)
creating index...
index created!
Start training
Epoch: [23] [ 0/14786] eta: 7:42:07 lr: 0.000100 class_error: 22.68 loss: 10.4300 (10.4300) loss_bbox: 0.3688 (0.3688) loss_bbox_0: 0.3812 (0.3812) loss_bbox_1: 0.4038 (0.4038) loss_bbox_2: 0.3718 (0.3718) loss_bbox_3: 0.3781 (0.3781) loss_bbox_4: 0.3690 (0.3690) loss_ce: 0.5279 (0.5279) loss_ce_0: 0.6643 (0.6643) loss_ce_1: 0.5894 (0.5894) loss_ce_2: 0.5849 (0.5849) loss_ce_3: 0.5311 (0.5311) loss_ce_4: 0.5083 (0.5083) loss_giou: 0.8055 (0.8055) loss_giou_0: 0.8359 (0.8359) loss_giou_1: 0.7730 (0.7730) loss_giou_2: 0.7711 (0.7711) loss_giou_3: 0.7646 (0.7646) loss_giou_4: 0.8013 (0.8013) cardinality_error_unscaled: 8.8750 (8.8750) cardinality_error_0_unscaled: 13.2500 (13.2500) cardinality_error_1_unscaled: 12.7500 (12.7500) cardinality_error_2_unscaled: 8.1250 (8.1250) cardinality_error_3_unscaled: 8.1250 (8.1250) cardinality_error_4_unscaled: 8.6250 (8.6250) class_error_unscaled: 22.6786 (22.6786) loss_bbox_unscaled: 0.0738 (0.0738) loss_bbox_0_unscaled: 0.0762 (0.0762) loss_bbox_1_unscaled: 0.0808 (0.0808) loss_bbox_2_unscaled: 0.0744 (0.0744) loss_bbox_3_unscaled: 0.0756 (0.0756) loss_bbox_4_unscaled: 0.0738 (0.0738) loss_ce_unscaled: 0.5279 (0.5279) loss_ce_0_unscaled: 0.6643 (0.6643) loss_ce_1_unscaled: 0.5894 (0.5894) loss_ce_2_unscaled: 0.5849 (0.5849) loss_ce_3_unscaled: 0.5311 (0.5311) loss_ce_4_unscaled: 0.5083 (0.5083) loss_giou_unscaled: 0.4027 (0.4027) loss_giou_0_unscaled: 0.4180 (0.4180) loss_giou_1_unscaled: 0.3865 (0.3865) loss_giou_2_unscaled: 0.3855 (0.3855) loss_giou_3_unscaled: 0.3823 (0.3823) loss_giou_4_unscaled: 0.4006 (0.4006) time: 1.8753 data: 0.4317 max mem: 2509
Epoch: [23] [ 10/14786] eta: 2:39:48 lr: 0.000100 class_error: 30.30 loss: 11.0897 (10.6174) loss_bbox: 0.3473 (0.3555) loss_bbox_0: 0.3888 (0.3989) loss_bbox_1: 0.3834 (0.3796) loss_bbox_2: 0.3662 (0.3772) loss_bbox_3: 0.3590 (0.3603) loss_bbox_4: 0.3520 (0.3548) loss_ce: 0.5279 (0.5271) loss_ce_0: 0.6043 (0.6137) loss_ce_1: 0.5870 (0.5653) loss_ce_2: 0.5627 (0.5542) loss_ce_3: 0.5400 (0.5402) loss_ce_4: 0.5083 (0.5214) loss_giou: 0.8325 (0.8231) loss_giou_0: 0.9057 (0.8922) loss_giou_1: 0.8793 (0.8482) loss_giou_2: 0.8800 (0.8514) loss_giou_3: 0.8392 (0.8296) loss_giou_4: 0.8540 (0.8247) cardinality_error_unscaled: 9.1250 (10.3182) cardinality_error_0_unscaled: 13.2500 (13.6477) cardinality_error_1_unscaled: 12.7500 (12.3068) cardinality_error_2_unscaled: 9.6250 (10.9091) cardinality_error_3_unscaled: 9.1250 (10.1705) cardinality_error_4_unscaled: 9.1250 (10.0341) class_error_unscaled: 22.6786 (23.9301) loss_bbox_unscaled: 0.0695 (0.0711) loss_bbox_0_unscaled: 0.0778 (0.0798) loss_bbox_1_unscaled: 0.0767 (0.0759) loss_bbox_2_unscaled: 0.0732 (0.0754) loss_bbox_3_unscaled: 0.0718 (0.0721) loss_bbox_4_unscaled: 0.0704 (0.0710) loss_ce_unscaled: 0.5279 (0.5271) loss_ce_0_unscaled: 0.6043 (0.6137) loss_ce_1_unscaled: 0.5870 (0.5653) loss_ce_2_unscaled: 0.5627 (0.5542) loss_ce_3_unscaled: 0.5400 (0.5402) loss_ce_4_unscaled: 0.5083 (0.5214) loss_giou_unscaled: 0.4162 (0.4116) loss_giou_0_unscaled: 0.4528 (0.4461) loss_giou_1_unscaled: 0.4397 (0.4241) loss_giou_2_unscaled: 0.4400 (0.4257) loss_giou_3_unscaled: 0.4196 (0.4148) loss_giou_4_unscaled: 0.4270 (0.4123) time: 0.6489 data: 0.0493 max mem: 3574

And here’s the end (tail -n 4 OAR.<jobID>.stdout):

Epoch: [23] [ 4730/14786] eta: 1:07:17 lr: 0.000100 class_error: 40.61 loss: 9.2283 (10.3881) loss_bbox: 0.3610 (0.3562) loss_bbox_0: 0.4111 (0.4107) loss_bbox_1: 0.3788 (0.3752) loss_bbox_2: 0.3783 (0.3661) loss_bbox_3: 0.3680 (0.3598) loss_bbox_4: 0.3660 (0.3571) loss_ce: 0.4747 (0.5366) loss_ce_0: 0.5628 (0.6116) loss_ce_1: 0.5367 (0.5860) loss_ce_2: 0.5133 (0.5601) loss_ce_3: 0.4722 (0.5455) loss_ce_4: 0.4595 (0.5364) loss_giou: 0.7070 (0.7790) loss_giou_0: 0.7854 (0.8533) loss_giou_1: 0.7170 (0.8021) loss_giou_2: 0.7175 (0.7903) loss_giou_3: 0.7327 (0.7818) loss_giou_4: 0.7180 (0.7802) cardinality_error_unscaled: 7.7500 (9.0408) cardinality_error_0_unscaled: 10.8750 (11.5247) cardinality_error_1_unscaled: 10.1250 (11.1548) cardinality_error_2_unscaled: 8.3750 (9.9196) cardinality_error_3_unscaled: 7.3750 (9.3645) cardinality_error_4_unscaled: 7.7500 (9.0276) class_error_unscaled: 30.4464 (32.1511) loss_bbox_unscaled: 0.0722 (0.0712) loss_bbox_0_unscaled: 0.0822 (0.0821) loss_bbox_1_unscaled: 0.0758 (0.0750) loss_bbox_2_unscaled: 0.0757 (0.0732) loss_bbox_3_unscaled: 0.0736 (0.0720) loss_bbox_4_unscaled: 0.0732 (0.0714) loss_ce_unscaled: 0.4747 (0.5366) loss_ce_0_unscaled: 0.5628 (0.6116) loss_ce_1_unscaled: 0.5367 (0.5860) loss_ce_2_unscaled: 0.5133 (0.5601) loss_ce_3_unscaled: 0.4722 (0.5455) loss_ce_4_unscaled: 0.4595 (0.5364) loss_giou_unscaled: 0.3535 (0.3895) loss_giou_0_unscaled: 0.3927 (0.4267) loss_giou_1_unscaled: 0.3585 (0.4011) loss_giou_2_unscaled: 0.3587 (0.3952) loss_giou_3_unscaled: 0.3664 (0.3909) loss_giou_4_unscaled: 0.3590 (0.3901) time: 0.4095 data: 0.0113 max mem: 7106


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


The OMP message is at the very end of the log file, so it doesn’t seem to be printed right after launch.

Please let me know if you need further information. Thanks!

Hmm, this is weird. One possibility is that the main process's print buffer wasn't full at the beginning and was only flushed on exit, which would make the message show up at the end. But I am not sure if this is the case.
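
(A sketch of how that would play out: when stdout goes to a file, as it does under OAR, Python block-buffers it, so a short message printed early can sit in the buffer until the process exits unless it is flushed.)

import sys

# stdout redirected to a file is block-buffered, so a short early print can
# be held until exit. Any of these make it show up immediately instead:
print("early message", flush=True)    # flush this particular print
sys.stdout.flush()                    # or flush the stream explicitly
# (running the interpreter with `python -u`, or setting PYTHONUNBUFFERED=1,
#  disables the buffering altogether)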

One reason for this behavior might be that some DDP process hit an OOM (or some other error) after a while and crashed, causing the other DDP processes to hang. Did you use any try-except around the DDP code in main.py? If so, you might want to try https://pytorch.org/elastic, as a try-except in one process can lead to DDP communication de-sync/hang/timeout.

In the code, the print function is tweaked to print on the master process only (and do nothing on the others), so the content of OAR.<jobID>.stdout comes from the master. The thing I'm not sure about is the file OAR.<jobID>.stderr: if there were an OOM error on any process, the error message should still be appended to OAR.<jobID>.stderr, right? That has nothing to do with the print override, I guess.
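
(For reference, a sketch of the master-only print pattern described above, assuming the code does something along the lines of DETR's distributed setup; the function name is illustrative. It only replaces the print builtin, so anything written to stderr, including the traceback of a crash on any rank, still ends up in OAR.<jobID>.stderr.)

import builtins

def setup_for_distributed(is_master):
    # Replace the built-in print so that only the master process writes to
    # stdout; print calls on the other ranks become no-ops. stderr is left
    # untouched, so exceptions raised on any rank remain visible.
    builtin_print = builtins.print

    def print(*args, **kwargs):
        force = kwargs.pop("force", False)
        if is_master or force:
            builtin_print(*args, **kwargs)

    builtins.print = print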

Hi,

I faced the same problem as you. This kind of error is usually caused by problems with how various libraries were compiled. I created a brand-new environment with Anaconda and trained again, and the error disappeared.

I hope this information helps.


Yes, clever! I solved the problem by following your answer, thank you.