Distributed training with the Nvidia Apex library exits without any error log

Hi Team,

As part of our distributed training work, we are trying out the Nvidia Apex library, and we have already taken care of the "Set OMP_NUM_THREADS in torch.distributed.launch" issue. We are running the standard EN-DE (English to German) NMT example given in this documentation.

We have noticed that without the Apex library we can run distributed training for the EN-DE (English to German) NMT example, but with the Apex library we cannot, and surprisingly there is no error log at all. Training starts and ends within a few seconds. We are also capturing NCCL debug-level logs, but there is no error trace in them either.
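For context, a standalone NCCL sanity check of the following kind can be used to confirm that cross-node communication works independently of fairseq and Apex (a sketch only; it assumes the script is started with torch.distributed.launch, which supplies --local_rank and the env:// rendezvous variables, and the environment-variable handling mirrors our setup):

```python
import argparse
import os

import torch
import torch.distributed as dist


def main():
    # torch.distributed.launch passes --local_rank to every worker process.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # Same environment handling as the real run: pin OMP threads and enable NCCL debug output.
    os.environ.setdefault("OMP_NUM_THREADS", "1")
    os.environ.setdefault("NCCL_DEBUG", "INFO")

    torch.cuda.set_device(args.local_rank)
    # MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are supplied by the launcher (env:// init).
    dist.init_process_group(backend="nccl", init_method="env://")

    # One all_reduce across all ranks: if this hangs or exits silently,
    # the problem is in the cluster/NCCL setup rather than in Apex or fairseq.
    t = torch.ones(1, device="cuda") * dist.get_rank()
    dist.all_reduce(t)
    print("rank {}/{}: all_reduce sum = {}".format(
        dist.get_rank(), dist.get_world_size(), t.item()))


if __name__ == "__main__":
    main()
```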

On the master node, we are getting the following logs:

2020-08-12 13:52:16 | INFO | fairseq.distributed_utils | distributed init (rank 1): env://
2020-08-12 13:52:16 | INFO | fairseq.distributed_utils | distributed init (rank 6): env://
2020-08-12 13:52:16 | INFO | fairseq.distributed_utils | distributed init (rank 3): env://
2020-08-12 13:52:16 | INFO | fairseq.distributed_utils | distributed init (rank 5): env://
2020-08-12 13:52:16 | INFO | fairseq.distributed_utils | distributed init (rank 0): env://
2020-08-12 13:52:16 | INFO | fairseq.distributed_utils | distributed init (rank 7): env://
2020-08-12 13:52:16 | INFO | fairseq.distributed_utils | distributed init (rank 4): env://
2020-08-12 13:52:16 | INFO | fairseq.distributed_utils | initialized host 10-7-6-170.cactuslabs.io as rank 7
2020-08-12 13:52:16 | INFO | fairseq.distributed_utils | distributed init (rank 2): env://
2020-08-12 13:52:16 | INFO | fairseq.distributed_utils | initialized host 10-7-6-170.cactuslabs.io as rank 4
2020-08-12 13:52:16 | INFO | fairseq.distributed_utils | initialized host 10-7-6-170.cactuslabs.io as rank 2
2020-08-12 13:52:17 | INFO | fairseq.distributed_utils | initialized host 10-7-6-170.cactuslabs.io as rank 1
2020-08-12 13:52:17 | INFO | fairseq.distributed_utils | initialized host 10-7-6-170.cactuslabs.io as rank 6
2020-08-12 13:52:17 | INFO | fairseq.distributed_utils | initialized host 10-7-6-170.cactuslabs.io as rank 3
2020-08-12 13:52:17 | INFO | fairseq.distributed_utils | initialized host 10-7-6-170.cactuslabs.io as rank 5
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | initialized host 10-7-6-170.cactuslabs.io as rank 0
10-7-6-170:2171:2171 [0] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2171:2171 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-170:2171:2171 [0] NCCL INFO NET/IB : No device found.
10-7-6-170:2171:2171 [0] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.170<0>
NCCL version 2.4.8+cuda9.2
10-7-6-170:2178:2178 [7] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2175:2175 [4] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2173:2173 [2] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2172:2172 [1] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2174:2174 [3] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2177:2177 [6] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2176:2176 [5] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2178:2178 [7] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-170:2175:2175 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-170:2173:2173 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-170:2172:2172 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-170:2174:2174 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-170:2177:2177 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-170:2176:2176 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-170:2172:2172 [1] NCCL INFO NET/IB : No device found.
10-7-6-170:2174:2174 [3] NCCL INFO NET/IB : No device found.
10-7-6-170:2173:2173 [2] NCCL INFO NET/IB : No device found.
10-7-6-170:2176:2176 [5] NCCL INFO NET/IB : No device found.
10-7-6-170:2178:2178 [7] NCCL INFO NET/IB : No device found.
10-7-6-170:2175:2175 [4] NCCL INFO NET/IB : No device found.
10-7-6-170:2177:2177 [6] NCCL INFO NET/IB : No device found.
10-7-6-170:2174:2174 [3] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2172:2172 [1] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2173:2173 [2] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2177:2177 [6] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2176:2176 [5] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2178:2178 [7] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2175:2175 [4] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2171:2230 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff
10-7-6-170:2174:2231 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff
10-7-6-170:2178:2235 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff
10-7-6-170:2173:2233 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff
10-7-6-170:2175:2237 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff
10-7-6-170:2177:2236 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff
10-7-6-170:2172:2232 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff
10-7-6-170:2176:2234 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff
10-7-6-170:2171:2230 [0] NCCL INFO CUDA Dev 0[0], Socket NIC distance :  PHB
10-7-6-170:2172:2232 [1] NCCL INFO CUDA Dev 1[1], Socket NIC distance :  PHB
10-7-6-170:2173:2233 [2] NCCL INFO CUDA Dev 2[2], Socket NIC distance :  PHB
10-7-6-170:2174:2231 [3] NCCL INFO CUDA Dev 3[3], Socket NIC distance :  PHB
10-7-6-170:2175:2237 [4] NCCL INFO CUDA Dev 4[4], Socket NIC distance :  PHB
10-7-6-170:2176:2234 [5] NCCL INFO CUDA Dev 5[5], Socket NIC distance :  PHB
10-7-6-170:2177:2236 [6] NCCL INFO CUDA Dev 6[6], Socket NIC distance :  PHB
10-7-6-170:2178:2235 [7] NCCL INFO CUDA Dev 7[7], Socket NIC distance :  PHB
10-7-6-170:2171:2230 [0] NCCL INFO Channel 00 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
10-7-6-170:2171:2230 [0] NCCL INFO Channel 01 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
10-7-6-170:2176:2234 [5] NCCL INFO Ring 00 : 5[5] -> 6[6] via P2P/IPC
10-7-6-170:2174:2231 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/IPC
10-7-6-170:2177:2236 [6] NCCL INFO Ring 00 : 6[6] -> 7[7] via P2P/IPC
10-7-6-170:2175:2237 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/IPC
10-7-6-170:2172:2232 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
10-7-6-170:2173:2233 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
10-7-6-170:2171:2230 [0] NCCL INFO Ring 00 : 15 -> 0 [receive] via NET/Socket/0
10-7-6-170:2171:2230 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
10-7-6-170:2171:2230 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
10-7-6-170:2178:2235 [7] NCCL INFO Ring 00 : 7 -> 8 [send] via NET/Socket/0
10-7-6-170:2176:2234 [5] NCCL INFO Ring 00 : 5[5] -> 4[4] via P2P/IPC
10-7-6-170:2174:2231 [3] NCCL INFO Ring 00 : 3[3] -> 2[2] via P2P/IPC
10-7-6-170:2177:2236 [6] NCCL INFO Ring 00 : 6[6] -> 5[5] via P2P/IPC
10-7-6-170:2175:2237 [4] NCCL INFO Ring 00 : 4[4] -> 3[3] via P2P/IPC
10-7-6-170:2172:2232 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via P2P/IPC
10-7-6-170:2178:2235 [7] NCCL INFO Ring 00 : 7[7] -> 6[6] via P2P/IPC
10-7-6-170:2173:2233 [2] NCCL INFO Ring 00 : 2[2] -> 1[1] via P2P/IPC
10-7-6-170:2171:2230 [0] NCCL INFO Ring 00 : 8 -> 0 [receive] via NET/Socket/0
10-7-6-170:2171:2230 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
10-7-6-170:2177:2236 [6] NCCL INFO Ring 01 : 6[6] -> 7[7] via P2P/IPC
10-7-6-170:2176:2234 [5] NCCL INFO Ring 01 : 5[5] -> 6[6] via P2P/IPC
10-7-6-170:2174:2231 [3] NCCL INFO Ring 01 : 3[3] -> 4[4] via P2P/IPC
10-7-6-170:2175:2237 [4] NCCL INFO Ring 01 : 4[4] -> 5[5] via P2P/IPC
10-7-6-170:2172:2232 [1] NCCL INFO Ring 01 : 1[1] -> 2[2] via P2P/IPC
10-7-6-170:2173:2233 [2] NCCL INFO Ring 01 : 2[2] -> 3[3] via P2P/IPC
10-7-6-170:2178:2235 [7] NCCL INFO Ring 01 : 7 -> 8 [send] via NET/Socket/0
10-7-6-170:2171:2230 [0] NCCL INFO Ring 00 : 0 -> 8 [send] via NET/Socket/0
10-7-6-170:2177:2236 [6] NCCL INFO Ring 01 : 6[6] -> 5[5] via P2P/IPC
10-7-6-170:2176:2234 [5] NCCL INFO Ring 01 : 5[5] -> 4[4] via P2P/IPC
10-7-6-170:2174:2231 [3] NCCL INFO Ring 01 : 3[3] -> 2[2] via P2P/IPC
10-7-6-170:2175:2237 [4] NCCL INFO Ring 01 : 4[4] -> 3[3] via P2P/IPC
10-7-6-170:2173:2233 [2] NCCL INFO Ring 01 : 2[2] -> 1[1] via P2P/IPC
10-7-6-170:2176:2234 [5] NCCL INFO Trees [0] 4->5->6/-1/-1 [1] 4->5->6/-1/-1
10-7-6-170:2174:2231 [3] NCCL INFO Trees [0] 2->3->4/-1/-1 [1] 2->3->4/-1/-1
10-7-6-170:2175:2237 [4] NCCL INFO Trees [0] 3->4->5/-1/-1 [1] 3->4->5/-1/-1
10-7-6-170:2176:2234 [5] NCCL INFO comm 0x7fdcf4002540 rank 5 nranks 16 cudaDev 5 nvmlDev 5 - Init COMPLETE
10-7-6-170:2174:2231 [3] NCCL INFO comm 0x7f8280002540 rank 3 nranks 16 cudaDev 3 nvmlDev 3 - Init COMPLETE
10-7-6-170:2175:2237 [4] NCCL INFO comm 0x7f432c002540 rank 4 nranks 16 cudaDev 4 nvmlDev 4 - Init COMPLETE
10-7-6-170:2171:2230 [0] NCCL INFO Ring 01 : 15 -> 0 [receive] via NET/Socket/0
10-7-6-170:2171:2230 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
10-7-6-170:2171:2230 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
10-7-6-170:2172:2232 [1] NCCL INFO Ring 01 : 1[1] -> 0[0] via P2P/IPC
10-7-6-170:2178:2235 [7] NCCL INFO Ring 01 : 7[7] -> 6[6] via P2P/IPC
10-7-6-170:2177:2236 [6] NCCL INFO Trees [0] 5->6->7/-1/-1 [1] 5->6->7/-1/-1
10-7-6-170:2173:2233 [2] NCCL INFO Trees [0] 1->2->3/-1/-1 [1] 1->2->3/-1/-1
10-7-6-170:2178:2235 [7] NCCL INFO Trees [0] 6->7->-1/-1/-1 [1] 6->7->-1/-1/-1
10-7-6-170:2172:2232 [1] NCCL INFO Trees [0] 0->1->2/-1/-1 [1] 0->1->2/-1/-1
10-7-6-170:2171:2230 [0] NCCL INFO Ring 01 : 0 -> 8 [send] via NET/Socket/0
10-7-6-170:2177:2236 [6] NCCL INFO comm 0x7f166c002540 rank 6 nranks 16 cudaDev 6 nvmlDev 6 - Init COMPLETE
10-7-6-170:2173:2233 [2] NCCL INFO comm 0x7f934c002540 rank 2 nranks 16 cudaDev 2 nvmlDev 2 - Init COMPLETE
10-7-6-170:2178:2235 [7] NCCL INFO comm 0x7f7abc002540 rank 7 nranks 16 cudaDev 7 nvmlDev 7 - Init COMPLETE
10-7-6-170:2172:2232 [1] NCCL INFO comm 0x7f9d88002540 rank 1 nranks 16 cudaDev 1 nvmlDev 1 - Init COMPLETE
10-7-6-170:2171:2230 [0] NCCL INFO Ring 01 : 8 -> 0 [receive] via NET/Socket/0
10-7-6-170:2171:2230 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
10-7-6-170:2171:2230 [0] NCCL INFO Trees [0] -1->0->1/8/-1 [1] 8->0->1/-1/-1
10-7-6-170:2171:2230 [0] NCCL INFO Using 128 threads, Min Comp Cap 3, Trees enabled up to size 469999
10-7-6-170:2171:2230 [0] NCCL INFO comm 0x7fd6d8002540 rank 0 nranks 16 cudaDev 0 nvmlDev 0 - Init COMPLETE
10-7-6-170:2171:2171 [0] NCCL INFO Launch mode Parallel
2020-08-12 13:52:23 | INFO | fairseq_cli.train | Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, arch='transformer_iwslt_de_en', attention_dropout=0.0, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_suffix='', clip_norm=0.0, cpu=False, criterion='label_smoothed_cross_entropy', cross_self_attention=False, curriculum=0, data='data-bin/iwslt14.tokenized.de-en', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=4, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=1024, decoder_input_dim=512, decoder_layerdrop=0, decoder_layers=6, decoder_layers_to_keep=None, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='env://', distributed_no_spawn=True, distributed_port=-1, distributed_rank=0, distributed_world_size=16, distributed_wrapper='DDP', dropout=0.3, empty_cache_freq=0, encoder_attention_heads=4, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=1024, encoder_layerdrop=0, encoder_layers=6, encoder_layers_to_keep=None, encoder_learned_pos=False, encoder_normalize_before=False, eval_bleu=False, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=False, eval_bleu_remove_bpe=None, eval_tokenized_bleu=False, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, layernorm_embedding=False, left_pad_source='True', left_pad_target='False', load_alignments=False, localsgd_frequency=3, log_format=None, log_interval=100, lr=[0.0005], lr_scheduler='inverse_sqrt', max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=8000, max_tokens_valid=8000, max_update=0, maximize_best_checkpoint_metric=False, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, model_parallel_size=1, no_cross_attention=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_scale_embedding=False, no_seed_provided=True, no_token_positional_embeddings=False, nprocs_per_node=8, num_batch_buckets=0, num_workers=1, optimizer='adam', optimizer_overrides='{}', patience=-1, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints', save_interval=1, save_interval_updates=0, scoring='bleu', seed=1, sentence_avg=False, share_all_embeddings=False, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, source_lang=None, stop_time_hours=0, target_lang=None, task='translation', tensorboard_logdir='', threshold_loss_scale=None, tie_adaptive_weights=False, tokenizer=None, tpu=False, train_subset='train', truncate_source=False, update_freq=[1], 
upsample_primary=1, use_bmuf=False, use_old_adam=False, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, warmup_init_lr=-1, warmup_updates=4000, weight_decay=0.0)
2020-08-12 13:52:23 | INFO | fairseq.tasks.translation | [de] dictionary: 8848 types
2020-08-12 13:52:23 | INFO | fairseq.tasks.translation | [en] dictionary: 6632 types
2020-08-12 13:52:23 | INFO | fairseq.data.data_utils | loaded 7283 examples from: data-bin/iwslt14.tokenized.de-en/valid.de-en.de
2020-08-12 13:52:23 | INFO | fairseq.data.data_utils | loaded 7283 examples from: data-bin/iwslt14.tokenized.de-en/valid.de-en.en
2020-08-12 13:52:23 | INFO | fairseq.tasks.translation | data-bin/iwslt14.tokenized.de-en valid de-en 7283 examples
2020-08-12 13:52:24 | INFO | fairseq_cli.train | model transformer_iwslt_de_en, criterion LabelSmoothedCrossEntropyCriterion
2020-08-12 13:52:24 | INFO | fairseq_cli.train | num. model params: 42864640 (num. trained: 42864640)
2020-08-12 13:52:24 | INFO | fairseq.utils | ***********************CUDA enviroments for all 16 workers***********************
2020-08-12 13:52:24 | INFO | fairseq.utils | rank   0: capabilities =  3.7  ; total memory = 11.173 GB ; name = Tesla K80                               
2020-08-12 13:52:24 | INFO | fairseq.utils | rank   1: capabilities =  3.7  ; total memory = 11.173 GB ; name = Tesla K80                               
2020-08-12 13:52:24 | INFO | fairseq.utils | rank   2: capabilities =  3.7  ; total memory = 11.173 GB ; name = Tesla K80                               
2020-08-12 13:52:24 | INFO | fairseq.utils | rank   3: capabilities =  3.7  ; total memory = 11.173 GB ; name = Tesla K80                               
2020-08-12 13:52:24 | INFO | fairseq.utils | rank   4: capabilities =  3.7  ; total memory = 11.173 GB ; name = Tesla K80                               
2020-08-12 13:52:24 | INFO | fairseq.utils | rank   5: capabilities =  3.7  ; total memory = 11.173 GB ; name = Tesla K80                               
2020-08-12 13:52:24 | INFO | fairseq.utils | rank   6: capabilities =  3.7  ; total memory = 11.173 GB ; name = Tesla K80                               
2020-08-12 13:52:24 | INFO | fairseq.utils | rank   7: capabilities =  3.7  ; total memory = 11.173 GB ; name = Tesla K80                               
2020-08-12 13:52:24 | INFO | fairseq.utils | rank   8: capabilities =  3.7  ; total memory = 11.173 GB ; name = Tesla K80                               
2020-08-12 13:52:24 | INFO | fairseq.utils | rank   9: capabilities =  3.7  ; total memory = 11.173 GB ; name = Tesla K80                               
2020-08-12 13:52:24 | INFO | fairseq.utils | rank  10: capabilities =  3.7  ; total memory = 11.173 GB ; name = Tesla K80                               
2020-08-12 13:52:24 | INFO | fairseq.utils | rank  11: capabilities =  3.7  ; total memory = 11.173 GB ; name = Tesla K80                               
2020-08-12 13:52:24 | INFO | fairseq.utils | rank  12: capabilities =  3.7  ; total memory = 11.173 GB ; name = Tesla K80                               
2020-08-12 13:52:24 | INFO | fairseq.utils | rank  13: capabilities =  3.7  ; total memory = 11.173 GB ; name = Tesla K80                               
2020-08-12 13:52:24 | INFO | fairseq.utils | rank  14: capabilities =  3.7  ; total memory = 11.173 GB ; name = Tesla K80                               
2020-08-12 13:52:24 | INFO | fairseq.utils | rank  15: capabilities =  3.7  ; total memory = 11.173 GB ; name = Tesla K80                               
2020-08-12 13:52:24 | INFO | fairseq.utils | ***********************CUDA enviroments for all 16 workers***********************
2020-08-12 13:52:24 | INFO | fairseq_cli.train | training on 16 devices (GPUs/TPUs)
2020-08-12 13:52:24 | INFO | fairseq_cli.train | max tokens per GPU = 8000 and max sentences per GPU = None
2020-08-12 13:52:24 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/checkpoint_last.pt
2020-08-12 13:52:24 | INFO | fairseq.trainer | loading train data for epoch 1
2020-08-12 13:52:24 | INFO | fairseq.data.data_utils | loaded 160239 examples from: data-bin/iwslt14.tokenized.de-en/train.de-en.de
2020-08-12 13:52:24 | INFO | fairseq.data.data_utils | loaded 160239 examples from: data-bin/iwslt14.tokenized.de-en/train.de-en.en
2020-08-12 13:52:24 | INFO | fairseq.tasks.translation | data-bin/iwslt14.tokenized.de-en train de-en 160239 examples
2020-08-12 13:52:25 | INFO | fairseq.optim.adam | using FusedAdam
2020-08-12 13:52:25 | INFO | fairseq_cli.train | done training in 0.0 seconds
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************

On the slave node, we are getting the following logs:

2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | distributed init (rank 9): env://
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | initialized host 10-7-6-166.cactuslabs.io as rank 9
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | distributed init (rank 15): env://
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | initialized host 10-7-6-166.cactuslabs.io as rank 15
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | distributed init (rank 14): env://
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | distributed init (rank 8): env://
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | initialized host 10-7-6-166.cactuslabs.io as rank 14
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | distributed init (rank 13): env://
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | initialized host 10-7-6-166.cactuslabs.io as rank 8
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | distributed init (rank 10): env://
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | initialized host 10-7-6-166.cactuslabs.io as rank 13
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | distributed init (rank 12): env://
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | initialized host 10-7-6-166.cactuslabs.io as rank 10
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | distributed init (rank 11): env://
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | initialized host 10-7-6-166.cactuslabs.io as rank 12
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | initialized host 10-7-6-166.cactuslabs.io as rank 11
10-7-6-166:2407:2407 [4] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2407:2407 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-166:2407:2407 [4] NCCL INFO NET/IB : No device found.
10-7-6-166:2407:2407 [4] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2404:2404 [1] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2404:2404 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-166:2409:2409 [6] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2409:2409 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-166:2409:2409 [6] NCCL INFO NET/IB : No device found.
10-7-6-166:2404:2404 [1] NCCL INFO NET/IB : No device found.
10-7-6-166:2409:2409 [6] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2404:2404 [1] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2410:2410 [7] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2410:2410 [7] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-166:2410:2410 [7] NCCL INFO NET/IB : No device found.
10-7-6-166:2410:2410 [7] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2406:2406 [3] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2406:2406 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-166:2406:2406 [3] NCCL INFO NET/IB : No device found.
10-7-6-166:2406:2406 [3] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2405:2405 [2] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2405:2405 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-166:2405:2405 [2] NCCL INFO NET/IB : No device found.
10-7-6-166:2405:2405 [2] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2408:2408 [5] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2408:2408 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-166:2408:2408 [5] NCCL INFO NET/IB : No device found.
10-7-6-166:2408:2408 [5] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2403:2403 [0] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2403:2403 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-166:2403:2403 [0] NCCL INFO NET/IB : No device found.
10-7-6-166:2403:2403 [0] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2404:2463 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff
10-7-6-166:2410:2465 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff
10-7-6-166:2407:2462 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff
10-7-6-166:2409:2464 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff
10-7-6-166:2406:2466 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff
10-7-6-166:2405:2467 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff
10-7-6-166:2408:2468 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff
10-7-6-166:2403:2469 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff
10-7-6-166:2410:2465 [7] NCCL INFO CUDA Dev 7[7], Socket NIC distance :  PHB
10-7-6-166:2403:2469 [0] NCCL INFO CUDA Dev 0[0], Socket NIC distance :  PHB
10-7-6-166:2404:2463 [1] NCCL INFO CUDA Dev 1[1], Socket NIC distance :  PHB
10-7-6-166:2405:2467 [2] NCCL INFO CUDA Dev 2[2], Socket NIC distance :  PHB
10-7-6-166:2408:2468 [5] NCCL INFO CUDA Dev 5[5], Socket NIC distance :  PHB
10-7-6-166:2406:2466 [3] NCCL INFO CUDA Dev 3[3], Socket NIC distance :  PHB
10-7-6-166:2407:2462 [4] NCCL INFO CUDA Dev 4[4], Socket NIC distance :  PHB
10-7-6-166:2409:2464 [6] NCCL INFO CUDA Dev 6[6], Socket NIC distance :  PHB
10-7-6-166:2408:2468 [5] NCCL INFO Ring 00 : 13[5] -> 14[6] via P2P/IPC
10-7-6-166:2407:2462 [4] NCCL INFO Ring 00 : 12[4] -> 13[5] via P2P/IPC
10-7-6-166:2406:2466 [3] NCCL INFO Ring 00 : 11[3] -> 12[4] via P2P/IPC
10-7-6-166:2405:2467 [2] NCCL INFO Ring 00 : 10[2] -> 11[3] via P2P/IPC
10-7-6-166:2409:2464 [6] NCCL INFO Ring 00 : 14[6] -> 15[7] via P2P/IPC
10-7-6-166:2404:2463 [1] NCCL INFO Ring 00 : 9[1] -> 10[2] via P2P/IPC
10-7-6-166:2403:2469 [0] NCCL INFO Ring 00 : 7 -> 8 [receive] via NET/Socket/0
10-7-6-166:2403:2469 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
10-7-6-166:2410:2465 [7] NCCL INFO Ring 00 : 15 -> 0 [send] via NET/Socket/0
10-7-6-166:2403:2469 [0] NCCL INFO Ring 00 : 8[0] -> 9[1] via P2P/IPC
10-7-6-166:2407:2462 [4] NCCL INFO Ring 00 : 12[4] -> 11[3] via P2P/IPC
10-7-6-166:2410:2465 [7] NCCL INFO Ring 00 : 15[7] -> 14[6] via P2P/IPC
10-7-6-166:2408:2468 [5] NCCL INFO Ring 00 : 13[5] -> 12[4] via P2P/IPC
10-7-6-166:2406:2466 [3] NCCL INFO Ring 00 : 11[3] -> 10[2] via P2P/IPC
10-7-6-166:2405:2467 [2] NCCL INFO Ring 00 : 10[2] -> 9[1] via P2P/IPC
10-7-6-166:2409:2464 [6] NCCL INFO Ring 00 : 14[6] -> 13[5] via P2P/IPC
10-7-6-166:2404:2463 [1] NCCL INFO Ring 00 : 9[1] -> 8[0] via P2P/IPC
10-7-6-166:2403:2469 [0] NCCL INFO Ring 00 : 8 -> 0 [send] via NET/Socket/0
10-7-6-166:2407:2462 [4] NCCL INFO Ring 01 : 12[4] -> 13[5] via P2P/IPC
10-7-6-166:2408:2468 [5] NCCL INFO Ring 01 : 13[5] -> 14[6] via P2P/IPC
10-7-6-166:2406:2466 [3] NCCL INFO Ring 01 : 11[3] -> 12[4] via P2P/IPC
10-7-6-166:2405:2467 [2] NCCL INFO Ring 01 : 10[2] -> 11[3] via P2P/IPC
10-7-6-166:2409:2464 [6] NCCL INFO Ring 01 : 14[6] -> 15[7] via P2P/IPC
10-7-6-166:2404:2463 [1] NCCL INFO Ring 01 : 9[1] -> 10[2] via P2P/IPC
10-7-6-166:2410:2465 [7] NCCL INFO Ring 01 : 15 -> 0 [send] via NET/Socket/0
10-7-6-166:2407:2462 [4] NCCL INFO Ring 01 : 12[4] -> 11[3] via P2P/IPC
10-7-6-166:2406:2466 [3] NCCL INFO Ring 01 : 11[3] -> 10[2] via P2P/IPC
10-7-6-166:2405:2467 [2] NCCL INFO Ring 01 : 10[2] -> 9[1] via P2P/IPC
10-7-6-166:2408:2468 [5] NCCL INFO Ring 01 : 13[5] -> 12[4] via P2P/IPC
10-7-6-166:2407:2462 [4] NCCL INFO Trees [0] 11->12->13/-1/-1 [1] 11->12->13/-1/-1
10-7-6-166:2403:2469 [0] NCCL INFO Ring 00 : 0 -> 8 [receive] via NET/Socket/0
10-7-6-166:2409:2464 [6] NCCL INFO Ring 01 : 14[6] -> 13[5] via P2P/IPC
10-7-6-166:2403:2469 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
10-7-6-166:2406:2466 [3] NCCL INFO Trees [0] 10->11->12/-1/-1 [1] 10->11->12/-1/-1
10-7-6-166:2408:2468 [5] NCCL INFO Trees [0] 12->13->14/-1/-1 [1] 12->13->14/-1/-1
10-7-6-166:2407:2462 [4] NCCL INFO comm 0x7f0ab4002540 rank 12 nranks 16 cudaDev 4 nvmlDev 4 - Init COMPLETE
10-7-6-166:2406:2466 [3] NCCL INFO comm 0x7f8e80002540 rank 11 nranks 16 cudaDev 3 nvmlDev 3 - Init COMPLETE
10-7-6-166:2408:2468 [5] NCCL INFO comm 0x7f09e8002540 rank 13 nranks 16 cudaDev 5 nvmlDev 5 - Init COMPLETE
10-7-6-166:2403:2469 [0] NCCL INFO Ring 01 : 7 -> 8 [receive] via NET/Socket/0
10-7-6-166:2403:2469 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
10-7-6-166:2403:2469 [0] NCCL INFO Ring 01 : 8[0] -> 9[1] via P2P/IPC
10-7-6-166:2404:2463 [1] NCCL INFO Ring 01 : 9[1] -> 8[0] via P2P/IPC
10-7-6-166:2405:2467 [2] NCCL INFO Trees [0] 9->10->11/-1/-1 [1] 9->10->11/-1/-1
10-7-6-166:2410:2465 [7] NCCL INFO Ring 01 : 15[7] -> 14[6] via P2P/IPC
10-7-6-166:2409:2464 [6] NCCL INFO Trees [0] 13->14->15/-1/-1 [1] 13->14->15/-1/-1
10-7-6-166:2410:2465 [7] NCCL INFO Trees [0] 14->15->-1/-1/-1 [1] 14->15->-1/-1/-1
10-7-6-166:2405:2467 [2] NCCL INFO comm 0x7fbd7c002540 rank 10 nranks 16 cudaDev 2 nvmlDev 2 - Init COMPLETE
10-7-6-166:2409:2464 [6] NCCL INFO comm 0x7f4290002540 rank 14 nranks 16 cudaDev 6 nvmlDev 6 - Init COMPLETE
10-7-6-166:2410:2465 [7] NCCL INFO comm 0x7ff674002540 rank 15 nranks 16 cudaDev 7 nvmlDev 7 - Init COMPLETE
10-7-6-166:2404:2463 [1] NCCL INFO Trees [0] 8->9->10/-1/-1 [1] 8->9->10/-1/-1
10-7-6-166:2403:2469 [0] NCCL INFO Ring 01 : 0 -> 8 [receive] via NET/Socket/0
10-7-6-166:2403:2469 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
10-7-6-166:2404:2463 [1] NCCL INFO comm 0x7fc5b4002540 rank 9 nranks 16 cudaDev 1 nvmlDev 1 - Init COMPLETE
10-7-6-166:2403:2469 [0] NCCL INFO Ring 01 : 8 -> 0 [send] via NET/Socket/0
10-7-6-166:2403:2469 [0] NCCL INFO Trees [0] 0->8->9/-1/-1 [1] -1->8->9/0/-1
10-7-6-166:2403:2469 [0] NCCL INFO comm 0x7f19d4002540 rank 8 nranks 16 cudaDev 0 nvmlDev 0 - Init COMPLETE
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************

Environment Config

  • 2 nodes with 8 GPUs each (Tesla K80)
  • Fairseq 0.9
  • PyTorch 1.5
  • CUDA = 9.2.88
  • cuDNN = 7.6.4
  • NCCL = 2.4.8
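These versions can be double-checked from inside the PyTorch environment with a quick snippet like the following (a sketch; the expected values in the comments correspond to the setup above):

```python
import torch

print("PyTorch:", torch.__version__)              # expected 1.5.x
print("CUDA   :", torch.version.cuda)              # expected 9.2
print("cuDNN  :", torch.backends.cudnn.version())  # expected 7604 (7.6.4)
print("NCCL   :", torch.cuda.nccl.version())       # expected 2408 (2.4.8)
```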

Note: We are able to run training with Apex on a single instance with multiple GPUs. Is there anything we are missing?
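For reference, the Apex path being exercised follows the standard amp + DistributedDataParallel pattern, roughly as sketched below (illustrative only, not the exact fairseq integration; the toy model, optimizer, and opt_level are placeholders):

```python
import argparse

import torch
import torch.distributed as dist
from apex import amp
from apex.parallel import DistributedDataParallel as ApexDDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # supplied by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(512, 512).cuda()            # placeholder for the transformer model
optimizer = torch.optim.Adam(model.parameters())    # fairseq picks FusedAdam when Apex is installed

# amp.initialize patches the model and optimizer for mixed-precision training.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
# Apex's DDP wrapper handles the gradient all-reduce across ranks.
model = ApexDDP(model)

loss = model(torch.randn(8, 512).cuda()).sum()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```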

Any information or help would be useful.

Thanks in advance

Were the training jobs with and without Apex run on the same machines? It seems like GPU-GPU communication is working based on the Ring/Trees logs you pasted. It might make sense to direct this issue to the Apex GitHub repo since training works with vanilla DDP (we have also been working on bridging the gap between vanilla DDP and Apex through new features like dynamic bucketing, so the performance difference may not be as large as before).

Thank you so much @osalpekar for your reply and for sharing this valuable information.

Were the training jobs with and without Apex done on the same machines? - No, I used different cloud instances for the runs with and without Apex.

I will redirect this issue to the Nvidia Apex GitHub repo and paste the link to that issue here, so you and the PyTorch team can refer to it.

Once again thank you

Here is the link to the same issue on the Nvidia Apex repository: