Hi Team,
As part of our distributed training work, we are trying out the NVIDIA Apex library. We have already looked into the "Setting OMP_NUM_THREADS" notice that torch.distributed.launch prints (it still appears at the end of the logs below, with the default of 1). We are running the standard IWSLT'14 DE-EN (German to English) NMT example from the fairseq documentation.
We have noticed that without Apex we can run distributed training for this example, but with Apex we cannot, and surprisingly there is no error anywhere: training starts and exits within a few seconds. We are also capturing NCCL debug-level logs, but there is no error trace in them either.
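For reference, we launch with torch.distributed.launch on each node roughly as follows (a sketch: the master port and the training flags are illustrative, abbreviated from the Namespace dump below; the second node runs the same command with --node_rank=1):

    export NCCL_DEBUG=INFO   # produces the NCCL INFO lines shown below
    python -m torch.distributed.launch --nproc_per_node=8 \
        --nnodes=2 --node_rank=0 --master_addr=10.7.6.170 --master_port=29500 \
        $(which fairseq-train) data-bin/iwslt14.tokenized.de-en \
        --arch transformer_iwslt_de_en --ddp-backend c10d \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --optimizer adam --lr 0.0005 --max-tokens 8000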
On the master node, we see the following logs:
2020-08-12 13:52:16 | INFO | fairseq.distributed_utils | distributed init (rank 1): env://
2020-08-12 13:52:16 | INFO | fairseq.distributed_utils | distributed init (rank 6): env://
2020-08-12 13:52:16 | INFO | fairseq.distributed_utils | distributed init (rank 3): env://
2020-08-12 13:52:16 | INFO | fairseq.distributed_utils | distributed init (rank 5): env://
2020-08-12 13:52:16 | INFO | fairseq.distributed_utils | distributed init (rank 0): env://
2020-08-12 13:52:16 | INFO | fairseq.distributed_utils | distributed init (rank 7): env://
2020-08-12 13:52:16 | INFO | fairseq.distributed_utils | distributed init (rank 4): env://
2020-08-12 13:52:16 | INFO | fairseq.distributed_utils | initialized host 10-7-6-170.cactuslabs.io as rank 7
2020-08-12 13:52:16 | INFO | fairseq.distributed_utils | distributed init (rank 2): env://
2020-08-12 13:52:16 | INFO | fairseq.distributed_utils | initialized host 10-7-6-170.cactuslabs.io as rank 4
2020-08-12 13:52:16 | INFO | fairseq.distributed_utils | initialized host 10-7-6-170.cactuslabs.io as rank 2
2020-08-12 13:52:17 | INFO | fairseq.distributed_utils | initialized host 10-7-6-170.cactuslabs.io as rank 1
2020-08-12 13:52:17 | INFO | fairseq.distributed_utils | initialized host 10-7-6-170.cactuslabs.io as rank 6
2020-08-12 13:52:17 | INFO | fairseq.distributed_utils | initialized host 10-7-6-170.cactuslabs.io as rank 3
2020-08-12 13:52:17 | INFO | fairseq.distributed_utils | initialized host 10-7-6-170.cactuslabs.io as rank 5
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | initialized host 10-7-6-170.cactuslabs.io as rank 0
10-7-6-170:2171:2171 [0] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2171:2171 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-170:2171:2171 [0] NCCL INFO NET/IB : No device found.
10-7-6-170:2171:2171 [0] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.170<0>
NCCL version 2.4.8+cuda9.2
10-7-6-170:2178:2178 [7] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2175:2175 [4] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2173:2173 [2] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2172:2172 [1] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2174:2174 [3] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2177:2177 [6] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2176:2176 [5] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2178:2178 [7] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-170:2175:2175 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-170:2173:2173 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-170:2172:2172 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-170:2174:2174 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-170:2177:2177 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-170:2176:2176 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-170:2172:2172 [1] NCCL INFO NET/IB : No device found.
10-7-6-170:2174:2174 [3] NCCL INFO NET/IB : No device found.
10-7-6-170:2173:2173 [2] NCCL INFO NET/IB : No device found.
10-7-6-170:2176:2176 [5] NCCL INFO NET/IB : No device found.
10-7-6-170:2178:2178 [7] NCCL INFO NET/IB : No device found.
10-7-6-170:2175:2175 [4] NCCL INFO NET/IB : No device found.
10-7-6-170:2177:2177 [6] NCCL INFO NET/IB : No device found.
10-7-6-170:2174:2174 [3] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2172:2172 [1] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2173:2173 [2] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2177:2177 [6] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2176:2176 [5] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2178:2178 [7] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2175:2175 [4] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.170<0>
10-7-6-170:2171:2230 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff
10-7-6-170:2174:2231 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff
10-7-6-170:2178:2235 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff
10-7-6-170:2173:2233 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff
10-7-6-170:2175:2237 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff
10-7-6-170:2177:2236 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff
10-7-6-170:2172:2232 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff
10-7-6-170:2176:2234 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff
10-7-6-170:2171:2230 [0] NCCL INFO CUDA Dev 0[0], Socket NIC distance : PHB
10-7-6-170:2172:2232 [1] NCCL INFO CUDA Dev 1[1], Socket NIC distance : PHB
10-7-6-170:2173:2233 [2] NCCL INFO CUDA Dev 2[2], Socket NIC distance : PHB
10-7-6-170:2174:2231 [3] NCCL INFO CUDA Dev 3[3], Socket NIC distance : PHB
10-7-6-170:2175:2237 [4] NCCL INFO CUDA Dev 4[4], Socket NIC distance : PHB
10-7-6-170:2176:2234 [5] NCCL INFO CUDA Dev 5[5], Socket NIC distance : PHB
10-7-6-170:2177:2236 [6] NCCL INFO CUDA Dev 6[6], Socket NIC distance : PHB
10-7-6-170:2178:2235 [7] NCCL INFO CUDA Dev 7[7], Socket NIC distance : PHB
10-7-6-170:2171:2230 [0] NCCL INFO Channel 00 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
10-7-6-170:2171:2230 [0] NCCL INFO Channel 01 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
10-7-6-170:2176:2234 [5] NCCL INFO Ring 00 : 5[5] -> 6[6] via P2P/IPC
10-7-6-170:2174:2231 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/IPC
10-7-6-170:2177:2236 [6] NCCL INFO Ring 00 : 6[6] -> 7[7] via P2P/IPC
10-7-6-170:2175:2237 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/IPC
10-7-6-170:2172:2232 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
10-7-6-170:2173:2233 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
10-7-6-170:2171:2230 [0] NCCL INFO Ring 00 : 15 -> 0 [receive] via NET/Socket/0
10-7-6-170:2171:2230 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
10-7-6-170:2171:2230 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
10-7-6-170:2178:2235 [7] NCCL INFO Ring 00 : 7 -> 8 [send] via NET/Socket/0
10-7-6-170:2176:2234 [5] NCCL INFO Ring 00 : 5[5] -> 4[4] via P2P/IPC
10-7-6-170:2174:2231 [3] NCCL INFO Ring 00 : 3[3] -> 2[2] via P2P/IPC
10-7-6-170:2177:2236 [6] NCCL INFO Ring 00 : 6[6] -> 5[5] via P2P/IPC
10-7-6-170:2175:2237 [4] NCCL INFO Ring 00 : 4[4] -> 3[3] via P2P/IPC
10-7-6-170:2172:2232 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via P2P/IPC
10-7-6-170:2178:2235 [7] NCCL INFO Ring 00 : 7[7] -> 6[6] via P2P/IPC
10-7-6-170:2173:2233 [2] NCCL INFO Ring 00 : 2[2] -> 1[1] via P2P/IPC
10-7-6-170:2171:2230 [0] NCCL INFO Ring 00 : 8 -> 0 [receive] via NET/Socket/0
10-7-6-170:2171:2230 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
10-7-6-170:2177:2236 [6] NCCL INFO Ring 01 : 6[6] -> 7[7] via P2P/IPC
10-7-6-170:2176:2234 [5] NCCL INFO Ring 01 : 5[5] -> 6[6] via P2P/IPC
10-7-6-170:2174:2231 [3] NCCL INFO Ring 01 : 3[3] -> 4[4] via P2P/IPC
10-7-6-170:2175:2237 [4] NCCL INFO Ring 01 : 4[4] -> 5[5] via P2P/IPC
10-7-6-170:2172:2232 [1] NCCL INFO Ring 01 : 1[1] -> 2[2] via P2P/IPC
10-7-6-170:2173:2233 [2] NCCL INFO Ring 01 : 2[2] -> 3[3] via P2P/IPC
10-7-6-170:2178:2235 [7] NCCL INFO Ring 01 : 7 -> 8 [send] via NET/Socket/0
10-7-6-170:2171:2230 [0] NCCL INFO Ring 00 : 0 -> 8 [send] via NET/Socket/0
10-7-6-170:2177:2236 [6] NCCL INFO Ring 01 : 6[6] -> 5[5] via P2P/IPC
10-7-6-170:2176:2234 [5] NCCL INFO Ring 01 : 5[5] -> 4[4] via P2P/IPC
10-7-6-170:2174:2231 [3] NCCL INFO Ring 01 : 3[3] -> 2[2] via P2P/IPC
10-7-6-170:2175:2237 [4] NCCL INFO Ring 01 : 4[4] -> 3[3] via P2P/IPC
10-7-6-170:2173:2233 [2] NCCL INFO Ring 01 : 2[2] -> 1[1] via P2P/IPC
10-7-6-170:2176:2234 [5] NCCL INFO Trees [0] 4->5->6/-1/-1 [1] 4->5->6/-1/-1
10-7-6-170:2174:2231 [3] NCCL INFO Trees [0] 2->3->4/-1/-1 [1] 2->3->4/-1/-1
10-7-6-170:2175:2237 [4] NCCL INFO Trees [0] 3->4->5/-1/-1 [1] 3->4->5/-1/-1
10-7-6-170:2176:2234 [5] NCCL INFO comm 0x7fdcf4002540 rank 5 nranks 16 cudaDev 5 nvmlDev 5 - Init COMPLETE
10-7-6-170:2174:2231 [3] NCCL INFO comm 0x7f8280002540 rank 3 nranks 16 cudaDev 3 nvmlDev 3 - Init COMPLETE
10-7-6-170:2175:2237 [4] NCCL INFO comm 0x7f432c002540 rank 4 nranks 16 cudaDev 4 nvmlDev 4 - Init COMPLETE
10-7-6-170:2171:2230 [0] NCCL INFO Ring 01 : 15 -> 0 [receive] via NET/Socket/0
10-7-6-170:2171:2230 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
10-7-6-170:2171:2230 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
10-7-6-170:2172:2232 [1] NCCL INFO Ring 01 : 1[1] -> 0[0] via P2P/IPC
10-7-6-170:2178:2235 [7] NCCL INFO Ring 01 : 7[7] -> 6[6] via P2P/IPC
10-7-6-170:2177:2236 [6] NCCL INFO Trees [0] 5->6->7/-1/-1 [1] 5->6->7/-1/-1
10-7-6-170:2173:2233 [2] NCCL INFO Trees [0] 1->2->3/-1/-1 [1] 1->2->3/-1/-1
10-7-6-170:2178:2235 [7] NCCL INFO Trees [0] 6->7->-1/-1/-1 [1] 6->7->-1/-1/-1
10-7-6-170:2172:2232 [1] NCCL INFO Trees [0] 0->1->2/-1/-1 [1] 0->1->2/-1/-1
10-7-6-170:2171:2230 [0] NCCL INFO Ring 01 : 0 -> 8 [send] via NET/Socket/0
10-7-6-170:2177:2236 [6] NCCL INFO comm 0x7f166c002540 rank 6 nranks 16 cudaDev 6 nvmlDev 6 - Init COMPLETE
10-7-6-170:2173:2233 [2] NCCL INFO comm 0x7f934c002540 rank 2 nranks 16 cudaDev 2 nvmlDev 2 - Init COMPLETE
10-7-6-170:2178:2235 [7] NCCL INFO comm 0x7f7abc002540 rank 7 nranks 16 cudaDev 7 nvmlDev 7 - Init COMPLETE
10-7-6-170:2172:2232 [1] NCCL INFO comm 0x7f9d88002540 rank 1 nranks 16 cudaDev 1 nvmlDev 1 - Init COMPLETE
10-7-6-170:2171:2230 [0] NCCL INFO Ring 01 : 8 -> 0 [receive] via NET/Socket/0
10-7-6-170:2171:2230 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
10-7-6-170:2171:2230 [0] NCCL INFO Trees [0] -1->0->1/8/-1 [1] 8->0->1/-1/-1
10-7-6-170:2171:2230 [0] NCCL INFO Using 128 threads, Min Comp Cap 3, Trees enabled up to size 469999
10-7-6-170:2171:2230 [0] NCCL INFO comm 0x7fd6d8002540 rank 0 nranks 16 cudaDev 0 nvmlDev 0 - Init COMPLETE
10-7-6-170:2171:2171 [0] NCCL INFO Launch mode Parallel
2020-08-12 13:52:23 | INFO | fairseq_cli.train | Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, arch='transformer_iwslt_de_en', attention_dropout=0.0, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_suffix='', clip_norm=0.0, cpu=False, criterion='label_smoothed_cross_entropy', cross_self_attention=False, curriculum=0, data='data-bin/iwslt14.tokenized.de-en', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=4, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=1024, decoder_input_dim=512, decoder_layerdrop=0, decoder_layers=6, decoder_layers_to_keep=None, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='env://', distributed_no_spawn=True, distributed_port=-1, distributed_rank=0, distributed_world_size=16, distributed_wrapper='DDP', dropout=0.3, empty_cache_freq=0, encoder_attention_heads=4, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=1024, encoder_layerdrop=0, encoder_layers=6, encoder_layers_to_keep=None, encoder_learned_pos=False, encoder_normalize_before=False, eval_bleu=False, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=False, eval_bleu_remove_bpe=None, eval_tokenized_bleu=False, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, layernorm_embedding=False, left_pad_source='True', left_pad_target='False', load_alignments=False, localsgd_frequency=3, log_format=None, log_interval=100, lr=[0.0005], lr_scheduler='inverse_sqrt', max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=8000, max_tokens_valid=8000, max_update=0, maximize_best_checkpoint_metric=False, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, model_parallel_size=1, no_cross_attention=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_scale_embedding=False, no_seed_provided=True, no_token_positional_embeddings=False, nprocs_per_node=8, num_batch_buckets=0, num_workers=1, optimizer='adam', optimizer_overrides='{}', patience=-1, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints', save_interval=1, save_interval_updates=0, scoring='bleu', seed=1, sentence_avg=False, share_all_embeddings=False, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, source_lang=None, stop_time_hours=0, target_lang=None, task='translation', tensorboard_logdir='', threshold_loss_scale=None, tie_adaptive_weights=False, tokenizer=None, tpu=False, train_subset='train', truncate_source=False, update_freq=[1], upsample_primary=1, use_bmuf=False, use_old_adam=False, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, warmup_init_lr=-1, warmup_updates=4000, weight_decay=0.0)
2020-08-12 13:52:23 | INFO | fairseq.tasks.translation | [de] dictionary: 8848 types
2020-08-12 13:52:23 | INFO | fairseq.tasks.translation | [en] dictionary: 6632 types
2020-08-12 13:52:23 | INFO | fairseq.data.data_utils | loaded 7283 examples from: data-bin/iwslt14.tokenized.de-en/valid.de-en.de
2020-08-12 13:52:23 | INFO | fairseq.data.data_utils | loaded 7283 examples from: data-bin/iwslt14.tokenized.de-en/valid.de-en.en
2020-08-12 13:52:23 | INFO | fairseq.tasks.translation | data-bin/iwslt14.tokenized.de-en valid de-en 7283 examples
2020-08-12 13:52:24 | INFO | fairseq_cli.train | model transformer_iwslt_de_en, criterion LabelSmoothedCrossEntropyCriterion
2020-08-12 13:52:24 | INFO | fairseq_cli.train | num. model params: 42864640 (num. trained: 42864640)
2020-08-12 13:52:24 | INFO | fairseq.utils | ***********************CUDA enviroments for all 16 workers***********************
2020-08-12 13:52:24 | INFO | fairseq.utils | rank 0: capabilities = 3.7 ; total memory = 11.173 GB ; name = Tesla K80
2020-08-12 13:52:24 | INFO | fairseq.utils | rank 1: capabilities = 3.7 ; total memory = 11.173 GB ; name = Tesla K80
2020-08-12 13:52:24 | INFO | fairseq.utils | rank 2: capabilities = 3.7 ; total memory = 11.173 GB ; name = Tesla K80
2020-08-12 13:52:24 | INFO | fairseq.utils | rank 3: capabilities = 3.7 ; total memory = 11.173 GB ; name = Tesla K80
2020-08-12 13:52:24 | INFO | fairseq.utils | rank 4: capabilities = 3.7 ; total memory = 11.173 GB ; name = Tesla K80
2020-08-12 13:52:24 | INFO | fairseq.utils | rank 5: capabilities = 3.7 ; total memory = 11.173 GB ; name = Tesla K80
2020-08-12 13:52:24 | INFO | fairseq.utils | rank 6: capabilities = 3.7 ; total memory = 11.173 GB ; name = Tesla K80
2020-08-12 13:52:24 | INFO | fairseq.utils | rank 7: capabilities = 3.7 ; total memory = 11.173 GB ; name = Tesla K80
2020-08-12 13:52:24 | INFO | fairseq.utils | rank 8: capabilities = 3.7 ; total memory = 11.173 GB ; name = Tesla K80
2020-08-12 13:52:24 | INFO | fairseq.utils | rank 9: capabilities = 3.7 ; total memory = 11.173 GB ; name = Tesla K80
2020-08-12 13:52:24 | INFO | fairseq.utils | rank 10: capabilities = 3.7 ; total memory = 11.173 GB ; name = Tesla K80
2020-08-12 13:52:24 | INFO | fairseq.utils | rank 11: capabilities = 3.7 ; total memory = 11.173 GB ; name = Tesla K80
2020-08-12 13:52:24 | INFO | fairseq.utils | rank 12: capabilities = 3.7 ; total memory = 11.173 GB ; name = Tesla K80
2020-08-12 13:52:24 | INFO | fairseq.utils | rank 13: capabilities = 3.7 ; total memory = 11.173 GB ; name = Tesla K80
2020-08-12 13:52:24 | INFO | fairseq.utils | rank 14: capabilities = 3.7 ; total memory = 11.173 GB ; name = Tesla K80
2020-08-12 13:52:24 | INFO | fairseq.utils | rank 15: capabilities = 3.7 ; total memory = 11.173 GB ; name = Tesla K80
2020-08-12 13:52:24 | INFO | fairseq.utils | ***********************CUDA enviroments for all 16 workers***********************
2020-08-12 13:52:24 | INFO | fairseq_cli.train | training on 16 devices (GPUs/TPUs)
2020-08-12 13:52:24 | INFO | fairseq_cli.train | max tokens per GPU = 8000 and max sentences per GPU = None
2020-08-12 13:52:24 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/checkpoint_last.pt
2020-08-12 13:52:24 | INFO | fairseq.trainer | loading train data for epoch 1
2020-08-12 13:52:24 | INFO | fairseq.data.data_utils | loaded 160239 examples from: data-bin/iwslt14.tokenized.de-en/train.de-en.de
2020-08-12 13:52:24 | INFO | fairseq.data.data_utils | loaded 160239 examples from: data-bin/iwslt14.tokenized.de-en/train.de-en.en
2020-08-12 13:52:24 | INFO | fairseq.tasks.translation | data-bin/iwslt14.tokenized.de-en train de-en 160239 examples
2020-08-12 13:52:25 | INFO | fairseq.optim.adam | using FusedAdam
2020-08-12 13:52:25 | INFO | fairseq_cli.train | done training in 0.0 seconds
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
On the slave node, we see the following logs:
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | distributed init (rank 9): env://
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | initialized host 10-7-6-166.cactuslabs.io as rank 9
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | distributed init (rank 15): env://
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | initialized host 10-7-6-166.cactuslabs.io as rank 15
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | distributed init (rank 14): env://
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | distributed init (rank 8): env://
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | initialized host 10-7-6-166.cactuslabs.io as rank 14
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | distributed init (rank 13): env://
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | initialized host 10-7-6-166.cactuslabs.io as rank 8
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | distributed init (rank 10): env://
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | initialized host 10-7-6-166.cactuslabs.io as rank 13
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | distributed init (rank 12): env://
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | initialized host 10-7-6-166.cactuslabs.io as rank 10
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | distributed init (rank 11): env://
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | initialized host 10-7-6-166.cactuslabs.io as rank 12
2020-08-12 13:52:20 | INFO | fairseq.distributed_utils | initialized host 10-7-6-166.cactuslabs.io as rank 11
10-7-6-166:2407:2407 [4] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2407:2407 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-166:2407:2407 [4] NCCL INFO NET/IB : No device found.
10-7-6-166:2407:2407 [4] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2404:2404 [1] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2404:2404 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-166:2409:2409 [6] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2409:2409 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-166:2409:2409 [6] NCCL INFO NET/IB : No device found.
10-7-6-166:2404:2404 [1] NCCL INFO NET/IB : No device found.
10-7-6-166:2409:2409 [6] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2404:2404 [1] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2410:2410 [7] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2410:2410 [7] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-166:2410:2410 [7] NCCL INFO NET/IB : No device found.
10-7-6-166:2410:2410 [7] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2406:2406 [3] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2406:2406 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-166:2406:2406 [3] NCCL INFO NET/IB : No device found.
10-7-6-166:2406:2406 [3] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2405:2405 [2] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2405:2405 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-166:2405:2405 [2] NCCL INFO NET/IB : No device found.
10-7-6-166:2405:2405 [2] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2408:2408 [5] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2408:2408 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-166:2408:2408 [5] NCCL INFO NET/IB : No device found.
10-7-6-166:2408:2408 [5] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2403:2403 [0] NCCL INFO Bootstrap : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2403:2403 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
10-7-6-166:2403:2403 [0] NCCL INFO NET/IB : No device found.
10-7-6-166:2403:2403 [0] NCCL INFO NET/Socket : Using [0]ens3:10.7.6.166<0>
10-7-6-166:2404:2463 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff
10-7-6-166:2410:2465 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff
10-7-6-166:2407:2462 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff
10-7-6-166:2409:2464 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff
10-7-6-166:2406:2466 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff
10-7-6-166:2405:2467 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff
10-7-6-166:2408:2468 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff
10-7-6-166:2403:2469 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff
10-7-6-166:2410:2465 [7] NCCL INFO CUDA Dev 7[7], Socket NIC distance : PHB
10-7-6-166:2403:2469 [0] NCCL INFO CUDA Dev 0[0], Socket NIC distance : PHB
10-7-6-166:2404:2463 [1] NCCL INFO CUDA Dev 1[1], Socket NIC distance : PHB
10-7-6-166:2405:2467 [2] NCCL INFO CUDA Dev 2[2], Socket NIC distance : PHB
10-7-6-166:2408:2468 [5] NCCL INFO CUDA Dev 5[5], Socket NIC distance : PHB
10-7-6-166:2406:2466 [3] NCCL INFO CUDA Dev 3[3], Socket NIC distance : PHB
10-7-6-166:2407:2462 [4] NCCL INFO CUDA Dev 4[4], Socket NIC distance : PHB
10-7-6-166:2409:2464 [6] NCCL INFO CUDA Dev 6[6], Socket NIC distance : PHB
10-7-6-166:2408:2468 [5] NCCL INFO Ring 00 : 13[5] -> 14[6] via P2P/IPC
10-7-6-166:2407:2462 [4] NCCL INFO Ring 00 : 12[4] -> 13[5] via P2P/IPC
10-7-6-166:2406:2466 [3] NCCL INFO Ring 00 : 11[3] -> 12[4] via P2P/IPC
10-7-6-166:2405:2467 [2] NCCL INFO Ring 00 : 10[2] -> 11[3] via P2P/IPC
10-7-6-166:2409:2464 [6] NCCL INFO Ring 00 : 14[6] -> 15[7] via P2P/IPC
10-7-6-166:2404:2463 [1] NCCL INFO Ring 00 : 9[1] -> 10[2] via P2P/IPC
10-7-6-166:2403:2469 [0] NCCL INFO Ring 00 : 7 -> 8 [receive] via NET/Socket/0
10-7-6-166:2403:2469 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
10-7-6-166:2410:2465 [7] NCCL INFO Ring 00 : 15 -> 0 [send] via NET/Socket/0
10-7-6-166:2403:2469 [0] NCCL INFO Ring 00 : 8[0] -> 9[1] via P2P/IPC
10-7-6-166:2407:2462 [4] NCCL INFO Ring 00 : 12[4] -> 11[3] via P2P/IPC
10-7-6-166:2410:2465 [7] NCCL INFO Ring 00 : 15[7] -> 14[6] via P2P/IPC
10-7-6-166:2408:2468 [5] NCCL INFO Ring 00 : 13[5] -> 12[4] via P2P/IPC
10-7-6-166:2406:2466 [3] NCCL INFO Ring 00 : 11[3] -> 10[2] via P2P/IPC
10-7-6-166:2405:2467 [2] NCCL INFO Ring 00 : 10[2] -> 9[1] via P2P/IPC
10-7-6-166:2409:2464 [6] NCCL INFO Ring 00 : 14[6] -> 13[5] via P2P/IPC
10-7-6-166:2404:2463 [1] NCCL INFO Ring 00 : 9[1] -> 8[0] via P2P/IPC
10-7-6-166:2403:2469 [0] NCCL INFO Ring 00 : 8 -> 0 [send] via NET/Socket/0
10-7-6-166:2407:2462 [4] NCCL INFO Ring 01 : 12[4] -> 13[5] via P2P/IPC
10-7-6-166:2408:2468 [5] NCCL INFO Ring 01 : 13[5] -> 14[6] via P2P/IPC
10-7-6-166:2406:2466 [3] NCCL INFO Ring 01 : 11[3] -> 12[4] via P2P/IPC
10-7-6-166:2405:2467 [2] NCCL INFO Ring 01 : 10[2] -> 11[3] via P2P/IPC
10-7-6-166:2409:2464 [6] NCCL INFO Ring 01 : 14[6] -> 15[7] via P2P/IPC
10-7-6-166:2404:2463 [1] NCCL INFO Ring 01 : 9[1] -> 10[2] via P2P/IPC
10-7-6-166:2410:2465 [7] NCCL INFO Ring 01 : 15 -> 0 [send] via NET/Socket/0
10-7-6-166:2407:2462 [4] NCCL INFO Ring 01 : 12[4] -> 11[3] via P2P/IPC
10-7-6-166:2406:2466 [3] NCCL INFO Ring 01 : 11[3] -> 10[2] via P2P/IPC
10-7-6-166:2405:2467 [2] NCCL INFO Ring 01 : 10[2] -> 9[1] via P2P/IPC
10-7-6-166:2408:2468 [5] NCCL INFO Ring 01 : 13[5] -> 12[4] via P2P/IPC
10-7-6-166:2407:2462 [4] NCCL INFO Trees [0] 11->12->13/-1/-1 [1] 11->12->13/-1/-1
10-7-6-166:2403:2469 [0] NCCL INFO Ring 00 : 0 -> 8 [receive] via NET/Socket/0
10-7-6-166:2409:2464 [6] NCCL INFO Ring 01 : 14[6] -> 13[5] via P2P/IPC
10-7-6-166:2403:2469 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
10-7-6-166:2406:2466 [3] NCCL INFO Trees [0] 10->11->12/-1/-1 [1] 10->11->12/-1/-1
10-7-6-166:2408:2468 [5] NCCL INFO Trees [0] 12->13->14/-1/-1 [1] 12->13->14/-1/-1
10-7-6-166:2407:2462 [4] NCCL INFO comm 0x7f0ab4002540 rank 12 nranks 16 cudaDev 4 nvmlDev 4 - Init COMPLETE
10-7-6-166:2406:2466 [3] NCCL INFO comm 0x7f8e80002540 rank 11 nranks 16 cudaDev 3 nvmlDev 3 - Init COMPLETE
10-7-6-166:2408:2468 [5] NCCL INFO comm 0x7f09e8002540 rank 13 nranks 16 cudaDev 5 nvmlDev 5 - Init COMPLETE
10-7-6-166:2403:2469 [0] NCCL INFO Ring 01 : 7 -> 8 [receive] via NET/Socket/0
10-7-6-166:2403:2469 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
10-7-6-166:2403:2469 [0] NCCL INFO Ring 01 : 8[0] -> 9[1] via P2P/IPC
10-7-6-166:2404:2463 [1] NCCL INFO Ring 01 : 9[1] -> 8[0] via P2P/IPC
10-7-6-166:2405:2467 [2] NCCL INFO Trees [0] 9->10->11/-1/-1 [1] 9->10->11/-1/-1
10-7-6-166:2410:2465 [7] NCCL INFO Ring 01 : 15[7] -> 14[6] via P2P/IPC
10-7-6-166:2409:2464 [6] NCCL INFO Trees [0] 13->14->15/-1/-1 [1] 13->14->15/-1/-1
10-7-6-166:2410:2465 [7] NCCL INFO Trees [0] 14->15->-1/-1/-1 [1] 14->15->-1/-1/-1
10-7-6-166:2405:2467 [2] NCCL INFO comm 0x7fbd7c002540 rank 10 nranks 16 cudaDev 2 nvmlDev 2 - Init COMPLETE
10-7-6-166:2409:2464 [6] NCCL INFO comm 0x7f4290002540 rank 14 nranks 16 cudaDev 6 nvmlDev 6 - Init COMPLETE
10-7-6-166:2410:2465 [7] NCCL INFO comm 0x7ff674002540 rank 15 nranks 16 cudaDev 7 nvmlDev 7 - Init COMPLETE
10-7-6-166:2404:2463 [1] NCCL INFO Trees [0] 8->9->10/-1/-1 [1] 8->9->10/-1/-1
10-7-6-166:2403:2469 [0] NCCL INFO Ring 01 : 0 -> 8 [receive] via NET/Socket/0
10-7-6-166:2403:2469 [0] NCCL INFO NET/Socket: Using 2 threads and 8 sockets per thread
10-7-6-166:2404:2463 [1] NCCL INFO comm 0x7fc5b4002540 rank 9 nranks 16 cudaDev 1 nvmlDev 1 - Init COMPLETE
10-7-6-166:2403:2469 [0] NCCL INFO Ring 01 : 8 -> 0 [send] via NET/Socket/0
10-7-6-166:2403:2469 [0] NCCL INFO Trees [0] 0->8->9/-1/-1 [1] -1->8->9/0/-1
10-7-6-166:2403:2469 [0] NCCL INFO comm 0x7f19d4002540 rank 8 nranks 16 cudaDev 0 nvmlDev 0 - Init COMPLETE
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Environment Config
- 2 nodes, 8 GPUs each (Tesla K80)
- fairseq 0.9
- PyTorch 1.5
- CUDA 9.2.88
- cuDNN 7.6.4
- NCCL 2.4.8
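A quick way to confirm these versions from the runtime itself (a sketch; the values noted in the comments are the ones from the list above):

import torch
import apex  # a successful import confirms the Apex build is visible to this interpreter

print("PyTorch:", torch.__version__)             # 1.5.x
print("CUDA runtime:", torch.version.cuda)       # 9.2
print("cuDNN:", torch.backends.cudnn.version())  # 7604 -> 7.6.4
print("NCCL:", torch.cuda.nccl.version())        # 2408 -> 2.4.8 in this PyTorch build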
Note: We are able to run training with Apex on a single instance with multiple GPUs; the failure appears only in the two-node setup.
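As far as we can tell, the visible difference Apex makes in this run is the optimizer: the master log prints "using FusedAdam" (Apex's fused optimizer) immediately before training exits, whereas without Apex fairseq falls back to plain torch Adam. A sketch of that selection as we understand it (not fairseq's exact code):

import torch

try:
    # Present only when Apex is installed; fairseq prefers the fused kernel on GPU.
    from apex.optimizers import FusedAdam
    use_fused = torch.cuda.is_available()
except ImportError:
    use_fused = False  # falls back to torch.optim.Adam

print("using FusedAdam" if use_fused else "using torch.optim.Adam")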
Is there anything we are missing? Any information or help would be much appreciated.
Thanks in advance!
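P.S. For anyone trying to reproduce: a minimal cross-node all_reduce check along these lines (a sketch; nccl_check.py is just our name for it, launched with the same torch.distributed.launch arguments as the training job) can rule out basic NCCL connectivity problems, which the "Init COMPLETE" lines above already suggest are fine here.

# nccl_check.py -- minimal cross-node NCCL sanity check (sketch)
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")  # launcher sets MASTER_ADDR/PORT

# Every rank contributes 1.0; after all_reduce each rank should hold the world size (16).
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {t.item()}")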