Collapse in model training

This problem has confused me for many days, and I'm very eager to discuss it with someone.

I have constructed a siamese model (two ResNet18 branches with the same structure and shared weights) for classification, and I fixed the random seed for reproducibility.

First, I wanted to validate my implementation, so I froze one of the two models (no weight updates) and trained only the other. But training collapses easily: the loss becomes NaN.

I can't figure out why.

Setup: batch size 2048, learning rate 0.1, 870,672 input pairs (the siamese network takes image pairs as input), 3 datasets for verification, and 10,575 target classes. Each branch has its own softmax loss, but in the training below only one loss is active, because the other branch is frozen for validation.
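For reference, the weight sharing between the two branches can be sketched like this. This is a minimal stand-in, not my actual code: the tiny linear "backbone" substitutes for the shared ResNet18, and the class and dimension names are made up for illustration.

```python
import torch
import torch.nn as nn

class SiameseSketch(nn.Module):
    """A siamese pair is one backbone applied twice, so weights are
    shared by construction (hypothetical stand-in for ResNet18)."""

    def __init__(self, embed_dim=8):
        super().__init__()
        # Stand-in for the shared ResNet18 backbone
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(16, embed_dim))

    def forward(self, x1, x2):
        # Both branches reuse self.backbone, so gradients from either
        # branch accumulate into the same parameters.
        return self.backbone(x1), self.backbone(x2)

model = SiameseSketch()
a, b = torch.randn(4, 16), torch.randn(4, 16)
e1, e2 = model(a, b)
# Identical inputs give identical embeddings because the weights are shared
same = torch.allclose(model(a, a)[0], model(a, a)[1])
```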

When I reduce the batch size, training returns to normal.
When I don't fix the random seed, training returns to normal.
When I drop the siamese structure and train a single model as a baseline, training returns to normal.

But the core question is: why? I'm quite confused by this.

The code I use to fix the random seed:

################## for reproducibility #####################
import os
import random

import numpy as np
import torch

GLOBAL_SEED = 2  # seed = 1: collapse at epoch 2, iter 1200; seed = 2: collapse at epoch 3, iter 1300
GLOBAL_WORKER_ID = None
torch.backends.cudnn.benchmark = False  # benchmark=True would break determinism
torch.backends.cudnn.deterministic = True

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

def worker_init_fn(worker_id):
    global GLOBAL_WORKER_ID
    GLOBAL_WORKER_ID = worker_id
    set_seed(GLOBAL_SEED + worker_id)

set_seed(GLOBAL_SEED)

################## for reproducibility #####################
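As a quick sanity check that this seeding actually reproduces the same random stream, the following CPU-only sketch reseeds and redraws (it repeats the `set_seed` definition so it runs standalone; it is a check of the seeding idea, not part of the training script):

```python
import os
import random

import numpy as np
import torch

torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

def set_seed(seed):
    # Seed every RNG the training loop may touch
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only builds
    os.environ['PYTHONHASHSEED'] = str(seed)

set_seed(2)
first = torch.randn(3)
set_seed(2)
second = torch.randn(3)
# Same seed, same draw: the streams are bit-identical
reproducible = torch.equal(first, second)
```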

The error report (the 5 floats in each line are the mean backward gradients of ‘module.layer1.0.conv1.weight’, ‘module.layer2.0.conv1.weight’, ‘module.layer3.0.conv1.weight’, ‘module.layer4.0.conv1.weight’, and the last layer’s loss gradient):
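Per-layer mean gradients like the ones in this log can be read off after `backward()`. A minimal sketch of the idea (the toy model and names here are illustrative, not my logging code):

```python
import torch
import torch.nn as nn

# Toy model standing in for the backbone
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 1))

out = model(torch.randn(8, 4)).sum()
out.backward()

# After backward(), each parameter's .grad holds the accumulated gradient;
# its mean is the kind of scalar printed in the log lines below
means = [p.grad.mean().item() for p in model.parameters()]
```

Watching these means is how a blow-up shows: they hover near zero for a while, spike (e.g. step 1200 below), then go NaN.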

Training: 2021-04-29 09:31:37,519-backbone1	50	-0.0000137358	-0.0000043800	-0.0000008597	-0.0000000811	-0.0001298849
Training: 2021-04-29 09:32:06,250-backbone1	100	0.0000073106	-0.0000006519	-0.0000003127	0.0000000286	-0.0000465577
Training: 2021-04-29 09:32:06,260-Speed 3564.05 samples/sec   Loss1 49.1559   Loss2 0.0000   Epoch: 0   Global Step: 100   Required: 2 hours
Training: 2021-04-29 09:32:35,429-backbone1	150	0.0000032515	-0.0000003846	-0.0000003430	0.0000000969	0.0000119228
Training: 2021-04-29 09:32:35,439-Speed 3509.50 samples/sec   Loss1 41.0964   Loss2 0.0000   Epoch: 0   Global Step: 150   Required: 2 hours
Training: 2021-04-29 09:33:04,455-backbone1	200	0.0000059055	-0.0000007602	-0.0000031270	-0.0000003199	0.0000291518
Training: 2021-04-29 09:33:04,466-Speed 3527.79 samples/sec   Loss1 40.2287   Loss2 0.0000   Epoch: 0   Global Step: 200   Required: 2 hours
Training: 2021-04-29 09:33:33,504-backbone1	250	0.0000039045	-0.0000049022	-0.0000051885	-0.0000007093	0.0000097477
Training: 2021-04-29 09:33:33,515-Speed 3525.17 samples/sec   Loss1 39.6313   Loss2 0.0000   Epoch: 0   Global Step: 250   Required: 2 hours
Training: 2021-04-29 09:34:02,602-backbone1	300	-0.0000067733	0.0000056458	0.0000001776	-0.0000001335	-0.0000031163
Training: 2021-04-29 09:34:02,613-Speed 3519.13 samples/sec   Loss1 39.0091   Loss2 0.0000   Epoch: 0   Global Step: 300   Required: 2 hours
Training: 2021-04-29 09:34:32,103-backbone1	350	-0.0000052458	-0.0000033983	0.0000027793	0.0000003088	0.0000327631
Training: 2021-04-29 09:34:32,114-Speed 3471.10 samples/sec   Loss1 38.2565   Loss2 0.0000   Epoch: 0   Global Step: 350   Required: 2 hours
Training: 2021-04-29 09:35:01,292-backbone1	400	0.0000029013	-0.0000019848	-0.0000013949	0.0000003473	0.0000103362
Training: 2021-04-29 09:35:01,302-Speed 3508.34 samples/sec   Loss1 37.2425   Loss2 0.0000   Epoch: 0   Global Step: 400   Required: 2 hours
testing verification...
(12000, 512)
infer time 19.819692000000018
Training: 2021-04-29 09:35:23,950-[lfw][400]XNorm: 28.534864
Training: 2021-04-29 09:35:23,950-[lfw][400]Accuracy-Flip: 0.75733+-0.01886
Training: 2021-04-29 09:35:23,950-[lfw][400]Accuracy-Highest: 0.75733
testing verification..
(14000, 512)
infer time 22.327677000000026
Training: 2021-04-29 09:35:49,414-[cfp_fp][400]XNorm: 32.921188
Training: 2021-04-29 09:35:49,414-[cfp_fp][400]Accuracy-Flip: 0.57457+-0.01552
Training: 2021-04-29 09:35:49,414-[cfp_fp][400]Accuracy-Highest: 0.57457
testing verification..
(12000, 512)
infer time 19.14271100000001
Training: 2021-04-29 09:36:11,377-[agedb_30][400]XNorm: 28.411715
Training: 2021-04-29 09:36:11,378-[agedb_30][400]Accuracy-Flip: 0.50517+-0.00825
Training: 2021-04-29 09:36:11,378-[agedb_30][400]Accuracy-Highest: 0.50517
Training: 2021-04-29 09:36:40,842-backbone1	450	0.0000034773	0.0000083096	0.0000005531	0.0000002907	-0.0000192381
Training: 2021-04-29 09:36:40,852-Speed 1028.64 samples/sec   Loss1 35.9364   Loss2 0.0000   Epoch: 1   Global Step: 450   Required: 3 hours
Training: 2021-04-29 09:37:09,992-backbone1	500	-0.0000086446	-0.0000030455	0.0000015252	-0.0000009439	-0.0000184080
Training: 2021-04-29 09:37:10,002-Speed 3512.90 samples/sec   Loss1 34.4943   Loss2 0.0000   Epoch: 1   Global Step: 500   Required: 2 hours
Training: 2021-04-29 09:37:39,172-backbone1	550	0.0000060328	0.0000011825	-0.0000021444	0.0000010476	0.0000390993
Training: 2021-04-29 09:37:39,182-Speed 3509.24 samples/sec   Loss1 32.9047   Loss2 0.0000   Epoch: 1   Global Step: 550   Required: 2 hours
Training: 2021-04-29 09:38:08,557-backbone1	600	0.0000096855	-0.0000213673	-0.0000070799	0.0000001221	-0.0000334739
Training: 2021-04-29 09:38:08,568-Speed 3484.76 samples/sec   Loss1 31.2441   Loss2 0.0000   Epoch: 1   Global Step: 600   Required: 2 hours
Training: 2021-04-29 09:38:37,834-backbone1	650	-0.0000215237	-0.0000134368	-0.0000043527	-0.0000016915	0.0000000888
Training: 2021-04-29 09:38:37,845-Speed 3497.65 samples/sec   Loss1 29.5965   Loss2 0.0000   Epoch: 1   Global Step: 650   Required: 2 hours
Training: 2021-04-29 09:39:07,422-backbone1	700	-0.0000863088	0.0000031217	-0.0000091112	-0.0000066638	0.0000530388
Training: 2021-04-29 09:39:07,433-Speed 3460.90 samples/sec   Loss1 28.0384   Loss2 0.0000   Epoch: 1   Global Step: 700   Required: 2 hours
Training: 2021-04-29 09:39:37,182-backbone1	750	-0.0000304818	0.0000089085	-0.0000000184	-0.0000025467	0.0000559710
Training: 2021-04-29 09:39:37,192-Speed 3440.95 samples/sec   Loss1 26.7838   Loss2 0.0000   Epoch: 1   Global Step: 750   Required: 2 hours
Training: 2021-04-29 09:40:06,530-backbone1	800	0.0000194513	-0.0000321974	-0.0000130659	-0.0000101980	0.0000749629
Training: 2021-04-29 09:40:06,541-Speed 3489.08 samples/sec   Loss1 25.5273   Loss2 0.0000   Epoch: 1   Global Step: 800   Required: 2 hours
testing verification..
(12000, 512)
infer time 19.298425999999996
Training: 2021-04-29 09:40:28,642-[lfw][800]XNorm: 20.632740
Training: 2021-04-29 09:40:28,642-[lfw][800]Accuracy-Flip: 0.78467+-0.02390
Training: 2021-04-29 09:40:28,642-[lfw][800]Accuracy-Highest: 0.78467
testing verification..
(14000, 512)
infer time 22.323161000000002
Training: 2021-04-29 09:40:54,134-[cfp_fp][800]XNorm: 23.564507
Training: 2021-04-29 09:40:54,134-[cfp_fp][800]Accuracy-Flip: 0.51271+-0.01559
Training: 2021-04-29 09:40:54,135-[cfp_fp][800]Accuracy-Highest: 0.57457
testing verification..
(12000, 512)
infer time 19.17494900000001
Training: 2021-04-29 09:41:16,108-[agedb_30][800]XNorm: 20.196208
Training: 2021-04-29 09:41:16,109-[agedb_30][800]Accuracy-Flip: 0.55300+-0.02642
Training: 2021-04-29 09:41:16,109-[agedb_30][800]Accuracy-Highest: 0.55300
Training: 2021-04-29 09:41:16,111-SAVE /data/user1/log/frvt_pytorch/r18_0428_singleBranch1/backbone-800.pth
Training: 2021-04-29 09:41:45,760-backbone1	850	0.0000074828	-0.0000015774	0.0000068587	0.0000018232	-0.0000163120
Training: 2021-04-29 09:41:45,769-Speed 1031.97 samples/sec   Loss1 24.5892   Loss2 0.0000   Epoch: 1   Global Step: 850   Required: 2 hours
Training: 2021-04-29 09:42:15,790-backbone1	900	-0.0000157441	-0.0000004449	-0.0000069333	-0.0000003870	-0.0000270364
Training: 2021-04-29 09:42:15,800-Speed 3409.87 samples/sec   Loss1 23.4581   Loss2 0.0000   Epoch: 2   Global Step: 900   Required: 2 hours
Training: 2021-04-29 09:42:44,992-backbone1	950	0.0000218877	-0.0000229170	-0.0000062682	0.0000033801	-0.0000269744
Training: 2021-04-29 09:42:45,002-Speed 3506.60 samples/sec   Loss1 22.9516   Loss2 0.0000   Epoch: 2   Global Step: 950   Required: 2 hours
Training: 2021-04-29 09:43:14,325-backbone1	1000	-0.0000338137	0.0000102696	0.0000077743	0.0000050223	-0.0000190491
Training: 2021-04-29 09:43:14,336-Speed 3490.94 samples/sec   Loss1 22.2874   Loss2 0.0000   Epoch: 2   Global Step: 1000   Required: 2 hours
Training: 2021-04-29 09:43:43,777-backbone1	1050	-0.0000002878	0.0000241583	0.0000045870	0.0000014826	0.0000152171
Training: 2021-04-29 09:43:43,788-Speed 3476.86 samples/sec   Loss1 21.8218   Loss2 0.0000   Epoch: 2   Global Step: 1050   Required: 2 hours
Training: 2021-04-29 09:44:13,120-backbone1	1100	0.0000183425	0.0000207572	0.0000063101	-0.0000013475	0.0000424789
Training: 2021-04-29 09:44:13,131-Speed 3489.83 samples/sec   Loss1 21.2046   Loss2 0.0000   Epoch: 2   Global Step: 1100   Required: 2 hours
Training: 2021-04-29 09:44:42,763-backbone1	1150	-0.0000001899	-0.0000095896	-0.0000034797	-0.0000012843	0.0003032629
Training: 2021-04-29 09:44:42,774-Speed 3454.45 samples/sec   Loss1 19.9759   Loss2 0.0000   Epoch: 2   Global Step: 1150   Required: 2 hours
Training: 2021-04-29 09:45:12,098-backbone1	1200	0.0000205074	0.0001122311	-0.0000007625	0.0000006167	0.0062720394
Training: 2021-04-29 09:45:12,109-Speed 3490.74 samples/sec   Loss1 12.3644   Loss2 0.0000   Epoch: 2   Global Step: 1200   Required: 2 hours
testing verification..
(12000, 512)
infer time 19.27514199999998
Training: 2021-04-29 09:45:34,196-[lfw][1200]XNorm: 1861.459197
Training: 2021-04-29 09:45:34,196-[lfw][1200]Accuracy-Flip: 0.54150+-0.02608
Training: 2021-04-29 09:45:34,196-[lfw][1200]Accuracy-Highest: 0.78467
testing verification..
(14000, 512)
infer time 22.579255999999955
Training: 2021-04-29 09:46:00,088-[cfp_fp][1200]XNorm: 2571.969116
Training: 2021-04-29 09:46:00,088-[cfp_fp][1200]Accuracy-Flip: 0.50014+-0.00837
Training: 2021-04-29 09:46:00,088-[cfp_fp][1200]Accuracy-Highest: 0.57457
testing verification..
(12000, 512)
infer time 19.345285000000032
Training: 2021-04-29 09:46:22,217-[agedb_30][1200]XNorm: 1319.737781
Training: 2021-04-29 09:46:22,217-[agedb_30][1200]Accuracy-Flip: 0.51717+-0.02199
Training: 2021-04-29 09:46:22,217-[agedb_30][1200]Accuracy-Highest: 0.55300
Training: 2021-04-29 09:46:22,221-SAVE /data/user1/log/frvt_pytorch/r18_0428_singleBranch1/backbone-1200.pth
Training: 2021-04-29 09:46:50,760-backbone1	1250	-0.0000009155	-0.0000001523	-0.0000000194	0.0000000129	0.0117469588
Training: 2021-04-29 09:46:50,771-Speed 1037.89 samples/sec   Loss1 3.3284   Loss2 0.0000   Epoch: 2   Global Step: 1250   Required: 2 hours
Training: 2021-04-29 09:47:20,391-backbone1	1300	0.0000003243	0.0000012285	-0.0000003732	0.0000021354	0.0076964526
Training: 2021-04-29 09:47:20,401-Speed 3456.02 samples/sec   Loss1 3.2593   Loss2 0.0000   Epoch: 3   Global Step: 1300   Required: 2 hours
Training: 2021-04-29 09:47:49,766-backbone1	1350	nan	nan	nan	nan	nan
Training: 2021-04-29 09:47:49,777-Speed 3485.90 samples/sec   Loss1 nan   Loss2 0.0000   Epoch: 3   Global Step: 1350   Required: 2 hours
Training: 2021-04-29 09:48:18,106-backbone1	1400	nan	nan	nan	nan	nan
Training: 2021-04-29 09:48:18,116-Speed 3613.38 samples/sec   Loss1 nan   Loss2 0.0000   Epoch: 3   Global Step: 1400   Required: 2 hours
Training: 2021-04-29 09:48:46,211-backbone1	1450	nan	nan	nan	nan	nan
Training: 2021-04-29 09:48:46,221-Speed 3643.52 samples/sec   Loss1 nan   Loss2 0.0000   Epoch: 3   Global Step: 1450   Required: 2 hours
Training: 2021-04-29 09:49:14,437-backbone1	1500	nan	nan	nan	nan	nan
Training: 2021-04-29 09:49:14,447-Speed 3627.91 samples/sec   Loss1 nan   Loss2 0.0000   Epoch: 3   Global Step: 1500   Required: 2 hours
Training: 2021-04-29 09:49:42,574-backbone1	1550	nan	nan	nan	nan	nan
Training: 2021-04-29 09:49:42,584-Speed 3639.35 samples/sec   Loss1 nan   Loss2 0.0000   Epoch: 3   Global Step: 1550   Required: 2 hours
Training: 2021-04-29 09:50:10,686-backbone1	1600	nan	nan	nan	nan	nan
Training: 2021-04-29 09:50:10,698-Speed 3642.45 samples/sec   Loss1 nan   Loss2 0.0000   Epoch: 3   Global Step: 1600   Required: 2 hours
testing verification..
Traceback (most recent call last):
  File "train.py", line 235, in <module>
    main(args_)
  File "train.py", line 218, in main
    callback_verification(global_step, backbone)
  File "/home/user1/pjs/frvt_pytorch/0428_bigBsRandSeedFix/2branch/recognition/arcface_torch/utils/utils_callbacks.py", line 48, in __call__
    self.ver_test(backbone, num_update)
  File "/home/user1/pjs/frvt_pytorch/0428_bigBsRandSeedFix/2branch/recognition/arcface_torch/utils/utils_callbacks.py", line 28, in ver_test
    self.ver_list[i], backbone, 10, 10)
  File "/home/user1/miniconda3/envs/py377/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/user1/pjs/frvt_pytorch/0428_bigBsRandSeedFix/2branch/recognition/arcface_torch/eval/verification.py", line 265, in test
    embeddings = sklearn.preprocessing.normalize(embeddings)
  File "/home/user1/miniconda3/envs/py377/lib/python3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/home/user1/miniconda3/envs/py377/lib/python3.7/site-packages/sklearn/preprocessing/_data.py", line 1711, in normalize
    estimator='the normalize function', dtype=FLOAT_DTYPES)
  File "/home/user1/miniconda3/envs/py377/lib/python3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/home/user1/miniconda3/envs/py377/lib/python3.7/site-packages/sklearn/utils/validation.py", line 645, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "/home/user1/miniconda3/envs/py377/lib/python3.7/site-packages/sklearn/utils/validation.py", line 99, in _assert_all_finite
    msg_dtype if msg_dtype is not None else X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Traceback (most recent call last):
  File "/home/user1/miniconda3/envs/py377/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/user1/miniconda3/envs/py377/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/user1/miniconda3/envs/py377/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/user1/miniconda3/envs/py377/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/user1/miniconda3/envs/py377/bin/python3', '-u', 'train.py', '--local_rank=7']' returned non-zero exit status 1.