This problem has confused me for many days, and I'm very eager to discuss it with someone.
I have built a Siamese model for classification: two ResNet18 branches with identical structure and shared weights. I fixed the random seed for reproducibility.
To validate my implementation, I first froze one of the two branches (no weight updates) and trained only the other. However, training collapses easily: the loss becomes NaN.
I don't understand why.
Setup: batch size 2048, learning rate 0.1, 870,672 input pairs (the Siamese network takes image pairs as input), three verification datasets, and 10,575 target classes. Each branch has its own softmax loss, but in the training below only one loss is used, because the other branch is frozen for validation.
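As an aside: with a batch of 2048, lr = 0.1, and 10,575 classes, one standard stabilizer I am considering is learning-rate warmup. A minimal sketch (the function name and step counts are illustrative, not from my actual train.py):

```python
def warmup_lr(step, base_lr=0.1, warmup_steps=1000):
    """Linearly ramp the learning rate over the first warmup_steps updates,
    then hold it at base_lr. Large-batch training often diverges if the
    full learning rate is applied from step 0."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

This would be applied per optimizer step, e.g. `for g in optimizer.param_groups: g['lr'] = warmup_lr(step)`.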
What I observed:
- When I reduce the batch size, training returns to normal.
- When I don't fix the random seed, training returns to normal.
- When I drop the Siamese structure and train a single model as a baseline, training returns to normal.
But the core question is: why? I'm quite confused about this.
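While debugging, I have been thinking about a guard like the following so that a non-finite loss fails fast instead of silently corrupting the weights (this is only a sketch; `model`, `criterion`, and `optimizer` are placeholders for the real objects in train.py, and gradient clipping is something I might try, not something the original code does):

```python
import torch
import torch.nn as nn

def train_step(model, criterion, optimizer, images, labels, max_norm=5.0):
    optimizer.zero_grad()
    logits = model(images)
    loss = criterion(logits, labels)
    if not torch.isfinite(loss):
        # Fail immediately instead of propagating NaN/Inf into the weights.
        raise RuntimeError(f"non-finite loss: {loss.item()}")
    loss.backward()
    # Clip the global gradient norm; large batches + lr=0.1 can otherwise
    # produce a single huge step that blows up the softmax logits.
    nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```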
The code I used to fix the random seed:
################## for reproducibility #####################
import os
import random

import numpy as np
import torch

GLOBAL_SEED = 2  # seed = 1 collapses at epoch 2, iter 1200; seed = 2 collapses at epoch 3, iter 1300
GLOBAL_WORKER_ID = None

torch.backends.cudnn.benchmark = False  # if benchmark=True, deterministic will be False
torch.backends.cudnn.deterministic = True

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

def worker_init_fn(worker_id):
    global GLOBAL_WORKER_ID
    GLOBAL_WORKER_ID = worker_id
    set_seed(GLOBAL_SEED + worker_id)

set_seed(GLOBAL_SEED)
################## for reproducibility #####################
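For context, `worker_init_fn` above is wired into the DataLoader roughly like this (the dataset here is a placeholder; the real one yields image pairs, and the simplified `worker_init_fn` below mirrors the one in my snippet):

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

GLOBAL_SEED = 2

def worker_init_fn(worker_id):
    # Give each DataLoader worker a distinct but deterministic seed.
    seed = GLOBAL_SEED + worker_id
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

# Placeholder dataset; the real one yields 870,672 image pairs.
dataset = TensorDataset(torch.randn(16, 3, 8, 8), torch.randint(0, 10, (16,)))
loader = DataLoader(dataset, batch_size=4, shuffle=True,
                    num_workers=2, worker_init_fn=worker_init_fn)

for images, labels in loader:
    pass  # each worker process is reseeded before loading its batches
```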
The training log and error report:
(The five floating-point numbers on each line are the mean backward gradients of 'module.layer1.0.conv1.weight', 'module.layer2.0.conv1.weight', 'module.layer3.0.conv1.weight', 'module.layer4.0.conv1.weight', and the gradient of the last layer's loss.)
Training: 2021-04-29 09:31:37,519-backbone1 50 -0.0000137358 -0.0000043800 -0.0000008597 -0.0000000811 -0.0001298849
Training: 2021-04-29 09:32:06,250-backbone1 100 0.0000073106 -0.0000006519 -0.0000003127 0.0000000286 -0.0000465577
Training: 2021-04-29 09:32:06,260-Speed 3564.05 samples/sec Loss1 49.1559 Loss2 0.0000 Epoch: 0 Global Step: 100 Required: 2 hours
Training: 2021-04-29 09:32:35,429-backbone1 150 0.0000032515 -0.0000003846 -0.0000003430 0.0000000969 0.0000119228
Training: 2021-04-29 09:32:35,439-Speed 3509.50 samples/sec Loss1 41.0964 Loss2 0.0000 Epoch: 0 Global Step: 150 Required: 2 hours
Training: 2021-04-29 09:33:04,455-backbone1 200 0.0000059055 -0.0000007602 -0.0000031270 -0.0000003199 0.0000291518
Training: 2021-04-29 09:33:04,466-Speed 3527.79 samples/sec Loss1 40.2287 Loss2 0.0000 Epoch: 0 Global Step: 200 Required: 2 hours
Training: 2021-04-29 09:33:33,504-backbone1 250 0.0000039045 -0.0000049022 -0.0000051885 -0.0000007093 0.0000097477
Training: 2021-04-29 09:33:33,515-Speed 3525.17 samples/sec Loss1 39.6313 Loss2 0.0000 Epoch: 0 Global Step: 250 Required: 2 hours
Training: 2021-04-29 09:34:02,602-backbone1 300 -0.0000067733 0.0000056458 0.0000001776 -0.0000001335 -0.0000031163
Training: 2021-04-29 09:34:02,613-Speed 3519.13 samples/sec Loss1 39.0091 Loss2 0.0000 Epoch: 0 Global Step: 300 Required: 2 hours
Training: 2021-04-29 09:34:32,103-backbone1 350 -0.0000052458 -0.0000033983 0.0000027793 0.0000003088 0.0000327631
Training: 2021-04-29 09:34:32,114-Speed 3471.10 samples/sec Loss1 38.2565 Loss2 0.0000 Epoch: 0 Global Step: 350 Required: 2 hours
Training: 2021-04-29 09:35:01,292-backbone1 400 0.0000029013 -0.0000019848 -0.0000013949 0.0000003473 0.0000103362
Training: 2021-04-29 09:35:01,302-Speed 3508.34 samples/sec Loss1 37.2425 Loss2 0.0000 Epoch: 0 Global Step: 400 Required: 2 hours
testing verification..
(12000, 512)
infer time 19.819692000000018
Training: 2021-04-29 09:35:23,950-[lfw][400]XNorm: 28.534864
Training: 2021-04-29 09:35:23,950-[lfw][400]Accuracy-Flip: 0.75733+-0.01886
Training: 2021-04-29 09:35:23,950-[lfw][400]Accuracy-Highest: 0.75733
testing verification..
(14000, 512)
infer time 22.327677000000026
Training: 2021-04-29 09:35:49,414-[cfp_fp][400]XNorm: 32.921188
Training: 2021-04-29 09:35:49,414-[cfp_fp][400]Accuracy-Flip: 0.57457+-0.01552
Training: 2021-04-29 09:35:49,414-[cfp_fp][400]Accuracy-Highest: 0.57457
testing verification..
(12000, 512)
infer time 19.14271100000001
Training: 2021-04-29 09:36:11,377-[agedb_30][400]XNorm: 28.411715
Training: 2021-04-29 09:36:11,378-[agedb_30][400]Accuracy-Flip: 0.50517+-0.00825
Training: 2021-04-29 09:36:11,378-[agedb_30][400]Accuracy-Highest: 0.50517
Training: 2021-04-29 09:36:40,842-backbone1 450 0.0000034773 0.0000083096 0.0000005531 0.0000002907 -0.0000192381
Training: 2021-04-29 09:36:40,852-Speed 1028.64 samples/sec Loss1 35.9364 Loss2 0.0000 Epoch: 1 Global Step: 450 Required: 3 hours
Training: 2021-04-29 09:37:09,992-backbone1 500 -0.0000086446 -0.0000030455 0.0000015252 -0.0000009439 -0.0000184080
Training: 2021-04-29 09:37:10,002-Speed 3512.90 samples/sec Loss1 34.4943 Loss2 0.0000 Epoch: 1 Global Step: 500 Required: 2 hours
Training: 2021-04-29 09:37:39,172-backbone1 550 0.0000060328 0.0000011825 -0.0000021444 0.0000010476 0.0000390993
Training: 2021-04-29 09:37:39,182-Speed 3509.24 samples/sec Loss1 32.9047 Loss2 0.0000 Epoch: 1 Global Step: 550 Required: 2 hours
Training: 2021-04-29 09:38:08,557-backbone1 600 0.0000096855 -0.0000213673 -0.0000070799 0.0000001221 -0.0000334739
Training: 2021-04-29 09:38:08,568-Speed 3484.76 samples/sec Loss1 31.2441 Loss2 0.0000 Epoch: 1 Global Step: 600 Required: 2 hours
Training: 2021-04-29 09:38:37,834-backbone1 650 -0.0000215237 -0.0000134368 -0.0000043527 -0.0000016915 0.0000000888
Training: 2021-04-29 09:38:37,845-Speed 3497.65 samples/sec Loss1 29.5965 Loss2 0.0000 Epoch: 1 Global Step: 650 Required: 2 hours
Training: 2021-04-29 09:39:07,422-backbone1 700 -0.0000863088 0.0000031217 -0.0000091112 -0.0000066638 0.0000530388
Training: 2021-04-29 09:39:07,433-Speed 3460.90 samples/sec Loss1 28.0384 Loss2 0.0000 Epoch: 1 Global Step: 700 Required: 2 hours
Training: 2021-04-29 09:39:37,182-backbone1 750 -0.0000304818 0.0000089085 -0.0000000184 -0.0000025467 0.0000559710
Training: 2021-04-29 09:39:37,192-Speed 3440.95 samples/sec Loss1 26.7838 Loss2 0.0000 Epoch: 1 Global Step: 750 Required: 2 hours
Training: 2021-04-29 09:40:06,530-backbone1 800 0.0000194513 -0.0000321974 -0.0000130659 -0.0000101980 0.0000749629
Training: 2021-04-29 09:40:06,541-Speed 3489.08 samples/sec Loss1 25.5273 Loss2 0.0000 Epoch: 1 Global Step: 800 Required: 2 hours
testing verification..
(12000, 512)
infer time 19.298425999999996
Training: 2021-04-29 09:40:28,642-[lfw][800]XNorm: 20.632740
Training: 2021-04-29 09:40:28,642-[lfw][800]Accuracy-Flip: 0.78467+-0.02390
Training: 2021-04-29 09:40:28,642-[lfw][800]Accuracy-Highest: 0.78467
testing verification..
(14000, 512)
infer time 22.323161000000002
Training: 2021-04-29 09:40:54,134-[cfp_fp][800]XNorm: 23.564507
Training: 2021-04-29 09:40:54,134-[cfp_fp][800]Accuracy-Flip: 0.51271+-0.01559
Training: 2021-04-29 09:40:54,135-[cfp_fp][800]Accuracy-Highest: 0.57457
testing verification..
(12000, 512)
infer time 19.17494900000001
Training: 2021-04-29 09:41:16,108-[agedb_30][800]XNorm: 20.196208
Training: 2021-04-29 09:41:16,109-[agedb_30][800]Accuracy-Flip: 0.55300+-0.02642
Training: 2021-04-29 09:41:16,109-[agedb_30][800]Accuracy-Highest: 0.55300
Training: 2021-04-29 09:41:16,111-SAVE /data/user1/log/frvt_pytorch/r18_0428_singleBranch1/backbone-800.pth
Training: 2021-04-29 09:41:45,760-backbone1 850 0.0000074828 -0.0000015774 0.0000068587 0.0000018232 -0.0000163120
Training: 2021-04-29 09:41:45,769-Speed 1031.97 samples/sec Loss1 24.5892 Loss2 0.0000 Epoch: 1 Global Step: 850 Required: 2 hours
Training: 2021-04-29 09:42:15,790-backbone1 900 -0.0000157441 -0.0000004449 -0.0000069333 -0.0000003870 -0.0000270364
Training: 2021-04-29 09:42:15,800-Speed 3409.87 samples/sec Loss1 23.4581 Loss2 0.0000 Epoch: 2 Global Step: 900 Required: 2 hours
Training: 2021-04-29 09:42:44,992-backbone1 950 0.0000218877 -0.0000229170 -0.0000062682 0.0000033801 -0.0000269744
Training: 2021-04-29 09:42:45,002-Speed 3506.60 samples/sec Loss1 22.9516 Loss2 0.0000 Epoch: 2 Global Step: 950 Required: 2 hours
Training: 2021-04-29 09:43:14,325-backbone1 1000 -0.0000338137 0.0000102696 0.0000077743 0.0000050223 -0.0000190491
Training: 2021-04-29 09:43:14,336-Speed 3490.94 samples/sec Loss1 22.2874 Loss2 0.0000 Epoch: 2 Global Step: 1000 Required: 2 hours
Training: 2021-04-29 09:43:43,777-backbone1 1050 -0.0000002878 0.0000241583 0.0000045870 0.0000014826 0.0000152171
Training: 2021-04-29 09:43:43,788-Speed 3476.86 samples/sec Loss1 21.8218 Loss2 0.0000 Epoch: 2 Global Step: 1050 Required: 2 hours
Training: 2021-04-29 09:44:13,120-backbone1 1100 0.0000183425 0.0000207572 0.0000063101 -0.0000013475 0.0000424789
Training: 2021-04-29 09:44:13,131-Speed 3489.83 samples/sec Loss1 21.2046 Loss2 0.0000 Epoch: 2 Global Step: 1100 Required: 2 hours
Training: 2021-04-29 09:44:42,763-backbone1 1150 -0.0000001899 -0.0000095896 -0.0000034797 -0.0000012843 0.0003032629
Training: 2021-04-29 09:44:42,774-Speed 3454.45 samples/sec Loss1 19.9759 Loss2 0.0000 Epoch: 2 Global Step: 1150 Required: 2 hours
Training: 2021-04-29 09:45:12,098-backbone1 1200 0.0000205074 0.0001122311 -0.0000007625 0.0000006167 0.0062720394
Training: 2021-04-29 09:45:12,109-Speed 3490.74 samples/sec Loss1 12.3644 Loss2 0.0000 Epoch: 2 Global Step: 1200 Required: 2 hours
testing verification..
(12000, 512)
infer time 19.27514199999998
Training: 2021-04-29 09:45:34,196-[lfw][1200]XNorm: 1861.459197
Training: 2021-04-29 09:45:34,196-[lfw][1200]Accuracy-Flip: 0.54150+-0.02608
Training: 2021-04-29 09:45:34,196-[lfw][1200]Accuracy-Highest: 0.78467
testing verification..
(14000, 512)
infer time 22.579255999999955
Training: 2021-04-29 09:46:00,088-[cfp_fp][1200]XNorm: 2571.969116
Training: 2021-04-29 09:46:00,088-[cfp_fp][1200]Accuracy-Flip: 0.50014+-0.00837
Training: 2021-04-29 09:46:00,088-[cfp_fp][1200]Accuracy-Highest: 0.57457
testing verification..
(12000, 512)
infer time 19.345285000000032
Training: 2021-04-29 09:46:22,217-[agedb_30][1200]XNorm: 1319.737781
Training: 2021-04-29 09:46:22,217-[agedb_30][1200]Accuracy-Flip: 0.51717+-0.02199
Training: 2021-04-29 09:46:22,217-[agedb_30][1200]Accuracy-Highest: 0.55300
Training: 2021-04-29 09:46:22,221-SAVE /data/user1/log/frvt_pytorch/r18_0428_singleBranch1/backbone-1200.pth
Training: 2021-04-29 09:46:50,760-backbone1 1250 -0.0000009155 -0.0000001523 -0.0000000194 0.0000000129 0.0117469588
Training: 2021-04-29 09:46:50,771-Speed 1037.89 samples/sec Loss1 3.3284 Loss2 0.0000 Epoch: 2 Global Step: 1250 Required: 2 hours
Training: 2021-04-29 09:47:20,391-backbone1 1300 0.0000003243 0.0000012285 -0.0000003732 0.0000021354 0.0076964526
Training: 2021-04-29 09:47:20,401-Speed 3456.02 samples/sec Loss1 3.2593 Loss2 0.0000 Epoch: 3 Global Step: 1300 Required: 2 hours
Training: 2021-04-29 09:47:49,766-backbone1 1350 nan nan nan nan nan
Training: 2021-04-29 09:47:49,777-Speed 3485.90 samples/sec Loss1 nan Loss2 0.0000 Epoch: 3 Global Step: 1350 Required: 2 hours
Training: 2021-04-29 09:48:18,106-backbone1 1400 nan nan nan nan nan
Training: 2021-04-29 09:48:18,116-Speed 3613.38 samples/sec Loss1 nan Loss2 0.0000 Epoch: 3 Global Step: 1400 Required: 2 hours
Training: 2021-04-29 09:48:46,211-backbone1 1450 nan nan nan nan nan
Training: 2021-04-29 09:48:46,221-Speed 3643.52 samples/sec Loss1 nan Loss2 0.0000 Epoch: 3 Global Step: 1450 Required: 2 hours
Training: 2021-04-29 09:49:14,437-backbone1 1500 nan nan nan nan nan
Training: 2021-04-29 09:49:14,447-Speed 3627.91 samples/sec Loss1 nan Loss2 0.0000 Epoch: 3 Global Step: 1500 Required: 2 hours
Training: 2021-04-29 09:49:42,574-backbone1 1550 nan nan nan nan nan
Training: 2021-04-29 09:49:42,584-Speed 3639.35 samples/sec Loss1 nan Loss2 0.0000 Epoch: 3 Global Step: 1550 Required: 2 hours
Training: 2021-04-29 09:50:10,686-backbone1 1600 nan nan nan nan nan
Training: 2021-04-29 09:50:10,698-Speed 3642.45 samples/sec Loss1 nan Loss2 0.0000 Epoch: 3 Global Step: 1600 Required: 2 hours
testing verification..
Traceback (most recent call last):
  File "train.py", line 235, in <module>
    main(args_)
  File "train.py", line 218, in main
    callback_verification(global_step, backbone)
  File "/home/user1/pjs/frvt_pytorch/0428_bigBsRandSeedFix/2branch/recognition/arcface_torch/utils/utils_callbacks.py", line 48, in __call__
    self.ver_test(backbone, num_update)
  File "/home/user1/pjs/frvt_pytorch/0428_bigBsRandSeedFix/2branch/recognition/arcface_torch/utils/utils_callbacks.py", line 28, in ver_test
    self.ver_list[i], backbone, 10, 10)
  File "/home/user1/miniconda3/envs/py377/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/user1/pjs/frvt_pytorch/0428_bigBsRandSeedFix/2branch/recognition/arcface_torch/eval/verification.py", line 265, in test
    embeddings = sklearn.preprocessing.normalize(embeddings)
  File "/home/user1/miniconda3/envs/py377/lib/python3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/home/user1/miniconda3/envs/py377/lib/python3.7/site-packages/sklearn/preprocessing/_data.py", line 1711, in normalize
    estimator='the normalize function', dtype=FLOAT_DTYPES)
  File "/home/user1/miniconda3/envs/py377/lib/python3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/home/user1/miniconda3/envs/py377/lib/python3.7/site-packages/sklearn/utils/validation.py", line 645, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "/home/user1/miniconda3/envs/py377/lib/python3.7/site-packages/sklearn/utils/validation.py", line 99, in _assert_all_finite
    msg_dtype if msg_dtype is not None else X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Traceback (most recent call last):
  File "/home/user1/miniconda3/envs/py377/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/user1/miniconda3/envs/py377/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/user1/miniconda3/envs/py377/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/user1/miniconda3/envs/py377/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/user1/miniconda3/envs/py377/bin/python3', '-u', 'train.py', '--local_rank=7']' returned non-zero exit status 1.
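Note that the ValueError above is just downstream fallout: once the backbone weights are NaN, every embedding is NaN, so sklearn's normalize rejects the array. A sketch of a pre-check I could put in front of the verification call to fail with a clearer message (the function name is my own, not part of the arcface_torch code):

```python
import numpy as np

def checked_normalize(embeddings):
    """L2-normalize rows, failing loudly if the backbone already produced NaN/Inf."""
    if not np.isfinite(embeddings).all():
        bad = int(np.count_nonzero(~np.isfinite(embeddings)))
        raise ValueError(f"{bad} non-finite values in embeddings; "
                         "the backbone likely diverged earlier in training")
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)
```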