Why is inference with libtorch 1.9 + CUDA 10.2 faster than with libtorch 1.9 + CUDA 11.1?

Environment:
Windows 10: 19042.1052

cuda11.1+libtorch1.9

test start ...
[W TensorImpl.h:1156] Warning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (function operator ())
pred takes : 6394 ms
ng! image_score = 0.0697295 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\000.png
pred takes : 17 ms
ng! image_score = 0.119712 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\001.png
pred takes : 16 ms
ng! image_score = 0.0864281 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\002.png
pred takes : 15 ms
ng! image_score = 0.130975 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\003.png
pred takes : 17 ms
ng! image_score = 0.230995 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\004.png
pred takes : 16 ms
ng! image_score = 0.0709326 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\005.png
pred takes : 16 ms
ng! image_score = 0.129408 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\006.png
pred takes : 16 ms
ng! image_score = 0.0994181 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\007.png
pred takes : 15 ms
ng! image_score = 0.16893 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\008.png
pred takes : 16 ms
ng! image_score = 0.138191 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\009.png
pred takes : 16 ms
ng! image_score = 0.0831397 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\010.png
pred takes : 18 ms
ng! image_score = 0.0994464 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\011.png
pred takes : 14 ms
ng! image_score = 0.116436 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\012.png
pred takes : 16 ms
ng! image_score = 0.063642 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\013.png
pred takes : 14 ms
ng! image_score = 0.272481 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\014.png
pred takes : 13 ms
ng! image_score = 0.164345 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\015.png
pred takes : 16 ms
ng! image_score = 0.24891 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\016.png
pred takes : 17 ms
ng! image_score = 0.103657 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\017.png
pred takes : 13 ms
ng! image_score = 0.16008 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\018.png
pred takes : 15 ms
ng! image_score = 0.0829229 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\019.png
good images number = 0
ng images number = 20
done.

cuda10.2+libtorch1.9

test start ...
[W TensorImpl.h:1156] Warning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (function operator ())
pred takes : 1390 ms
ng! image_score = 0.0697295 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\000.png
pred takes : 15 ms
ng! image_score = 0.119712 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\001.png
pred takes : 13 ms
ng! image_score = 0.086428 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\002.png
pred takes : 14 ms
ng! image_score = 0.130975 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\003.png
pred takes : 13 ms
ng! image_score = 0.230995 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\004.png
pred takes : 14 ms
ng! image_score = 0.0709326 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\005.png
pred takes : 12 ms
ng! image_score = 0.129408 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\006.png
pred takes : 13 ms
ng! image_score = 0.0994181 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\007.png
pred takes : 13 ms
ng! image_score = 0.16893 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\008.png
pred takes : 13 ms
ng! image_score = 0.138191 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\009.png
pred takes : 12 ms
ng! image_score = 0.0831397 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\010.png
pred takes : 13 ms
ng! image_score = 0.0994462 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\011.png
pred takes : 13 ms
ng! image_score = 0.116436 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\012.png
pred takes : 12 ms
ng! image_score = 0.0636419 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\013.png
pred takes : 14 ms
ng! image_score = 0.272481 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\014.png
pred takes : 12 ms
ng! image_score = 0.164345 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\015.png
pred takes : 12 ms
ng! image_score = 0.24891 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\016.png
pred takes : 14 ms
ng! image_score = 0.103657 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\017.png
pred takes : 14 ms
ng! image_score = 0.16008 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\018.png
pred takes : 13 ms
ng! image_score = 0.0829229 thre = 0.0186527
F:\workspace\DataSet\open_source\MVtec_AD\bottle\test\broken_large\dst\\019.png
good images number = 0
ng images number = 20
done.

Depending on the model, it could be a regression in e.g. cuBLAS, cuDNN, or another library.
You could profile the code as described here, or use the PyTorch profiler, to try to isolate the bottlenecks.

What you linked describes profiling in the Python version. How can I do the same in libtorch with C++? Are there any examples? Thanks.

Nsight Systems can profile any application, so it should work with your libtorch app as well.
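
Before opening a profiler, it can help to separate the one-time cost of the first call from the steady-state cost. In the logs above, the first prediction takes 6394 ms with CUDA 11.1 versus 1390 ms with CUDA 10.2, while later predictions are roughly 13 to 18 ms versus 12 to 15 ms. The following is a minimal sketch of such a measurement in plain C++. `TimingSummary` and `time_steps` are hypothetical names made up for this example and are not part of libtorch. It assumes the callable you pass in blocks until the GPU has finished, for example by synchronizing the device or copying the output to the CPU inside it; otherwise the clock only measures the kernel launch.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical helper type (not a libtorch API): summary of one timing run.
struct TimingSummary {
    double first_call_ms;   // very first call, includes one-time setup cost
    double steady_mean_ms;  // mean over the calls after warm-up
    double steady_min_ms;   // fastest call after warm-up
    double steady_max_ms;   // slowest call after warm-up
};

// Calls `step` total_runs times and times each call with a monotonic clock.
// The first warmup_runs calls are excluded from the steady-state statistics.
// Assumption: `step` returns only after all of its work, including any GPU
// work, has completed.
inline TimingSummary time_steps(const std::function<void()>& step,
                                std::size_t warmup_runs,
                                std::size_t total_runs) {
    assert(warmup_runs >= 1 && total_runs > warmup_runs);
    using clock = std::chrono::steady_clock;

    std::vector<double> ms;
    ms.reserve(total_runs);
    for (std::size_t i = 0; i < total_runs; ++i) {
        const auto t0 = clock::now();
        step();
        const auto t1 = clock::now();
        ms.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }

    // Iterator to the first call that counts as steady state.
    const auto first = ms.begin() + static_cast<std::ptrdiff_t>(warmup_runs);
    TimingSummary s{};
    s.first_call_ms = ms.front();
    s.steady_mean_ms = std::accumulate(first, ms.end(), 0.0) /
                       static_cast<double>(total_runs - warmup_runs);
    s.steady_min_ms = *std::min_element(first, ms.end());
    s.steady_max_ms = *std::max_element(first, ms.end());
    return s;
}
```

In a libtorch program, `step` could wrap the model's forward call followed by a device synchronization; those calls are left out here so the sketch compiles without libtorch. If only `first_call_ms` differs much between the two builds, the gap is more likely in initialization than in the kernels themselves, which narrows down what to look for in the Nsight Systems timeline.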

Okay, let me try. Thank you.

https://drive.google.com/file/d/1GiaOK7xxBaj4TdoAvt19GGXOyM8DlDnk/view?usp=sharing
This is the result of my experiment. Please help me find out what went wrong. Thank you very much.