Hi, my CNN is a fairly simple network of 6 stacked Conv-BN-ReLU blocks, and it has to run on CPU.
The main speed bottleneck is the “nonzero” op at the end of the network.
It takes almost 90% of the total processing time, so I need to optimize this op further.
I found that torch uses OpenMP or TBB’s “parallel for”, and that Intel and Facebook collaborated on a faster version called IPEX-pytorch (maybe it works out of the box).
Can anybody tell me how to implement a faster NonZero op?
I have already tried OpenVINO and ONNX Runtime:
- OpenVINO can optimize some ops that have weights, but it doesn’t support the NonZero op.
- ONNX Runtime’s NonZero performs almost the same as torch’s nonzero op.
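
For reference, here is a minimal sketch of how the op could be timed in isolation and compared against one possible alternative. Everything in it is an assumption for illustration: the tensor shape, the 0.9 threshold, and the helper names `nonzero_via_flat` and `time_op` are hypothetical and not from my real model. The alternative runs a 1-D nonzero on the flattened mask and rebuilds the coordinates with integer arithmetic. I have not confirmed that it is faster; that likely depends on the shape, the sparsity, and the torch build, so it needs to be measured.

```python
import time
import torch


def nonzero_via_flat(mask: torch.Tensor) -> torch.Tensor:
    """Return the same (N, ndim) index tensor as torch.nonzero(mask).

    Assumes mask has at least one dimension.
    """
    # 1-D nonzero over the flattened mask gives linear indices in row-major order.
    flat = torch.nonzero(mask.reshape(-1)).squeeze(1)
    coords = []
    # Peel off coordinates from the last dimension to the first.
    for dim in reversed(mask.shape):
        coords.append(flat % dim)
        flat = torch.div(flat, dim, rounding_mode="floor")
    coords.reverse()
    return torch.stack(coords, dim=1)


def time_op(fn, repeats: int = 20) -> float:
    """Average wall-clock seconds per call, after one warm-up call."""
    fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    torch.manual_seed(0)
    # Hypothetical stand-in for the final feature map of the network.
    scores = torch.rand(1, 1, 512, 512)
    mask = scores > 0.9  # assumed threshold

    print("torch.nonzero   :", time_op(lambda: torch.nonzero(mask)))
    print("nonzero_via_flat:", time_op(lambda: nonzero_via_flat(mask)))
```

Both functions return the indices in the same row-major order, so the second can be dropped in for the first if it turns out to be faster on the target machine. It may also be worth repeating the measurement with different `torch.set_num_threads(...)` values.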