How to speed up the NonZero op?

Hi, my CNN is a fairly simple network of 6 stacked Conv-BN-ReLU blocks, and it has to run on CPU.

The main speed bottleneck is the “nonzero” op at the end of the network.
It takes almost 90% of the total processing time, so I need to optimize this op further.
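
A rough way to reproduce this kind of measurement is sketched below. It is only an illustration: the channel count, input size, and the helper names `make_net` and `time_it` are placeholders and not the real model, and the measured share will depend heavily on the actual shapes and the sparsity of the output.

```python
import time

import torch
import torch.nn as nn


def make_net(channels=16, depth=6):
    """Build a stand-in network of `depth` stacked Conv-BN-ReLU blocks (sizes are assumptions)."""
    layers = []
    in_ch = 3
    for _ in range(depth):
        layers += [
            nn.Conv2d(in_ch, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        ]
        in_ch = channels
    return nn.Sequential(*layers).eval()


def time_it(fn, repeats=5):
    """Return the average wall-clock time of fn() in seconds, after one warm-up call."""
    fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


net = make_net()
x = torch.randn(1, 3, 64, 64)  # placeholder input shape

with torch.no_grad():
    features = net(x)
    t_net = time_it(lambda: net(x))
    t_nonzero = time_it(lambda: torch.nonzero(features))

# Fraction of the combined time spent in nonzero
share = t_nonzero / (t_net + t_nonzero)
print(f"network: {t_net * 1e3:.3f} ms, nonzero: {t_nonzero * 1e3:.3f} ms, share: {share:.1%}")
```

Timing the two parts separately like this makes it easier to check whether a change (thread count, another runtime) actually affects the nonzero step rather than the convolutions.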

I found that torch uses OpenMP's or TBB’s “parallel for”, and that Intel and Facebook collaborated on a faster version called IPEX-pytorch (maybe it works out of the box).

Can anybody tell me how to implement a faster NonZero op?

I tried OpenVINO and ONNX Runtime.

  1. OpenVINO can optimize some ops that have weights, but it doesn’t support the NonZero op.
  2. ONNX Runtime's NonZero op performs almost the same as torch’s nonzero op.

Thank you!