NMS on GPU takes much more time than in Caffe

Yeah, well, the usual way to do these things is to grab an input and try to measure the function in isolation.
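Benchmarking has quite a few pitfalls, in particular when asynchronous CUDA computation is involved: kernel launches return before the GPU has finished, so a timer that is not preceded and followed by a synchronization measures the wrong thing. That makes it hard to say whether you found a case where PyTorch is indeed terribly slow or whether you’ve just screwed up your benchmarking (I have certainly done that before, too).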
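
To make "measure the function in isolation" concrete, here is a minimal sketch of a timing helper. It is only an illustration, not the one correct way to do it: `benchmark`, its `sync` parameter, and the warm-up and iteration counts are names and values I made up for this example. The helper itself uses only the standard library, so it runs anywhere; for GPU code the assumption is that you pass `torch.cuda.synchronize` as `sync`.

```python
import time
from statistics import median


def benchmark(fn, sync=None, warmup=5, iters=20):
    """Return the median wall-clock time in seconds of one call to fn().

    sync is an optional callable that blocks until all pending work is
    done. For CUDA code in PyTorch, torch.cuda.synchronize is the intended
    argument (an assumption of this sketch; the helper never imports torch).
    """
    def wait():
        if sync is not None:
            sync()

    # Warm-up calls absorb one-time costs (lazy initialization, caches,
    # memory allocation) so they do not end up in the measurement.
    for _ in range(warmup):
        fn()

    times = []
    for _ in range(iters):
        wait()  # nothing from earlier calls should still be running
        start = time.perf_counter()
        fn()
        wait()  # wait for the work launched by fn() to actually finish
        times.append(time.perf_counter() - start)
    return median(times)


if __name__ == "__main__":
    # CPU-only stand-in so the sketch runs without a GPU.
    seconds = benchmark(lambda: sum(range(100_000)))
    print(f"{seconds * 1e3:.3f} ms per call")
```

With PyTorch this would be called roughly as `benchmark(lambda: my_nms(boxes, scores), sync=torch.cuda.synchronize)`, where `my_nms`, `boxes` and `scores` are placeholders for your own function and a saved input. Using the median rather than the mean keeps a single slow outlier from dominating the result.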

Best regards

Thomas