Yeah, well, the usual way to approach this is to grab a representative input and measure the function in isolation.
Benchmarking has quite a few pitfalls, in particular when CUDA's asynchronous execution is involved: kernels are launched asynchronously, so without explicit synchronization you may end up timing only the kernel launch rather than the actual computation. That makes it hard to say whether you found a case where PyTorch really is terribly slow or whether the benchmarking itself is off (I have certainly done that before, too).
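Something along these lines usually works for me (a rough sketch; `my_fn`, the input shape, and the iteration count are just placeholders for whatever you are measuring):

```python
import torch
import torch.utils.benchmark as benchmark

def my_fn(x):
    # stand-in for the function you actually want to measure
    return torch.nn.functional.softmax(x, dim=-1)

x = torch.randn(1024, 1024, device="cuda")

# Manual timing with CUDA events: synchronize before reading the result,
# otherwise you mostly measure the (asynchronous) kernel launches.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
torch.cuda.synchronize()
start.record()
for _ in range(100):
    my_fn(x)
end.record()
torch.cuda.synchronize()
print(f"{start.elapsed_time(end) / 100:.3f} ms per call")

# Or let torch.utils.benchmark take care of warm-up and synchronization:
t = benchmark.Timer(stmt="my_fn(x)", globals={"my_fn": my_fn, "x": x})
print(t.blocked_autorange())
```

I'd generally lean on `torch.utils.benchmark.Timer` for comparisons, since it handles warm-up and synchronization for you.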
Best regards
Thomas