I want to create some custom GRU variants (mainly layer-normalized ones).
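For context, the kind of cell I mean is roughly the following, a simplified sketch of a layer-normalized GRU cell in the style of custom_lstms.py, not the exact code from my gist (names like LayerNormGRUCell and GRULayer are just placeholders):

```python
import torch
import torch.nn as nn

class LayerNormGRUCell(nn.Module):
    """GRU cell with LayerNorm on the gate pre-activations
    (one common variant; the gist's exact cell may differ)."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.weight_ih = nn.Parameter(torch.randn(3 * hidden_size, input_size))
        self.weight_hh = nn.Parameter(torch.randn(3 * hidden_size, hidden_size))
        self.ln_ih = nn.LayerNorm(3 * hidden_size)
        self.ln_hh = nn.LayerNorm(3 * hidden_size)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # Normalize input and hidden projections separately, then
        # apply the usual GRU gate equations.
        gi = self.ln_ih(torch.mm(x, self.weight_ih.t()))
        gh = self.ln_hh(torch.mm(h, self.weight_hh.t()))
        i_r, i_z, i_n = gi.chunk(3, dim=1)
        h_r, h_z, h_n = gh.chunk(3, dim=1)
        r = torch.sigmoid(i_r + h_r)
        z = torch.sigmoid(i_z + h_z)
        n = torch.tanh(i_n + r * h_n)
        return (1 - z) * n + z * h

class GRULayer(nn.Module):
    """Unrolls the cell over time; this loop is what TorchScript is
    supposed to optimize (e.g. by fusing the pointwise ops)."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.cell = LayerNormGRUCell(input_size, hidden_size)

    def forward(self, xs: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # xs: (seq_len, batch, input_size)
        for t in range(xs.size(0)):
            h = self.cell(xs[t], h)
        return h

layer = torch.jit.script(GRULayer(256, 256)).cuda()
```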
I was following some blog posts and benchmarks (like https://github.com/pytorch/pytorch/blob/master/benchmarks/fastrnns/custom_lstms.py) and created the following small benchmark: https://gist.github.com/usamec/af21be7b83e6b1a3f38c26136af811f3. There it seems that the JITed GRU is 10 times slower than the cuDNN implementation (though 3 times faster than the non-JITed one). (Using a GeForce RTX 2080 Ti.)
It is advertised that at least the JIT forward pass runs at a speed comparable to cuDNN, so what am I doing wrong?
(I tried running it multiple times, so the results are not affected by a cold start.)
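For completeness, my timing follows the usual CUDA benchmarking pattern: warm-up iterations (which also trigger JIT compilation) plus torch.cuda.synchronize() before reading the clock. Roughly this shape, where bench and its parameters are placeholder names and the gist may differ in details:

```python
import time
import torch

def bench(fn, *args, warmup: int = 10, iters: int = 100) -> float:
    """Return average seconds per call, with warm-up and synchronization."""
    for _ in range(warmup):
        fn(*args)          # warm-up: JIT compilation, cuDNN autotuning, etc.
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()  # CUDA launches are async; wait before stopping the clock
    return (time.time() - t0) / iters
```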