JITed GRU too slow

I want to create some custom GRU variants (mainly layer-normalized ones).
I was following some blog posts and benchmarks (like https://github.com/pytorch/pytorch/blob/master/benchmarks/fastrnns/custom_lstms.py) and created the following small benchmark: https://gist.github.com/usamec/af21be7b83e6b1a3f38c26136af811f3. There it seems that the JITed GRU is 10 times slower than the cuDNN implementation (but 3 times faster than the non-JITed one). (Using a GeForce RTX 2080 Ti.)

It is advertised that, at least for the forward pass, the JIT runs at speeds similar to cuDNN, so what am I doing wrong?
(I tried running it multiple times, so the results are not affected by a cold start.)
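
For context, the cell and layer I am scripting look roughly like this (a minimal sketch in the style of custom_lstms.py; the actual gist may differ in details such as parameter initialization, and the 256 feature size is just a placeholder):

```python
import torch
import torch.nn as nn
from torch import Tensor
from typing import List


class LayerNormGRUCell(nn.Module):
    # Sketch of a layer-normalized GRU cell; names and details here are
    # illustrative, not copied from the gist.
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.weight_ih = nn.Parameter(torch.randn(3 * hidden_size, input_size))
        self.weight_hh = nn.Parameter(torch.randn(3 * hidden_size, hidden_size))
        self.ln_i = nn.LayerNorm(3 * hidden_size)
        self.ln_h = nn.LayerNorm(3 * hidden_size)

    def forward(self, x: Tensor, h: Tensor) -> Tensor:
        # Layer-normalize the input and hidden projections separately.
        gi = self.ln_i(torch.mm(x, self.weight_ih.t()))
        gh = self.ln_h(torch.mm(h, self.weight_hh.t()))
        i_r, i_z, i_n = gi.chunk(3, 1)
        h_r, h_z, h_n = gh.chunk(3, 1)
        r = torch.sigmoid(i_r + h_r)
        z = torch.sigmoid(i_z + h_z)
        n = torch.tanh(i_n + r * h_n)
        return (1.0 - z) * n + z * h


class GRULayer(nn.Module):
    # Unrolls the cell over time; input shape is (seq_len, batch, features).
    def __init__(self, cell: nn.Module):
        super().__init__()
        self.cell = cell

    def forward(self, input: Tensor, h: Tensor) -> Tensor:
        inputs = input.unbind(0)
        outputs = torch.jit.annotate(List[Tensor], [])
        for i in range(len(inputs)):
            h = self.cell(inputs[i], h)
            outputs.append(h)
        return torch.stack(outputs)


# Scripting the layer is what should let the fuser optimize the pointwise gate math.
layer = torch.jit.script(GRULayer(LayerNormGRUCell(256, 256))).cuda()
```

The benchmark then compares this kind of scripted layer against nn.GRU (cuDNN) on the same inputs.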

As far as I know, cuDNN is the fastest in all circumstances. Also, the JIT optimizes the code as it runs, so later batches run faster than the first few.

I tried several runs with the same inputs, so cold start is not an issue.
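
Concretely, the timing follows roughly this pattern (a sketch, not the exact gist code), with warm-up iterations excluded and CUDA synchronized before reading the clock:

```python
import time

import torch


def bench_forward(module, inp, h, warmup=10, iters=50):
    # Warm-up iterations so any TorchScript profiling/fusion work (and cuDNN
    # algorithm selection) happens outside the timed region.
    for _ in range(warmup):
        module(inp, h)
    torch.cuda.synchronize()  # CUDA kernel launches are asynchronous
    start = time.time()
    for _ in range(iters):
        module(inp, h)
    torch.cuda.synchronize()
    return (time.time() - start) / iters
```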

Can you post your benchmark script? 10 times slower is very weird, as we don't expect that large a gap.

The script is linked in the original post: https://gist.github.com/usamec/af21be7b83e6b1a3f38c26136af811f3

Also, when I increase the GRU size to 1024 features, the relative difference is much smaller (32 ms JIT vs. 25 ms cuDNN).

I have the same problem, though my variant is closer to a plain RNN than a GRU, so I would expect it to be even easier to optimize. Increasing the size shrinks the performance gap for me as well, but I need the feature size to be relatively small.

It would be great to figure out how to improve the performance of JIT RNNs. It's a really simple model (and inference in C++ is lightning fast), so I'm not sure why it's so slow.

Could you please open an issue on GitHub and paste the issue number here? I will be taking a look at this very shortly.