Adaptive Softmax

Are there any plans to include the adaptive softmax function described in the paper “Efficient softmax approximation for GPUs” in PyTorch?
GitHub repo:

Hi, somebody has already implemented adaptive softmax in PyTorch (MIT license); see:

There’s a bug in it; see the comment here:

I’m currently running a language model experiment with this, and it seems to be working, but I don’t know enough to fully assess the quality of this implementation.

Thanks. Yes, it’s the only implementation I found other than the Lua package released by the authors. I was looking for a PyTorch module that I could use off the shelf without getting into too much detail, but it turns out I’ll have to. Could you comment on your perplexity score, vocabulary size, and the relative speedup? Just curious whether it’s worth the trouble.

Sure, I can get back to you on the relative speeds tomorrow, for both training and inference. The model is not for English and does not use any public dataset, though. Vocabulary size is 800k.
Speed is also the relevant factor for us, but integrating it from the example above into an existing PyTorch model took less than an hour, so dev costs are not that high.
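
For readers who want the core idea before digging into any of the implementations, here is a minimal pure-Python sketch of the two-level factorization behind adaptive softmax. It is not the code from the linked repo or from the paper's authors. It assumes a single tail cluster, takes the scores as plain lists, and uses hypothetical names (`log_softmax`, `adaptive_log_prob`, `cutoff`); the paper uses several clusters and smaller projections for the rare words.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def adaptive_log_prob(word, head_scores, tail_scores, cutoff):
    """Log-probability of `word` under a two-level adaptive softmax.

    Assumptions of this sketch (not taken from the thread):
    - words are indexed by frequency rank, most frequent first;
    - head_scores has cutoff + 1 entries: one per frequent word,
      plus a final entry standing for "the word is in the tail";
    - tail_scores has one entry per rare word.
    """
    head = log_softmax(head_scores)
    if word < cutoff:
        # Frequent word: only the small head softmax is evaluated.
        return head[word]
    # Rare word: log p(tail cluster) + log p(word | tail cluster).
    tail = log_softmax(tail_scores)
    return head[cutoff] + tail[word - cutoff]


# Toy vocabulary of 5 words: 2 frequent words in the head, 3 in the tail.
head_scores = [2.0, 1.0, 0.5]
tail_scores = [0.3, -0.2, 0.1]
total = sum(
    math.exp(adaptive_log_prob(w, head_scores, tail_scores, cutoff=2))
    for w in range(5)
)
```

The probabilities still sum to one over the whole vocabulary, and the speedup comes from the fact that most tokens are frequent words, so the large tail softmax is evaluated only for the rare ones.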

For anyone interested, I wrote a blog post explaining the adaptive softmax, with a PyTorch implementation: