WTA Softmax for extreme learning with thousands to millions of class labels

I have been looking for a PyTorch, TensorFlow, or NumPy implementation of Winner-Take-All (WTA) Softmax. I have read that it helps with extreme learning, where there are millions of class labels, a case that comes up quite often in production. I also found some benchmarks showing that it outperforms Hierarchical Softmax, and it is said to be simple to apply to vision problems.

Can anyone point me to an implementation?