# How to randomly set a fixed number of elements in each row of a tensor

I was wondering if there is a more efficient alternative to the code below, one that avoids the "for" loop in the last line?

```python
import torch

n, d = 37700, 7842
k = 4
sample = torch.cat([torch.randperm(d)[:k] for _ in range(n)]).view(n, k)
```

Basically, what I am trying to do is create an n x d mask tensor such that in each row exactly k random elements are True.
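For concreteness, the (n, k) index tensor from the code above can be turned into the desired (n, d) boolean mask with `Tensor.scatter_`; this is a sketch with small sizes, not part of the original question:

```python
import torch

n, d, k = 6, 10, 3  # small sizes for illustration

# sample k distinct column indices per row, as in the code above
idx = torch.cat([torch.randperm(d)[:k] for _ in range(n)]).view(n, k)

# scatter True at the sampled column indices of each row
mask = torch.zeros(n, d, dtype=torch.bool)
mask.scatter_(1, idx, True)

print(mask.sum(dim=1))  # each row has exactly k True entries
```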

Here’s a potential solution:

More efficient approaches are welcome.

The obvious alternative would be `torch.multinomial(torch.ones(n, d), k)`, which shaved off roughly a third of the time for me but is still somewhat slow.
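Spelled out as a runnable sketch (with small stand-in sizes rather than the original 37700 x 7842):

```python
import torch

n, d, k = 6, 10, 3  # small sizes for illustration

# With replacement=False (the default), multinomial draws k distinct
# column indices per row; uniform weights make every column equally likely.
sample = torch.multinomial(torch.ones(n, d), k)
print(sample.shape)  # torch.Size([6, 3])
```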
I can halve the time (relative to your code, on my machine, on CPU, etc.) by using rand + topk.

```python
sample = torch.rand(n, d).topk(k, dim=1).indices
```
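To check that the two approaches agree on shape and distinctness, and to get a rough feel for the timing on your own machine, something like the following can be used; the sizes are scaled down from the original, and absolute timings will vary:

```python
import time
import torch

n, d, k = 2000, 500, 4  # scaled down from 37700 x 7842 for a quick check

def perm_version():
    # original approach: one full randperm per row, keep the first k
    return torch.cat([torch.randperm(d)[:k] for _ in range(n)]).view(n, k)

def topk_version():
    # rand + topk: the indices of the k largest of d uniform draws
    # form a uniformly random k-subset of each row
    return torch.rand(n, d).topk(k, dim=1).indices

for fn in (perm_version, topk_version):
    t0 = time.perf_counter()
    out = fn()
    print(f"{fn.__name__}: {tuple(out.shape)}, {time.perf_counter() - t0:.4f}s")
```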

I imagine that for k much smaller than d you could be blazingly fast with your own GPU kernel that just loops until it has found k new indices per row: with n = 37700 independent rows, the problem is one giant opportunity for parallelization on the GPU.
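The per-row loop described above can be illustrated in plain Python; this is a CPU sketch of the kernel logic, not an actual CUDA kernel (on the GPU, each row would run in its own thread):

```python
import torch

def sample_rows_rejection(n, d, k, generator=None):
    # Per row, keep drawing random column indices until k distinct ones
    # are collected -- cheap when k << d, since collisions are rare.
    out = torch.empty(n, k, dtype=torch.long)
    for i in range(n):
        seen = set()
        while len(seen) < k:
            seen.add(int(torch.randint(d, (1,), generator=generator)))
        out[i] = torch.tensor(sorted(seen))
    return out

sample = sample_rows_rejection(8, 100, 4)
```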

Best regards

Thomas


Had I known you had gone to Stack Overflow to get the same answer, I would not have gone through the trouble of benchmarking things.
I spent 15 minutes on your problem trying to help you, only to find that you would have been fine without it.


Thank you for your answer. Actually, your answer works better for me than the one on Stack Overflow, both in terms of running time and similarity to my own code, with only one line changed.

But I think I have the right to seek an answer from different sources, right? It's not against the community guidelines, I suppose. Also, someone else could have posted an answer here (instead of Stack Overflow) a few minutes before you did, so I don't think it's my fault. Anyway, I really appreciate your time, and your answer not only helped me but could also help other people in the future.

Best regards,
Sina


Oh, you have every right to ask wherever you want, as often as you want, and so on.
I would just think it is common courtesy, and sound use of a shared resource, not to ask the same question in multiple places, causing several people to invest time helping you when they could have helped the next person instead, with much the same outcome.

In retrospect, I feel I should have done something else with my time, but then again, maybe I should not be answering questions at all.

Best regards and good luck with your project

Thomas

Again, I am so grateful for your time and effort and sorry for any inconvenience I might have caused.

Best,
Sina
