I am trying to implement an efficient histogram method in PyTorch.
I know PyTorch already has histc and bincount, but neither of them has a 2D version.
I want to implement it in parallel so that it is fast; i.e., for a vector of k values and n bins, I want to do the k*n operations in parallel.
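To make the k*n idea concrete, this is roughly what I have in mind, written with plain tensor ops instead of a custom kernel. It is only a sketch under my own assumptions: the name `hist2d_dense` and its parameters are placeholders I made up (not a PyTorch API), both axes share the same equal-width bins over [lo, hi), and values outside that range (including exactly hi) are simply dropped.

```python
import torch

def hist2d_dense(x, y, bins, lo, hi):
    """Sketch of a 2D histogram using a dense (k, n) membership matrix per axis.

    Hypothetical helper, not part of PyTorch. x and y are 1D float tensors
    of the same length k; both axes use `bins` equal-width bins on [lo, hi).
    """
    # Bin edges, created on the same device and dtype as the input.
    edges = torch.linspace(lo, hi, bins + 1, device=x.device, dtype=x.dtype)
    left, right = edges[:-1], edges[1:]

    # k*n comparisons per axis, done in one broadcasted operation.
    # in_x[i, j] is True when x[i] falls in bin j (half-open interval).
    in_x = (x[:, None] >= left) & (x[:, None] < right)
    in_y = (y[:, None] >= left) & (y[:, None] < right)

    # (n, k) @ (k, n) counts how many samples land in each (x bin, y bin) pair.
    # float32 counts are exact only up to 2**24 per cell.
    return in_x.to(torch.float32).T @ in_y.to(torch.float32)
```

This uses O(k*n) memory for the membership matrices, which is the part I would hope a dedicated kernel could avoid.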
I have looked at the maximum grid and block sizes for CUDA, and this is doable given my number of bins and the size of my vector.
How should I go about implementing this so that it integrates with PyTorch?