Parallelize histogram calculation?

It’s easy to calculate a histogram with torch.histc, but it only computes a single histogram over all elements of its input tensor.

To be precise, I have a large tensor of shape (N, m), i.e. N sub-tensors of length m. More generally, the tensor has shape (N1, N2, ..., Ni, m1, m2, ..., mj), which I treat as $\prod_i N_i$ sub-tensors with $\prod_j m_j$ elements each. I need a separate histogram for each sub-tensor, all with the same number of bins, so the output shape should be (N1, N2, ..., Ni, bins).
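To make the target shape concrete, here is a minimal sketch of the loop-based baseline. The function name `loop_histc`, the `batch_dims` argument, and the fixed `lo`/`hi` range are assumptions made for illustration, not part of any PyTorch API.

```python
import math
import torch

def loop_histc(x, bins, lo, hi, batch_dims):
    """Hypothetical baseline: one torch.histc call per sub-tensor."""
    batch_shape = x.shape[:batch_dims]
    # Collapse the leading dims to N and the trailing dims to M: (N, M).
    flat = x.reshape(math.prod(batch_shape), -1)
    # torch.histc flattens its input, so call it once per row.
    rows = [torch.histc(row, bins=bins, min=lo, max=hi) for row in flat]
    # Restore the leading dims: (N1, ..., Ni, bins).
    return torch.stack(rows).reshape(*batch_shape, bins)

x = torch.rand(3, 4, 5, 6)  # (N1, N2, m1, m2)
out = loop_histc(x, bins=8, lo=0.0, hi=1.0, batch_dims=2)
print(out.shape)  # torch.Size([3, 4, 8])
```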

Since torch.histc doesn’t provide a start_dim or similar option, is it possible to do this in PyTorch without nested for loops, which are extremely slow?
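One possible loop-free approach, sketched under assumptions: if all sub-tensors share a known range, each value can be mapped to a bin index and the counts accumulated row by row with `scatter_add_`. The name `batched_histc` and the `lo`, `hi`, and `batch_dims` parameters are hypothetical. The binning convention (values outside the range are ignored, a value equal to `hi` goes into the last bin) is meant to follow torch.histc, but results for values lying exactly on a bin edge may differ because of floating-point rounding.

```python
import math
import torch

def batched_histc(x, bins, lo, hi, batch_dims):
    """Hypothetical helper: integer counts of shape (N1, ..., Ni, bins)."""
    batch_shape = x.shape[:batch_dims]
    # Collapse the leading dims to N and the trailing dims to M: (N, M).
    flat = x.reshape(math.prod(batch_shape), -1)
    # Elements outside [lo, hi] are not counted.
    valid = (flat >= lo) & (flat <= hi)
    # Map each value to a bin index; the clamp puts x == hi in the last bin.
    idx = ((flat - lo) * bins / (hi - lo)).floor().long().clamp(0, bins - 1)
    counts = torch.zeros(flat.shape[0], bins, dtype=torch.long, device=x.device)
    # Row-wise accumulation: counts[n, idx[n, k]] += valid[n, k]
    counts.scatter_add_(1, idx, valid.long())
    return counts.reshape(*batch_shape, bins)

x = torch.rand(3, 4, 5, 6)  # (N1, N2, m1, m2)
out = batched_histc(x, bins=8, lo=0.0, hi=1.0, batch_dims=2)
print(out.shape)  # torch.Size([3, 4, 8])
```

This sketch returns integer counts rather than the floating-point counts torch.histc produces, and it allocates an int64 index tensor as large as the input, so it trades memory for speed.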

Thanks in advance!