I am trying to classify images into categories (genera) and subcategories (species). I want to build a model that explicitly factors the output as `P[species|data] = P[genus | data] * P[species | genus, data]`. I am implementing this by having one layer compute `Log[P[genus|data]]` and adding it to another layer that computes `Log[P[species | genus, data]]`.

Unfortunately, the genera have different numbers of species in the dataset, so to compute the final output I need to transform `Log[P[genus|data]]` into a tensor in which each entry is repeated a number of times equal to the number of species in the corresponding genus. Right now I am doing this in a very inefficient way that relies on a Python list comprehension (`G` is a tensor of size `[batch_size, n_genera]` giving the genus log-probabilities and `S` is a tensor of size `[batch_size, n_species]` giving the species log-probabilities, with `n_species > n_genera`):

```
# Repeat each genus column s times (s = number of species in that genus),
# then add the expanded genus log-probs to the species log-probs.
genus_expand = torch.transpose(
    torch.stack(
        [g for s, g in zip(n_species_list, torch.transpose(G, 0, 1)) for i in range(s)],
        0,
    ),
    0,
    1,
)
final_prob = S + genus_expand
```

where `n_species_list` is a list of length `n_genera` of integers giving the number of species in each genus.
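To make the shapes concrete, here is a minimal runnable version of the expansion on made-up dimensions (a batch of 2 and three genera with 2, 1, and 3 species, so `n_species = 6`; all sizes are illustrative, not from the real dataset):

```python
import torch

# Illustrative sizes only.
batch_size = 2
n_species_list = [2, 1, 3]          # species counts per genus
n_genera = len(n_species_list)      # 3
n_species = sum(n_species_list)     # 6

G = torch.randn(batch_size, n_genera)   # genus log-probabilities
S = torch.randn(batch_size, n_species)  # species log-probabilities

# The list-comprehension expansion: each genus column is repeated
# as many times as that genus has species.
genus_expand = torch.transpose(
    torch.stack(
        [g for s, g in zip(n_species_list, torch.transpose(G, 0, 1)) for i in range(s)],
        0,
    ),
    0,
    1,
)
final_prob = S + genus_expand

print(genus_expand.shape)  # torch.Size([2, 6])
# Column layout: [g0, g0, g1, g2, g2, g2]
```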

Anyway, all that is just to explain what I am doing. Implemented this way, I only seem to be able to get about 70% GPU utilization, whereas an implementation without this step, where I only use `P[species|data]` so everything stays on the GPU with no Python list comprehension, gets more like 90-95% utilization. I want to use something like `tile`, but since the number of repetitions differs between genera, I don't think I can make it work without essentially reproducing the line of code above.
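For reference, the same expansion can be written without the Python loop using `torch.repeat_interleave`, which accepts a tensor of per-element repeat counts along a dimension (this is the kind of vectorized replacement I am after, though I have not verified that it actually fixes the utilization gap; the sizes below are made up):

```python
import torch

# Illustrative sizes only.
batch_size = 2
n_species_list = [2, 1, 3]
G = torch.randn(batch_size, len(n_species_list))

# repeat_interleave repeats each column of G according to its count,
# running entirely on whatever device holds G (CPU or GPU).
repeats = torch.tensor(n_species_list, device=G.device)
genus_expand = torch.repeat_interleave(G, repeats, dim=1)

print(genus_expand.shape)  # torch.Size([2, 6])
```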

Let me know if my explanation of my goal here is not fully clear.