I have three tensors with very different shapes on the last dimension, say `tensor_a: (5, 35)`, `tensor_b: (5, 70)`, and `tensor_c: (5, 10)`. I need to apply exactly the same transformation function `f()` to each of them: `f(tensor_a)`, `f(tensor_b)`, `f(tensor_c)`. The function `f()` consists of a bunch of layers, which is easy to implement in PyTorch.
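To make the setup concrete, here is a minimal stand-in for my situation. The layers inside `f` below are just placeholders (my real `f()` is a larger stack of layers that accepts a variable-sized last dimension):

```python
import torch
import torch.nn as nn

# Toy placeholder for f(): any stack of layers that works for
# a variable-sized last dimension (elementwise ops, etc.).
f = nn.Sequential(nn.ReLU(), nn.Identity())

tensor_a = torch.randn(5, 35)
tensor_b = torch.randn(5, 70)
tensor_c = torch.randn(5, 10)

# Sequential version: each call launches its own (small) kernels,
# one tensor at a time.
out_a, out_b, out_c = f(tensor_a), f(tensor_b), f(tensor_c)
```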

But I noticed that GPU utilization with this sequential approach is low (around 50%). I sped it up by first padding `tensor_a` and `tensor_c` to the same shape as `tensor_b`, i.e. `(5, 70)`, then concatenating the three into a single tensor and calling `f(combined_tensor)` instead. Afterwards I simply slice the result again (`[:5]`, `[5:10]`, `[10:]`) to get the three resulting tensors I want. With this, GPU utilization is around 90%, and it runs faster. However, it has the downside of requiring a lot of extra GPU memory because of the padding, so it is not a perfect solution.
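For reference, this is a sketch of the pad-and-concatenate workaround described above (again with a placeholder `f`; `F.pad` with a `(0, n)` tuple zero-pads the last dimension on the right):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

f = nn.Sequential(nn.ReLU())  # placeholder for the real f()

tensor_a = torch.randn(5, 35)
tensor_b = torch.randn(5, 70)
tensor_c = torch.randn(5, 10)

# Zero-pad a and c on the last dimension up to b's width (70).
pad_a = F.pad(tensor_a, (0, 70 - tensor_a.shape[-1]))  # (5, 70)
pad_c = F.pad(tensor_c, (0, 70 - tensor_c.shape[-1]))  # (5, 70)

# Combine into one (15, 70) tensor and run f() once.
combined = torch.cat([pad_a, tensor_b, pad_c], dim=0)
out = f(combined)

# Slice the batch apart again and drop the padding columns.
out_a = out[:5, :35]
out_b = out[5:10, :70]
out_c = out[10:, :10]
```

The extra memory cost is visible directly: the padded tensors hold `15 * 70` elements where the originals only needed `5 * (35 + 70 + 10)`.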

I wonder how to do this better in my case? I think I need some sort of embarrassingly parallel execution in PyTorch, but googling does not give me a good answer to my problem. Please let me know if there is a way to improve this. Many thanks!