I have three tensors with very different shapes on the last dimension, say tensor_a: (5, 35), tensor_b: (5, 70), and tensor_c: (5, 10). I need to apply exactly the same transformation function f() to each of them: f(tensor_a), f(tensor_b), f(tensor_c). The function f() consists of a bunch of layers. This is easy to implement in PyTorch.
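Roughly, what I do now looks like this (f() here is just a stand-in for my real layers, using shape-agnostic ops since the last dimensions differ):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in for my real f(); the actual layers are different, but they work
# for any last-dimension size, so elementwise layers serve as a placeholder.
f = nn.Sequential(nn.ReLU(), nn.Dropout(p=0.1)).to(device)

tensor_a = torch.randn(5, 35, device=device)
tensor_b = torch.randn(5, 70, device=device)
tensor_c = torch.randn(5, 10, device=device)

# Three separate small calls -> each launch does little work
out_a, out_b, out_c = f(tensor_a), f(tensor_b), f(tensor_c)
```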
But I noticed that GPU usage with these sequential calls is low (around 50%). I sped things up by padding tensor_a and tensor_c on the last dimension so they match the shape of tensor_b, i.e. (5, 70), then concatenating the three into a single tensor and calling f(combined_tensor) instead. Afterwards I simply slice the result again ([:5], [5:10], [10:]) to get the three output tensors I want. With this, GPU usage is around 90% and the whole thing is faster. The downside is that it requires a lot of extra GPU memory because of the padding, so it is not a perfect solution.
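The padded version looks roughly like this (reusing tensor_a, tensor_b, tensor_c and f from the sketch above, and assuming f() preserves the input shape so the padding can be sliced off afterwards):

```python
import torch.nn.functional as F

# Pad tensor_a and tensor_c on the last dimension up to tensor_b's width (70).
# This is where the extra GPU memory goes: the padding is mostly zeros.
width = tensor_b.shape[-1]
padded_a = F.pad(tensor_a, (0, width - tensor_a.shape[-1]))  # (5, 70)
padded_c = F.pad(tensor_c, (0, width - tensor_c.shape[-1]))  # (5, 70)

# One big tensor, one call to f() -> much better GPU utilization
combined = torch.cat([padded_a, tensor_b, padded_c], dim=0)  # (15, 70)
out = f(combined)

# Slice the rows back apart and drop the padded columns again
out_a = out[:5, :tensor_a.shape[-1]]
out_b = out[5:10]
out_c = out[10:, :tensor_c.shape[-1]]
```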
I wonder how to do this better in my case. I think I need some sort of embarrassingly parallel execution in PyTorch, but googling does not give me a good answer for my problem. Please let me know if there is a way to improve this. Many thanks!