How to calculate concatenated outputs simultaneously?

I would like a custom concatenation layer that receives several Conv2d layers and returns their outputs concatenated.
One application is a layer with several kernel types (for instance, several Conv2d layers with a different dilation for each one, or a different kernel size for each one).

However, I'm concerned that explicitly concatenating the outputs, instead of “going deep” and changing the Conv2d implementation itself, will not give optimal calculation speed.
After all, in the .forward() function defined below, the outputs are calculated one after the other in a for loop.

Is there an implementation that makes sure I'm not giving up speed to get this result? Perhaps some kind of “apply” functionality that computes all the needed outputs simultaneously?

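Here’s my naive implementation:
import torch
import torch.nn as nn

class Concat_Block(nn.Module):
    def __init__(self, module_list):
        super(Concat_Block, self).__init__()
        self.module_list = nn.ModuleList(module_list)  # registers the branches' parameters
    def forward(self, x):
        output = []
        for i, module in enumerate(self.module_list):
            output.append(module(x))  # branches run one after the other
        return torch.cat(output, dim=1)  # assumes concatenation along the channel dimension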

I haven’t used or tested it myself, but PyTorch provides a parallel_apply API (torch.nn.parallel.parallel_apply).
Can you try it?
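
Separately from parallel_apply, one way to avoid the Python loop for the "different kernel size" case might be to merge the branches into a single Conv2d. This was not proposed in the thread and is only a minimal sketch. It assumes every branch has stride 1, dilation 1, an odd square kernel, "same" padding (kernel_size // 2), and a bias. The helper name fuse_branches is hypothetical. Smaller kernels are zero-padded up to the largest kernel size, so the single convolution does extra multiplications by zero and is not guaranteed to be faster; it would need to be timed on the actual hardware.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fuse_branches(convs):
    """Merge Conv2d branches into one Conv2d (hypothetical helper).

    Assumes stride 1, dilation 1, odd square kernels, padding = k // 2, bias=True.
    """
    k_max = max(c.kernel_size[0] for c in convs)
    in_ch = convs[0].in_channels
    out_ch = sum(c.out_channels for c in convs)
    fused = nn.Conv2d(in_ch, out_ch, k_max, padding=k_max // 2)
    with torch.no_grad():
        weights, biases = [], []
        for c in convs:
            # Centre each smaller kernel inside a k_max x k_max kernel of zeros.
            pad = (k_max - c.kernel_size[0]) // 2
            weights.append(F.pad(c.weight, (pad, pad, pad, pad)))
            biases.append(c.bias)
        # Stack along the output-channel dimension, matching torch.cat(outputs, dim=1).
        fused.weight.copy_(torch.cat(weights, dim=0))
        fused.bias.copy_(torch.cat(biases, dim=0))
    return fused


torch.manual_seed(0)
branches = [nn.Conv2d(3, 4, k, padding=k // 2) for k in (1, 3, 5)]
fused = fuse_branches(branches)

x = torch.randn(2, 3, 16, 16)
with torch.no_grad():
    looped = torch.cat([b(x) for b in branches], dim=1)  # for-loop version
    single = fused(x)                                    # one convolution call

max_diff = (looped - single).abs().max().item()
```

Here max_diff should be at the level of floating-point rounding, which shows the two versions compute the same thing. Note that if the fused layer is trained afterwards, the zero-padded weight entries would start receiving gradients unless they are masked, so the sketch is best suited to inference or to checking whether fusion helps at all.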