CUDA asynchronous execution

According to this, operations that involve the GPU are processed asynchronously and in parallel.
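To convince myself of at least the asynchronous part, I timed a single kernel launch before and after torch.cuda.synchronize() (a minimal sketch; the 4096x4096 matmul is just an arbitrary workload big enough to outlive the Python call):

import time
import torch

a = torch.randn(4096, 4096, device="cuda")
torch.cuda.synchronize()  # make sure setup work is finished first

start = time.time()
b = a @ a  # the kernel launch returns to Python almost immediately
after_launch = time.time() - start

torch.cuda.synchronize()  # block until the GPU has actually finished
after_sync = time.time() - start
print(f"after launch: {after_launch:.6f}s, after sync: {after_sync:.6f}s")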

Does this mean that if I have many modules that can independently work on the same data, they will be processed in parallel, even if I call them sequentially?

For example, let’s say I have some code that looks like this:

import torch

x = torch.randn(1, 3, 32, 32).cuda()
conv_list = []
for i in range(30):
    conv_list.append(torch.nn.Conv2d(3, 3, 3, 1).cuda())

# Is this executed in parallel?
output = sum(conv(x) for conv in conv_list)

Above, I’ve created 30 different conv layers that can work independently of each other. I’m calling them sequentially in a generator expression, but my understanding is that under the hood they are actually executed in parallel. Is that correct?
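One way I thought of checking this (assuming torch.profiler is the right tool here) is to trace the forward pass and see whether the conv kernels actually overlap on the CUDA timeline:

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    output = sum(conv(x) for conv in conv_list)
    torch.cuda.synchronize()  # make sure all kernels finish inside the trace

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
prof.export_chrome_trace("trace.json")  # inspect kernel overlap in chrome://tracing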

I have a model in which many sub-networks can operate independently of each other, and I’d like to make it as parallel as possible. I was thinking of creating multiple torch.cuda.Stream() objects and using one for each independent module, as sketched below. But if my example code above already runs in parallel, using multiple torch.cuda.Stream() objects would be silly.
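For reference, this is roughly the multi-stream version I had in mind (a sketch only; I’m not sure I have the synchronization and memory caveats right, e.g. whether record_stream() is needed for tensors produced on side streams):

import torch

streams = [torch.cuda.Stream() for _ in conv_list]
outputs = []

torch.cuda.synchronize()  # make sure x is ready before forking onto side streams
for conv, stream in zip(conv_list, streams):
    with torch.cuda.stream(stream):  # kernels below are queued on this stream
        outputs.append(conv(x))

torch.cuda.synchronize()  # wait for all streams before combining on the default stream
output = sum(outputs)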