Locally branched architectures

If an architecture has local branching in its convolutions (the most famous example being Inception blocks, where multiple convolutions at different scales are applied and their results combined), then in PyTorch each convolution will be evaluated sequentially, as in the typical implementation of an inception block (see the sketch below). Doesn't that mean PyTorch will be much slower than TensorFlow for this kind of architecture?

Or does it not matter, because the convolutions are expensive enough that they couldn't be evaluated in parallel anyway? I.e., in practice we don't actually care?
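For reference, here is a minimal sketch of the kind of block I mean (the channel counts and kernel sizes are just illustrative, not from any particular Inception variant):

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Minimal inception-style block: three parallel convolutions at
    different scales, concatenated along the channel dimension."""
    def __init__(self, in_channels):
        super().__init__()
        # Hypothetical channel counts, chosen only for illustration.
        self.branch1 = nn.Conv2d(in_channels, 16, kernel_size=1)
        self.branch3 = nn.Conv2d(in_channels, 16, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_channels, 16, kernel_size=5, padding=2)

    def forward(self, x):
        # The branches are independent of each other, but Python
        # issues them one after the other.
        b1 = self.branch1(x)
        b3 = self.branch3(x)
        b5 = self.branch5(x)
        return torch.cat([b1, b3, b5], dim=1)
```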


Good question. In my experience, PyTorch conv blocks usually push GPU utilization to 95%+, so in such cases I don't think parallelism at this level will help much.

Even though the Python interface is sequential, you should know that the CUDA API is completely asynchronous.
This means that if you run three convolutions in PyTorch (without explicit synchronization), all three can run at the same time (or one after the other, depending on your GPU's compute capacity).
You can see this, for example, when trying to time CUDA operations: if you don't add any explicit synchronization, the call to the CUDA function returns almost immediately. The call that will be slow is the one that uses the results of the computation, because it has to wait for the computation to finish.
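A quick way to observe this, assuming a CUDA-capable GPU (the tensor shapes here are arbitrary):

```python
import time
import torch

conv = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda()
x = torch.randn(64, 3, 224, 224, device="cuda")
conv(x)  # warm-up, so one-time initialization cost doesn't skew timings

# Without synchronization: this only enqueues work on the CUDA stream,
# so the call returns almost immediately.
start = time.time()
y = conv(x)
print(f"async launch:     {time.time() - start:.6f} s")

# With synchronization: we block until the kernel actually finishes,
# so this reflects the real compute time.
start = time.time()
y = conv(x)
torch.cuda.synchronize()
print(f"launch + compute: {time.time() - start:.6f} s")
```

The first timing measures only the kernel launch; the second is typically much larger because `torch.cuda.synchronize()` waits for the GPU to finish.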


Oh yeah, I remember this being an issue years ago when I was benchmarking Theano code. Totally forgot CUDA could do that. Thanks!