Is there a way to parallelize independent sequential steps?

I am currently working with a model that has a fixed set of heavy operations applied to different versions of the same input. It would be really nice if I could perform these operations in parallel, as each of them takes a big chunk of the total execution time. In Theano, with graph optimizations, this would have been executed in parallel automatically, but I am not sure how to parallelize these operations in PyTorch. I looked into Python's multiprocessing module and PyTorch's wrapper for it, but I am not sure whether its usage would maintain the integrity of autograd's graph, or how it would behave during backprop (which is actually even more expensive than the forward pass).

An NDA prevents me from sharing the exact model details and code, but here is a representative example that captures the problem:

def forward(x):
    # transform the input in 4 different ways - transform is a lightweight operation
    x1 = transform(x, 0, 0)
    x2 = transform(x, 0, 1)
    x3 = transform(x, 1, 0)
    x4 = transform(x, 1, 1)

    # process each transformed input with the same function
    # my_func is time-consuming and the current bottleneck, but is already optimized internally
    # ideally, all four `my_func` calls below would execute in parallel
    a = my_func(x1)
    b = my_func(x2)
    c = my_func(x3)
    d = my_func(x4)

    result = a + b + c + d
    
    return result

Is there any way I could speed up this function? The lack of parallelism will hurt even more once I stack many such steps, which I am planning to do next, so any suggestions for speeding this up would help a lot.

Thanks!


After seeing some discussion on Twitter, I gather that lazy evaluation would defeat some of the goals of a truly dynamic tensor manipulation framework. But at least for cases like the example above, a declarative way to let several operations execute in parallel would be very useful.
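In the meantime, one manual workaround I am considering, assuming the heavy work runs on a GPU, is to issue each my_func call on its own CUDA stream so the kernels can overlap when the device has spare capacity. A rough, untested sketch (I am not sure how well this interacts with autograd, and whether the kernels actually overlap depends on their size and occupancy):

import torch

# hedged sketch: launch the four heavy calls on separate CUDA streams;
# my_func and x1..x4 are the placeholders from the example above
cur = torch.cuda.current_stream()
streams = [torch.cuda.Stream() for _ in range(4)]
outs = []
for s, xi in zip(streams, (x1, x2, x3, x4)):
    s.wait_stream(cur)            # xi must be ready before the side stream reads it
    with torch.cuda.stream(s):
        outs.append(my_func(xi))
for s in streams:
    cur.wait_stream(s)            # rejoin the default stream before combining
result = outs[0] + outs[1] + outs[2] + outs[3]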

Depending on how complex my_func is, you may be able to get some mileage out of wrapping it in nn.DataParallel and running each call on a separate GPU, if you have several available. But you can also rewrite my_func to take a list of variables and manually batch certain sub-operations together; that is likely to be the fastest approach.
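For instance, if my_func already accepts a leading batch dimension, the four calls can collapse into one. A minimal sketch, assuming the outputs of transform all share the same shape:

import torch

def forward(x):
    # stack the four transformed inputs along a new batch dimension,
    # assuming they have identical shapes and my_func treats dim 0 as a batch
    xs = torch.stack([
        transform(x, 0, 0),
        transform(x, 0, 1),
        transform(x, 1, 0),
        transform(x, 1, 1),
    ], dim=0)

    out = my_func(xs)      # one batched call instead of four serial ones
    return out.sum(dim=0)  # equivalent to a + b + c + d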

Hi pranav, it seems that static graphs have the upper hand for this kind of parallel task. Even in the official torchvision InceptionV3, the inception modules are implemented serially.
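For reference, the forward pass of torchvision's InceptionA looks roughly like this (paraphrased; note that the independent branches are computed one after another and only then concatenated):

import torch
import torch.nn.functional as F

# paraphrased sketch of torchvision's InceptionA.forward: each branch is
# independent of the others, yet they run in sequence on the same stream
def forward(self, x):
    branch1x1 = self.branch1x1(x)

    branch5x5 = self.branch5x5_1(x)
    branch5x5 = self.branch5x5_2(branch5x5)

    branch3x3dbl = self.branch3x3dbl_1(x)
    branch3x3dbl = self.branch3x3dbl_2(branch3x3dbl)
    branch3x3dbl = self.branch3x3dbl_3(branch3x3dbl)

    branch_pool = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)
    branch_pool = self.branch_pool(branch_pool)

    return torch.cat([branch1x1, branch5x5, branch3x3dbl, branch_pool], 1)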