Does PyTorch 2 exploit parallelism in a computational graph during inference?

I have a PyTorch model whose forward pass looks roughly like the following:

def forward(self, x):

    # following two encoders could be run in parallel
    lidar_features = self.lidar_encoder(x['pointcloud'])
    camera_features = self.camera_encoder(x['images'])
    # need to sync here
    combined_features = torch.stack((lidar_features, camera_features))
    predictions = self.prediction_head(combined_features)
    return predictions

If the model is in eval mode, is PyTorch 2 smart enough to know that the lidar encoder and the camera encoder can run at the same time on the GPU, with a sync inserted before the torch.stack? Or will the kernels be run in the serial order of the Python code?

What about PyTorch 1.x?

Both modules will be scheduled on the same CUDA stream and will thus run sequentially, in the order of the Python code. You could use custom CUDA streams, synchronize the code manually, and check whether you see any performance improvement. Note that you might not see any overlap if a) your CPU is too slow to schedule the kernels fast enough, b) your code synchronizes and thus blocks the CPU, or c) the GPU's compute resources are already fully occupied by one module.
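To make the custom-stream suggestion concrete, here is a minimal sketch of what the manual version could look like. The `FusionModel` class, its `nn.Linear` layers, and all tensor sizes are made-up stand-ins for the real encoders, and the CPU fallback is only there so the snippet also runs without a GPU. Treat it as a starting point for an experiment, not as a guaranteed speedup.

```python
import torch
import torch.nn as nn


class FusionModel(nn.Module):
    # Hypothetical stand-in for the model in the question; layer sizes are arbitrary.
    def __init__(self):
        super().__init__()
        self.lidar_encoder = nn.Linear(16, 32)
        self.camera_encoder = nn.Linear(16, 32)
        self.prediction_head = nn.Linear(32, 4)

    def forward(self, x):
        if not x['pointcloud'].is_cuda:
            # No CUDA streams on CPU: just run the encoders one after the other.
            lidar_features = self.lidar_encoder(x['pointcloud'])
            camera_features = self.camera_encoder(x['images'])
        else:
            current = torch.cuda.current_stream()
            lidar_stream = torch.cuda.Stream()
            camera_stream = torch.cuda.Stream()
            # The side streams must wait for work already queued on the
            # current stream (e.g. the copies that produced the inputs).
            lidar_stream.wait_stream(current)
            camera_stream.wait_stream(current)
            with torch.cuda.stream(lidar_stream):
                lidar_features = self.lidar_encoder(x['pointcloud'])
            with torch.cuda.stream(camera_stream):
                camera_features = self.camera_encoder(x['images'])
            # This is the "need to sync here" point: the current stream
            # waits for both encoders before torch.stack is queued.
            current.wait_stream(lidar_stream)
            current.wait_stream(camera_stream)
            # Tell the caching allocator that these tensors, allocated on
            # the side streams, are now also used on the current stream.
            lidar_features.record_stream(current)
            camera_features.record_stream(current)
        combined_features = torch.stack((lidar_features, camera_features))
        return self.prediction_head(combined_features)


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = FusionModel().to(device).eval()
batch = {
    'pointcloud': torch.randn(8, 16, device=device),
    'images': torch.randn(8, 16, device=device),
}
with torch.no_grad():
    predictions = model(batch)
print(predictions.shape)
```

Note that `wait_stream` only orders work between streams on the GPU and does not block the CPU, so the Python thread can keep queuing kernels. Whether the two encoders actually overlap still depends on the three caveats above, so it is worth comparing timings or a profiler trace against the plain sequential version.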
