Asynchronous Training of Deep Learning Models

I am wondering whether it is possible to write an asynchronous forward function in a subclass of nn.Module. When I came across the architecture in the attached image, I felt that training would be faster if the forward pass through each branch could run in parallel (see the sketch below for what I mean).

[attached image: an architecture with multiple parallel branches]
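To make the idea concrete, here is a minimal sketch of what I have in mind; the branch modules and sizes are just placeholders. As far as I know, torch.jit.fork / torch.jit.wait is the built-in way to launch sub-module forward passes concurrently (note the docs say real parallelism only happens under TorchScript, not in eager mode):

```python
import torch
import torch.nn as nn

class ParallelBranches(nn.Module):
    """Toy two-branch model; placeholder for the branched architecture above."""
    def __init__(self):
        super().__init__()
        self.branch_a = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
        self.branch_b = nn.Sequential(nn.Linear(64, 64), nn.ReLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # fork launches branch_a asynchronously and returns a future;
        # branch_b runs on the current thread in the meantime.
        fut = torch.jit.fork(self.branch_a, x)
        out_b = self.branch_b(x)
        out_a = torch.jit.wait(fut)  # join the forked branch
        return out_a + out_b

# Scripting is required for the fork to actually run in parallel.
model = torch.jit.script(ParallelBranches())
y = model(torch.randn(32, 64))
```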

Another scenario I have come across is an input of shape (B, C, V, T), where:
B: batch size, C: node features, V: number of nodes, T: time stamps

So, for each instance in the batch there is a graph per time stamp. Now suppose there is a module that computes the adjacency matrix of a graph from its node features. It becomes compute-intensive if we club (B, C, V, T) -> (B*T, C, V) and then treat B*T as the new batch, with T ≈ 250 and B = 32 (a sketch of this clubbing follows).
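For reference, the clubbing I mean is just a permute + reshape, and the adjacency module here is a toy similarity-based one I made up for illustration:

```python
import torch

B, C, V, T = 32, 16, 25, 250
x = torch.randn(B, C, V, T)

# Fold time into the batch dimension: (B, C, V, T) -> (B*T, C, V)
x_flat = x.permute(0, 3, 1, 2).reshape(B * T, C, V)  # (8000, 16, 25)

# Toy adjacency module: similarity between node feature vectors,
# producing one (V, V) matrix per graph in the merged batch.
feats = x_flat.transpose(1, 2)                              # (B*T, V, C)
adj = torch.softmax(feats @ feats.transpose(1, 2), dim=-1)  # (B*T, V, V)
```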

Is there some workaround to evaluate the adjacency matrices efficiently? This problem is not limited to computing adj_mat through my custom module; to use any graph layer from PyG, we also need to club B and T (example below).
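For instance, even with a dense PyG layer the only approach I see is feeding the merged B*T batch; DenseGCNConv is used here purely as an illustration, with random placeholder tensors:

```python
import torch
from torch_geometric.nn import DenseGCNConv

B, C, V, T = 32, 16, 25, 250
x_flat = torch.randn(B * T, V, C)  # node features, one graph per (batch, time) pair
adj = torch.rand(B * T, V, V)      # one dense adjacency matrix per graph

conv = DenseGCNConv(C, 32)
out = conv(x_flat, adj)            # (B*T, V, 32)
```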

I’m not talking about training on multiple GPUs, but about making the calls asynchronously rather than sequentially.

Please answer both questions. Thanks in advance!