How to parallelize a loop over the samples of a batch

How is the RPC framework different from what this tutorial shows (Writing Distributed Applications with PyTorch — PyTorch Tutorials 1.7.1 documentation)? That one uses send/recv, all_reduce, etc. There are so many options that it gets confusing and frustrating. I understand the error I'm getting, but I don't understand why there isn't something like rpc.map that does the fan-out for me and still lets me use gradients, so that code like this:

from torch.multiprocessing import Pool

with Pool(100) as pool:
    losses = pool.map(forward, batch)  # per-sample forward passes in parallel
    torch.mean(torch.stack(losses)).backward()  # average the losses and backprop
optimizer.step()

just works…
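From what I can tell from the distributed autograd docs, the closest thing to that "rpc.map" is fanning the per-sample forward out with rpc_async inside a dist_autograd context, which records the send/recv pairs so backward can cross process boundaries. Here is a minimal self-contained sketch of my understanding — per_sample_loss, the worker names, port and world size are placeholders I made up, not anything from the tutorial:

import os
import torch
import torch.multiprocessing as mp
import torch.distributed.rpc as rpc
import torch.distributed.autograd as dist_autograd

def per_sample_loss(w, x):
    # runs on the callee; stand-in for a real per-sample forward pass
    return (w * x).sum()

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    if rank == 0:
        w = torch.ones(3, requires_grad=True)          # parameter lives on the caller
        batch = [torch.randn(3) for _ in range(8)]
        with dist_autograd.context() as context_id:
            # the "rpc.map": one async RPC per sample, all sent to worker1 here
            futs = [rpc.rpc_async("worker1", per_sample_loss, args=(w, x)) for x in batch]
            loss = torch.stack([f.wait() for f in futs]).mean()
            dist_autograd.backward(context_id, [loss])  # backward crosses processes
            grads = dist_autograd.get_gradients(context_id)
            print(grads[w])                             # gradient of the mean loss w.r.t. w
    rpc.shutdown()                                      # worker1 keeps serving until rank 0 is done

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)

get_gradients returns a dict mapping each tensor to the gradient accumulated in that context, which seems to be exactly the part the plain Pool version can't give me.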


rpc example: python - How does one implement parallel SGD with the pytorch autograd RPC library so that gradients can be received from different processes without errors? - Stack Overflow
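For the optimizer step, DistributedOptimizer appears to be what closes the loop: you hand it RRefs to the parameters and it applies the per-context gradients for you instead of you reading them out of the context by hand. Another sketch of my understanding (not necessarily what the linked answer does; model and the loss are placeholders, and it assumes rpc.init_rpc has already run as in the sketch above):

import torch
import torch.distributed.autograd as dist_autograd
from torch.distributed.optim import DistributedOptimizer
from torch.distributed.rpc import RRef

model = torch.nn.Linear(3, 1)                       # placeholder model on the caller
dist_opt = DistributedOptimizer(
    torch.optim.SGD,                                # optimizer class to wrap
    [RRef(p) for p in model.parameters()],          # RRefs to the parameters to update
    lr=0.1,
)

with dist_autograd.context() as context_id:
    loss = model(torch.randn(4, 3)).sum()           # placeholder for the distributed loss
    dist_autograd.backward(context_id, [loss])      # gradients are stored per-context
    dist_opt.step(context_id)                       # applies exactly those gradients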


My intention was to parallelize meta-learning with torchmeta + higher, but it seems that path is dead with DDP until higher is incorporated into PyTorch core. See:

but the RPC path might not be dead:
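For context, the (currently sequential) per-task loop I'm trying to parallelize looks roughly like this — standard MAML-style use of higher, with dummy tensors standing in for the torchmeta task batches:

import torch
import torch.nn.functional as F
import higher

model = torch.nn.Linear(4, 1)
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_opt = torch.optim.SGD(model.parameters(), lr=0.1)

# dummy (support_x, support_y, query_x, query_y) tuples in place of torchmeta task batches
tasks = [tuple(torch.randn(5, d) for d in (4, 1, 4, 1)) for _ in range(8)]

meta_opt.zero_grad()
for x_spt, y_spt, x_qry, y_qry in tasks:            # <- the loop I want to parallelize
    with higher.innerloop_ctx(model, inner_opt, copy_initial_weights=False) as (fmodel, diffopt):
        for _ in range(5):                           # inner adaptation steps
            diffopt.step(F.mse_loss(fmodel(x_spt), y_spt))
        # outer loss on the query set; backward accumulates into model.parameters()
        F.mse_loss(fmodel(x_qry), y_qry).backward()
meta_opt.step()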