Hi guys! I am developing a differentiable simulator, but I don’t know how to run tensor operations concurrently with autograd. I need concurrent tensor operations rather than batched ones because my simulation produces many tensors of very different shapes, although the same map/reduce functions can be applied to all of them.

For example, and for the sake of simplicity, suppose we have a dummy “simulator” that, based on the user’s input, generates many intermediate matrices of different shapes, sums them up, and adds the results to some outputs. We also need differentiability, so we need autograd. The example pseudo-code is

```
from multiprocessing.pool import ThreadPool

import torch


def map_function(tensor):
    # an expensive map function that is not batch-able
    if tensor.shape[0] > 10:
        return tensor.sum()
    else:
        return tensor * 50


def reduce_function(target, tensor):
    # Placeholder for `target ?= tensor`, where ?= is a very complex in-place
    # operator that modifies the value of target. add_ is only a stand-in.
    target.add_(tensor.sum())


if __name__ == "__main__":
    # suppose the intermediate matrices are a, b, c, d
    a = torch.rand(3, 3, requires_grad=True)
    b = torch.rand(5, 5, requires_grad=True)
    c = torch.rand(100, 100, requires_grad=True)
    d = torch.rand(20, 20, requires_grad=True)
    pool = ThreadPool()
    mapped_tensors = pool.map(map_function, [a * 2, b * 2, c * 2, d * 2])
    target1 = torch.rand(200, 200, requires_grad=True)
    t1 = target1 * 2  # non-leaf copy, so in-place updates are allowed
    target2 = torch.rand(50, 50, requires_grad=True)
    t2 = target2 * 2
    for t in mapped_tensors:  # 1
        reduce_function(t1, t)
    for t in mapped_tensors:  # 2
        reduce_function(t2, t)
    # a toy loss
    loss = (t1 ** 2).sum() + (t2 ** 2).sum()
    loss.backward()
```

Please ignore the insanity of this example code. The main problems here are:

- The map function cannot be batched, since the input tensors have different shapes. The shapes also vary a lot, so we cannot afford padding.
- The reduce function is complex and modifies the target in place, so we cannot afford to run the calls sequentially as in #1 and #2. The loops #1 and #2 can obviously be executed concurrently with each other. So we want the iterations within each for-loop, and the two for-loops themselves, to run concurrently, but we also want correct gradients, which means autograd has to work in this concurrent setting without data races.
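To make the second point concrete, here is a minimal sketch of the "across loops" case only: each target gets its own thread, and the updates to a single target stay sequential. `reduce_into` and `run` are hypothetical names, and `add_` is just a stand-in for my real in-place operator. It compares the gradients against a purely sequential run on CPU; matching here is a sanity check for this toy case, not evidence that the general case is race-free.

```python
from multiprocessing.pool import ThreadPool

import torch


def reduce_into(target, mapped):
    # Hypothetical stand-in for the real in-place reduce.
    # Only `target` is written; the mapped tensors are read-only.
    for m in mapped:
        target.add_(m.sum())
    return target


def run(concurrent):
    torch.manual_seed(0)  # same random inputs for both runs
    a = torch.rand(3, 3, requires_grad=True)
    c = torch.rand(100, 100, requires_grad=True)
    mapped = [a * 100, (c * 2).sum()]
    target1 = torch.rand(200, 200, requires_grad=True)
    target2 = torch.rand(50, 50, requires_grad=True)
    t1, t2 = target1 * 2, target2 * 2  # non-leaf tensors, safe to update in place
    if concurrent:
        # One thread per target: no two threads write to the same tensor.
        with ThreadPool(2) as pool:
            pool.starmap(reduce_into, [(t1, mapped), (t2, mapped)])
    else:
        reduce_into(t1, mapped)
        reduce_into(t2, mapped)
    loss = (t1 ** 2).sum() + (t2 ** 2).sum()
    loss.backward()  # single backward call from the main thread
    return [a.grad, c.grad, target1.grad, target2.grad]


seq = run(concurrent=False)
par = run(concurrent=True)
same = all(torch.allclose(g1, g2, rtol=1e-4) for g1, g2 in zip(seq, par))
print(same)
```

This does not address concurrency within one loop, where several threads would write to the same target; that is the part I am least sure about.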

I’ve searched a bit, and it seems `rpc` could work. Given CUDA’s asynchronous execution semantics, running them on CUDA also seems like it would work. Please correct me if I’m wrong about either of these.

For the concurrent map function, is it safe to build a graph concurrently? I think it’s okay as long as there are no in-place operations. Autograd Mechanics says there may be risks in calling `.backward()` concurrently, but it doesn’t mention this simple case.
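Here is a small sketch of the simple case I mean: the forward pass of the map function runs in a `ThreadPool` using only out-of-place operations, and `.backward()` is called once from the main thread. The helper `run` is a hypothetical name I made up for this check. It compares the gradients with a sequential run on CPU. Agreement on this toy input suggests the graph was built correctly here, but I would still like confirmation that it is safe in general.

```python
from multiprocessing.pool import ThreadPool

import torch


def map_function(tensor):
    # Out-of-place only: no existing tensor is modified inside the threads.
    if tensor.shape[0] > 10:
        return tensor.sum()
    return tensor * 50


def run(inputs, concurrent):
    # Fresh leaf tensors for each run so the gradients do not accumulate.
    leaves = [x.clone().requires_grad_(True) for x in inputs]
    scaled = [x * 2 for x in leaves]
    if concurrent:
        with ThreadPool(4) as pool:
            mapped = pool.map(map_function, scaled)  # graph built in worker threads
    else:
        mapped = [map_function(t) for t in scaled]
    loss = sum(m.sum() for m in mapped)
    loss.backward()  # single backward call from the main thread
    return [x.grad for x in leaves]


torch.manual_seed(0)
inputs = [torch.rand(n, n) for n in (3, 5, 100, 20)]
grads_seq = run(inputs, concurrent=False)
grads_par = run(inputs, concurrent=True)
grads_match = all(torch.allclose(g1, g2) for g1, g2 in zip(grads_seq, grads_par))
print(grads_match)
```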

And I wonder if there are simpler solutions that work on both CUDA and CPU with little overhead. Thanks a lot!