Hi

I’m working on a problem involving sensitivity analysis, and I’m hoping to use PyTorch and its built-in operations instead of coding everything from scratch in CUDA.

I have a small example (using an NN, since most people here are familiar with that) where the computations involving `dZ` and `dA` are independent of those for `Z` and `A`:

```
import torch

def sensitive(d_inp, inp, param):
    # forward pass: inp (B, n_in), param (n_out, n_in)
    Z = torch.matmul(inp, param.T)
    # sensitivity pass, independent of Z: d_inp (B, n_dir, n_in)
    dZ = torch.matmul(d_inp, param.T)
    A = torch.tanh(Z)
    # tanh'(Z) = 1 - tanh(Z)^2, broadcast over the n_dir dimension
    dA = torch.unsqueeze(1 - torch.tanh(Z)**2, dim=1) * dZ
    return A, dA
```

I want to parallelise the code so that `Z` and `dZ` are computed in parallel, followed by the parallel evaluation of `A` and `dA`.
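One thing I experimented with (not sure it’s the right approach) is issuing the two independent matmuls on separate CUDA streams, so the scheduler at least has the *option* to overlap them. The `sensitive_streams` name and the CPU fallback are my own; the shapes assumed are `inp` (B, n_in), `d_inp` (B, n_dir, n_in), `param` (n_out, n_in):

```python
import torch

def sensitive_streams(d_inp, inp, param):
    # Sketch: run the two independent matmuls on separate CUDA streams
    # so they may overlap; falls back to sequential execution on CPU.
    if inp.is_cuda:
        s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
        # both side streams wait for work already queued on the default stream
        s1.wait_stream(torch.cuda.current_stream())
        s2.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s1):
            Z = torch.matmul(inp, param.T)
            A = torch.tanh(Z)
        with torch.cuda.stream(s2):
            dZ = torch.matmul(d_inp, param.T)
        # dA needs results from both streams, so rejoin before computing it
        torch.cuda.current_stream().wait_stream(s1)
        torch.cuda.current_stream().wait_stream(s2)
    else:
        Z = torch.matmul(inp, param.T)
        A = torch.tanh(Z)
        dZ = torch.matmul(d_inp, param.T)
    # tanh'(Z) = 1 - tanh(Z)^2 = 1 - A^2; broadcast over n_dir
    dA = (1 - A**2).unsqueeze(1) * dZ
    return A, dA
```

I’m unsure whether this actually helps in practice, since a single large matmul can already saturate the GPU on its own.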

I searched for a solution to this but couldn’t find anything definitive. Hope someone can help me out here.
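In case it helps frame the question: one fallback I’ve considered is fusing the two matmuls into a single GEMM, since both multiply by `param.T` (the library’s GEMM kernel then parallelises internally). The `sensitive_fused` name and the assumed shapes (`inp` (B, n_in), `d_inp` (B, n_dir, n_in)) are my own:

```python
import torch

def sensitive_fused(d_inp, inp, param):
    # Sketch: stack inp and the flattened d_inp so that Z and dZ
    # come out of one matmul (one kernel launch) instead of two.
    B, n_dir, n_in = d_inp.shape
    stacked = torch.cat([inp, d_inp.reshape(B * n_dir, n_in)], dim=0)
    out = torch.matmul(stacked, param.T)          # single GEMM
    Z = out[:B]
    dZ = out[B:].reshape(B, n_dir, -1)
    A = torch.tanh(Z)
    # tanh'(Z) = 1 - tanh(Z)^2 = 1 - A^2; broadcast over n_dir
    dA = (1 - A**2).unsqueeze(1) * dZ
    return A, dA
```

This avoids explicit parallelism entirely, but I’d still like to know whether there is a more idiomatic PyTorch way.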

Thanks