I want to compute the gradient for the parameters of a distribution in two steps, so that the code defining the distribution can be decoupled from the training loop. The following works, but I have to retain the whole graph (or manually delete the references to `cost` and `samples` with `del`, which may become infeasible when the computational graph gets more complex).

```
import torch
import torch.nn as nn
import torch.distributions as dd

log_std = nn.Parameter(torch.Tensor([1]))
std = torch.exp(log_std)  # the graph from log_std to std is built once, outside the loop
mean = nn.Parameter(torch.Tensor([1]))
dist = dd.Normal(loc=mean, scale=std)
optim = torch.optim.SGD([log_std, mean], lr=0.01)
target = dd.Normal(5, 5)

for i in range(50):
    optim.zero_grad()
    samples = dist.rsample((1000,))
    cost = -(target.log_prob(samples) - dist.log_prob(samples)).sum()
    cost.backward(retain_graph=True)
    optim.step()
    print(i, log_std, mean, cost)
```
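For context, the need for `retain_graph=True` can be reproduced in isolation (a minimal sketch with made-up values): the subgraph from `log_std` to `std` is built once outside the loop, and the first `backward()` frees its buffers.

```python
import torch

# Minimal sketch of the retain_graph problem: the subgraph
# log_std -> std is built once, outside any loop.
log_std = torch.ones(1, requires_grad=True)
std = torch.exp(log_std)          # built once, like in the training loop

(std * 2.0).sum().backward()      # first backward frees the exp() buffers
print(log_std.grad)               # gradient arrived fine the first time

try:
    (std * 2.0).sum().backward()  # second pass through the freed subgraph
except RuntimeError as err:
    second_error = str(err)
print("second backward raised:", second_error)
```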

I would like to do something like this instead

```
import torch
import torch.nn as nn
import torch.distributions as dd

log_std = nn.Parameter(torch.Tensor([1]))
std = torch.exp(log_std)
mean = nn.Parameter(torch.Tensor([1]))
dist = dd.Normal(loc=mean, scale=std)
optim = torch.optim.SGD([log_std, mean], lr=0.01)
target = dd.Normal(5, 5)

for i in range(50):
    optim.zero_grad()
    samples = dist.rsample((1000,))
    # detach() returns a tensor with requires_grad=False, so re-enable grads
    # on the new leaf instead of calling retain_grad() on it
    detached_samples = samples.detach().requires_grad_()
    cost = -(target.log_prob(detached_samples) - dist.log_prob(detached_samples)).sum()
    cost.backward()  # compute gradients up to the detached samples
    # push those gradients through the sampling path to log_std and mean,
    # retaining only this part of the graph
    samples.backward(detached_samples.grad, retain_graph=True)
    optim.step()  # apply update
    print(i, log_std, mean, cost)
```
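As a sanity check that the two-stage idea itself is possible, here is a toy scalar sketch (a made-up example, unrelated to the distributions above) of cutting the graph at an intermediate tensor and propagating the gradient in two `backward()` calls:

```python
import torch

# Toy sketch of a two-stage backward: cut the graph at an intermediate
# tensor, backprop the cost to the cut point, then push that gradient
# through the first stage manually.
x = torch.tensor([2.0], requires_grad=True)
y = 3.0 * x                          # stage 1 (plays the role of rsample)
y_cut = y.detach().requires_grad_()  # new leaf, graph is cut here
z = (y_cut ** 2).sum()               # stage 2 (plays the role of the cost)

z.backward()                         # gradients stop at y_cut
y.backward(y_cut.grad)               # hand them to stage 1 manually
print(x.grad)                        # matches d/dx of (3x)^2 = 18x -> 36
```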

Even if I detach the samples, when I run `cost.backward()` I get gradients for `log_std` and `mean`. I don’t know if that’s intended, and whether what I’m doing is smart, dumb, or even possible, but I would like someone to shed some light on this.
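For reference, the behaviour can be reproduced in isolation (a minimal sketch): `dist.log_prob` depends on `mean` and `log_std` directly, so the parameters receive gradients even when the input samples carry no gradient path at all.

```python
import torch
import torch.nn as nn
import torch.distributions as dd

# Minimal sketch: log_prob is a function of mean and log_std themselves,
# so a fully detached input still produces parameter gradients.
log_std = nn.Parameter(torch.Tensor([1]))
mean = nn.Parameter(torch.Tensor([1]))
dist = dd.Normal(loc=mean, scale=torch.exp(log_std))

x = dist.rsample((10,)).detach()   # no gradient path through the samples
dist.log_prob(x).sum().backward()
print(mean.grad, log_std.grad)     # both populated despite the detach
```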