I am running on a machine with one 24-core CPU and one A100 GPU. Each subprocess runs the following (stripped-down) code:

```
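import time

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.optim import Adam

# general setup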
dist.init_process_group('gloo', rank=rank, world_size=num_envs)
torch.set_num_threads(1)
# set up local model and optimizer
local_model = DDP(model, device_ids=[device])
optim = Adam(local_model.parameters(), lr=.001)
# Save the trajectory from an episode:
# - action_lgprobs[i] is the log probability of the action sampled at step i
# - entropies[i] is the entropy of the distribution returned by the model at step i
# - rewards[i] is the reward assigned to the i'th action by the environment
# action_lgprobs and entropies are attached to the computational graph.
action_lgprobs, entropies, rewards = run_episode(env, local_model)
returns = compute_returns(rewards, discount)
# compute baselines
baselines = returns.clone()
dist.all_reduce(baselines)
baselines /= num_envs
advantages = returns - baselines
# compute loss
policy_loss = -advantages @ action_lgprobs
entropy_loss = entropy_weight * entropies.sum()
loss = policy_loss + entropy_loss
# update model
optim.zero_grad()
torch.cuda.synchronize()
start = time.time()
loss.backward() # slow!
torch.cuda.synchronize()
end = time.time()
print('BACKWARD TIME:', end - start)
optim.step()
```
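For context, here is a minimal sketch of what the return and baseline steps in the snippet compute. It assumes `compute_returns` is the standard discounted reward-to-go and that every process produces an episode of the same length, which the element-wise `dist.all_reduce` requires. The names `discounted_returns` and `mean_baseline` are hypothetical stand-ins, and plain Python lists replace tensors so the sketch runs in a single process without `torch`.

```python
from typing import List


def discounted_returns(rewards: List[float], discount: float) -> List[float]:
    """Reward-to-go: G_t = r_t + discount * G_{t+1} (assumed definition)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        returns[t] = running
    return returns


def mean_baseline(per_process_returns: List[List[float]]) -> List[float]:
    """Single-process stand-in for all_reduce (sum) followed by / num_envs.

    Each inner list holds the returns from one process; the result is the
    per-step mean across processes.
    """
    num_envs = len(per_process_returns)
    return [sum(step) / num_envs for step in zip(*per_process_returns)]


if __name__ == "__main__":
    # Two hypothetical "processes" with equal-length episodes.
    returns_a = discounted_returns([1.0, 1.0, 1.0], 0.5)
    returns_b = discounted_returns([0.0, 0.0, 1.0], 0.5)
    baselines = mean_baseline([returns_a, returns_b])
    advantages_a = [g - b for g, b in zip(returns_a, baselines)]
    print("returns (process a):", returns_a)
    print("baselines:", baselines)
    print("advantages (process a):", advantages_a)
```

In the actual code the baseline is shared through the process group, so each process subtracts the same per-step mean from its own returns.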

I am finding that the performance of `loss.backward()` scales rather poorly with the number of subprocesses. For example, with only one subprocess, `backward` takes ~3s to finish, while with 16 subprocesses it takes ~24s to finish on all of them (each subprocess prints the same duration). Is there anything I can do to speed it up? Any ideas will be appreciated.