Hello!
I solved it in this PR.
The issue was that your policy wasn't entirely on CUDA: part of it (namely the exploration module) was still on CPU. When you ask the collector to run the policy, it casts everything to CUDA, and there was a bug that made the collector lose track of the original tensors (that is what the PR fixes).
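To make that concrete, here is a minimal sketch of how a policy can end up split across devices. The DQN-style CartPole setup and the names (`agent`, `agent_explore`) are assumptions for illustration, not your actual script:

```python
import torch
from tensordict.nn import TensorDictSequential
from torchrl.envs import GymEnv
from torchrl.modules import MLP, EGreedyModule, QValueActor

device = torch.device("cuda")
env = GymEnv("CartPole-v1")

# The Q-network is moved to CUDA explicitly...
agent = QValueActor(
    MLP(out_features=env.action_spec.shape[-1]), spec=env.action_spec
).to(device)

# ...but the exploration module is created afterwards and never moved,
# so the composed policy lives half on CUDA, half on CPU
# (the epsilon buffers of the exploration module are CPU tensors).
agent_explore = TensorDictSequential(
    agent,
    EGreedyModule(spec=env.action_spec, eps_init=1.0, eps_end=0.05),
)
```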
You need to correct your script a bit though:
- you could do `agent_explore = agent_explore.to(device)` before building the collector, in which case you don't need the PR (see the sketch after the snippet below);
- if you don't do that, use the PR (nightly build) and add a call to `collector.update_policy_weights_()` just after your model update:
```python
[...]
total_count += data.numel()
total_episodes += data["next", "done"].sum()
if i % 10 == 0:
    my_logger.info(f"Step: {i}, max. count / epi reward: {max_length} / {max_reward}.")
# sync the collector's copy of the policy with the freshly updated weights
collector.update_policy_weights_()
```
That will copy the CPU buffers onto the GPU.
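For the first option, here is a minimal sketch of what it could look like, reusing the assumed names from the sketch above; the collector arguments are illustrative, not taken from your script:

```python
from torchrl.collectors import SyncDataCollector

# Move the *whole* policy, exploration module included, to CUDA up front;
# the collector then has nothing left to transfer, so the PR is not needed.
agent_explore = agent_explore.to(device)

collector = SyncDataCollector(
    env,
    agent_explore,
    frames_per_batch=256,   # illustrative values
    total_frames=100_000,
    device=device,
)
```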
I also spotted a bug when you have partial devices (i.e., one device for the policy and one for the env), which I fixed in [BugFix] Fix device transfer for collectors with init_random_frames mixed devices (pytorch/rl#2704).
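If you do split devices like that, the collector lets you declare them explicitly. Here is a minimal sketch under the assumption that the env runs on CPU and the policy on CUDA, again with illustrative argument values:

```python
from torchrl.collectors import SyncDataCollector

# Keep the env on CPU and the policy on CUDA; collected data is stored on CPU.
# init_random_frames exercises the code path fixed in pytorch/rl#2704.
collector = SyncDataCollector(
    env,
    agent_explore,
    frames_per_batch=256,     # illustrative values
    total_frames=100_000,
    init_random_frames=1_000,
    env_device="cpu",
    policy_device="cuda",
    storing_device="cpu",
)
```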
With these changes, I can solve the task in a similar number of iterations in every configuration.
LMK if that works!