Hello!
I solved it in this PR.
The issue was that your policy wasn't entirely on CUDA: part of it (namely the exploration module) was still on CPU. When you ask the collector to run the policy, it casts everything to CUDA, and there was a bug that made the collector lose track of the original tensors (that is what the PR fixes).
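To make that concrete, here is a minimal sketch of how a policy can end up split across devices. The DQN-style CartPole setup and the names (`agent`, `agent_explore`) are assumptions for illustration, not your actual script:

```python
import torch
from tensordict.nn import TensorDictSequential
from torchrl.envs import GymEnv
from torchrl.modules import MLP, EGreedyModule, QValueActor

device = torch.device("cuda")
env = GymEnv("CartPole-v1")

# The Q-network is moved to CUDA explicitly...
agent = QValueActor(
    MLP(out_features=env.action_spec.shape[-1]), spec=env.action_spec
).to(device)

# ...but the exploration module is created afterwards and never moved,
# so the composed policy lives half on CUDA, half on CPU
# (the epsilon buffers of the exploration module are CPU tensors).
agent_explore = TensorDictSequential(
    agent,
    EGreedyModule(spec=env.action_spec, eps_init=1.0, eps_end=0.05),
)
```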
You need to correct your script a bit though:
- you could do `agent_explore = agent_explore.to(device)` before building the collector, in which case you don't need the PR (see the sketch after the snippet below);
- if you don't do that, use the PR (nightly build) and add a call to `collector.update_policy_weights_()` just after your model update:
```python
[...]
total_count += data.numel()
total_episodes += data["next", "done"].sum()
if i % 10 == 0:
    my_logger.info(f"Step: {i}, max. count / epi reward: {max_length} / {max_reward}.")
# sync the collector's copy of the policy with the freshly updated weights
collector.update_policy_weights_()
```
That will copy the CPU buffers onto the GPU.
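For the first option, here is a minimal sketch of what it could look like, reusing the assumed names from the sketch above; the collector arguments are illustrative, not taken from your script:

```python
from torchrl.collectors import SyncDataCollector

# Move the *whole* policy, exploration module included, to CUDA up front;
# the collector then has nothing left to transfer, so the PR is not needed.
agent_explore = agent_explore.to(device)

collector = SyncDataCollector(
    env,
    agent_explore,
    frames_per_batch=256,   # illustrative values
    total_frames=100_000,
    device=device,
)
```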
I also spotted a bug when you have partial devices (i.e., one device for the policy and one for the env), which I fixed in [BugFix] Fix device transfer for collectors with init_random_frames mixed devices (pytorch/rl#2704).
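If you do split devices like that, the collector lets you declare them explicitly. Here is a minimal sketch under the assumption that the env runs on CPU and the policy on CUDA, again with illustrative argument values:

```python
from torchrl.collectors import SyncDataCollector

# Keep the env on CPU and the policy on CUDA; collected data is stored on CPU.
# init_random_frames exercises the code path fixed in pytorch/rl#2704.
collector = SyncDataCollector(
    env,
    agent_explore,
    frames_per_batch=256,     # illustrative values
    total_frames=100_000,
    init_random_frames=1_000,
    env_device="cpu",
    policy_device="cuda",
    storing_device="cpu",
)
```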
With these changes, I can solve the task in a similar number of iterations in every configuration.
LMK if that works!