Bug in Torchrl Tutorial PPO Example

Hi community,

I am following the torchrl tutorial PPO example to learn how torchrl works.

However, the loss function part of the tutorial raises an error. When I run it on my side (torchrl v0.1.1 + pytorch 2.0), I get:

Traceback (most recent call last)
Cell In[12], line 1
----> 1 advantage_module = GAE(
      2     gamma=gamma, lmbda=lmbda, value_network=value_module, average_gae=True
      3 )
      4 loss_module = ClipPPOLoss(
      5     actor=policy_module,
      6     critic=value_module,
     15     loss_critic_type="smooth_l1",
     16 )
     18 optim = torch.optim.Adam(loss_module.parameters(), lr)

File ~/miniconda3/envs/torch_rl/lib/python3.9/site-packages/torchrl/objectives/value/advantages.py:779, in GAE.__init__(self, gamma, lmbda, value_network, average_gae, differentiable, vectorized, advantage_key, value_target_key, value_key, skip_existing)
    765 def __init__(
    766     self,
    767     *,
    777     skip_existing: Optional[bool] = None,
    778 ):
--> 779     super().__init__(
    780         value_network=value_network,
    781         differentiable=differentiable,
    782         advantage_key=advantage_key,
    119     )
    121 self.advantage_key = advantage_key
    122 self.value_target_key = value_target_key

KeyError: "value key 'state_value' not found in value network out_keys."

There is also an error in the training output part of the DQN example.

Is there any example that runs successfully?


Thanks for raising this, we’ll issue a fix asap

Thanks. Please let me know when your team fixes it.

This is weird, I can execute the code locally on torchrl v0.1.1, torch v2.0.1 and tensordict v0.1.2
Are you using these versions?

I am using torchrl v0.1.1, torch v2.0.0 and tensordict v0.1.2.

Also, this line

print("state_spec:", env.state_spec)

raises an error on my side:

AttributeError: 'InvertedDoublePendulumEnv' object has no attribute 'state_spec'

I updated pytorch from v2.0.0 to v2.0.1 and the error is still there.

In addition, I have also tried the examples in the GitHub repo.

None of them work for me. For example,

$python sac/sac.py env_name="HalfCheetah-v4" env_task="" env_library="gym"

sys:1: UserWarning:
'config' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/home/hai/miniconda3/envs/torch_rl/lib/python3.9/site-packages/hydra/main.py:94: UserWarning:
'config' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
self.log_dir: sac_logging/SAC__c1c1aac0_23_07_10-15_27_52
/home/hai/miniconda3/envs/torch_rl/lib/python3.9/site-packages/torch/nn/modules/lazy.py:180: UserWarning: Lazy modules are a new feature under heavy development so changes to the API or functionality can happen at any moment.
  warnings.warn('Lazy modules are a new feature under heavy development '
/home/hai/miniconda3/envs/torch_rl/lib/python3.9/site-packages/torchrl/collectors/collectors.py:1182: UserWarning: total_frames (1000000) is not exactly divisible by frames_per_batch (1024).This means 448 additional frames will be collected.To silence this message, set the environment variable RL_WARNINGS to False.
Error executing job with overrides: ['env_name=HalfCheetah-v4', 'env_task=', 'env_library=gym']
Traceback (most recent call last):
  File "/home/hai/rl/examples/sac/sac.py", line 173, in main
  File "/home/hai/miniconda3/envs/torch_rl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Compose:
        Missing key(s) in state_dict: "transforms.1.standard_normal", "transforms.1.loc", "transforms.1.scale".
        Unexpected key(s) in state_dict: "transforms.0.standard_normal", "transforms.0.loc", "transforms.0.scale", "transforms.2.standard_normal", "transforms.2.loc", "transforms.2.scale".

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Can you print what your conda env looks like?

I might have an explanation:
I think (but could be wrong) that you cloned torchrl, and you’re executing the examples on main with torchrl 0.1.1, but the main branch of torchrl corresponds to 0.2.0dev (the next release).
So if I’m right, either check out v0.1.1 in your GitHub clone or use the nightly release.
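A quick way to compare the installed packages against the versions mentioned in this thread is to print them from the same Python environment that runs the examples. This is only a minimal sketch: the helper name `installed_versions` is made up for illustration, and it reports the versions registered with the package installer, which may differ from a local source checkout.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("torch", "torchrl", "tensordict")):
    """Return {package name: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not registered in this environment
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

If the torchrl version printed here does not match the tag or branch of the cloned examples, that mismatch is a plausible cause of errors like the state_dict one above.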

I checked out the v0.1.1 tag, not the main branch.

$ git status
HEAD detached at v0.1.1

Could it be that you’re executing the code within path/to/torchrl and that python is struggling between importing torchrl from your conda env and the local folder named torchrl?
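One way to check for this kind of shadowing is to ask Python where it would load a module from, without actually importing it. The sketch below is an illustration only: `module_origin` and `is_under_cwd` are hypothetical helper names, and the check simply tests whether the resolved path sits under the current working directory.

```python
import importlib.util
import os


def module_origin(name):
    """Return the file or folder Python would load `name` from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    if spec.origin is not None:
        return spec.origin
    # Namespace packages (folders without __init__.py) have no origin
    locations = list(spec.submodule_search_locations or [])
    return locations[0] if locations else None


def is_under_cwd(name):
    """True if `name` resolves to a path inside the current working directory."""
    origin = module_origin(name)
    if origin is None:
        return False
    cwd = os.path.realpath(os.getcwd())
    try:
        return os.path.commonpath([cwd, os.path.realpath(origin)]) == cwd
    except ValueError:
        # Paths on different drives (Windows) cannot share a common path
        return False


if __name__ == "__main__":
    print("torchrl resolves to:", module_origin("torchrl"))
    print("loaded from the working directory:", is_under_cwd("torchrl"))
```

If the printed path points into the cloned repo rather than into site-packages, the local folder is taking precedence over the installed package.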

No, I am running coding_ppo.ipynb located in my home folder (not in the rl folder). I exported the .ipynb to .py and removed the rl repo folder. It still raises the same error. Would you mind testing it on your side?

# %%
from collections import defaultdict

import matplotlib.pyplot as plt
import torch
from tensordict.nn import TensorDictModule
from tensordict.nn.distributions import NormalParamExtractor
from torch import nn
from torchrl.collectors import SyncDataCollector
from torchrl.data.replay_buffers import ReplayBuffer
from torchrl.data.replay_buffers.samplers import SamplerWithoutReplacement
from torchrl.data.replay_buffers.storages import LazyTensorStorage
from torchrl.envs import (
from torchrl.envs.libs.gym import GymEnv
from torchrl.envs.utils import check_env_specs, ExplorationType, set_exploration_type
from torchrl.modules import ProbabilisticActor, TanhNormal, ValueOperator
from torchrl.objectives import ClipPPOLoss
from torchrl.objectives.value import GAE
from tqdm import tqdm

# %% [markdown]
# ### Hyper parameters

# %%
# training parameters
device = "cuda"
num_cells = 256
lr = 3e-4
max_grad_norm = 1.0

# data collection parameters
frame_skip = 1
frames_per_batch = 1000 // frame_skip
# For a complete training, bring the number of frames up to 1M
total_frames = 100000 // frame_skip

# PPO parameters
sub_batch_size = 64
num_epochs = 10
clip_epsilon = (
    0.2  # clip value for PPO loss: see the equation in the intro for more context.
)
gamma = 0.99
lmbda = 0.95
entropy_eps = 1e-4

# %% [markdown]
# ### Environment Define

# %%
base_env = GymEnv("InvertedDoublePendulum-v4", device=device, frame_skip=frame_skip)
env = TransformedEnv(
        # normalize observations
env.transform[0].init_stats(num_iter=1000, reduce_dim=0, cat_dim=0)

# %%
print("normalization constant shape:", env.transform[0].loc.shape)
print("observation_spec:", env.observation_spec)
print("reward_spec:", env.reward_spec)
print("done_spec:", env.done_spec)
print("action_spec:", env.action_spec)
# print("state_spec:", env.state_spec)

# %% [markdown]
# ### PPO Policy

# %%
actor_net = nn.Sequential(
    nn.LazyLinear(num_cells, device=device),
    nn.LazyLinear(num_cells, device=device),
    nn.LazyLinear(num_cells, device=device),
    nn.LazyLinear(2 * env.action_spec.shape[-1], device=device),
policy_module = TensorDictModule(
    actor_net, in_keys=["observation"], out_keys=["loc", "scale"]
policy_module = ProbabilisticActor(
    in_keys=["loc", "scale"],
        "min": env.action_spec.space.minimum,
        "max": env.action_spec.space.maximum,
    # we'll need the log-prob for the numerator of the importance weights
print("Running policy:", policy_module(env.reset()))

# %% [markdown]
# ### Value Network

# %%
value_net = nn.Sequential(
    nn.LazyLinear(num_cells, device=device),
    nn.LazyLinear(num_cells, device=device),
    nn.LazyLinear(num_cells, device=device),
    nn.LazyLinear(1, device=device),

value_module = ValueOperator(

print("Running value:", value_module(env.reset()))

# %% [markdown]
# ### Data collector and Replay buffer

# %%
collector = SyncDataCollector(

replay_buffer = ReplayBuffer(

# %% [markdown]
# ### Loss function

# %%
advantage_module = GAE(
    gamma=gamma, lmbda=lmbda, value_network=value_module, average_gae=True
)
loss_module = ClipPPOLoss(
    # these keys match by default but we set this for completeness

Why is out_keys empty?
That seems like the most obvious explanation for the error message above. I think if you comment out that line, things should work OK.

hmm… This is the issue.

When I was reading the tutorial, I thought the ValueOperator here should be equal or similar to TensorDictModule. So I tried replacing ValueOperator with TensorDictModule and added ‘out_keys=[]’ (should it be out_keys=None?). I found that the following call works

print("Running value:", value_module(env.reset()))

with both ValueOperator and TensorDictModule.

Thanks for your time on my noob bug here.


ValueOperator automatically computes the out_keys for you. If you’re using a regular TDModule, you can pass out_keys=["state_value"] or whichever other name you please (provided you tell GAE and PPO where to find the value)