Deep Deterministic Policy Gradient implementation

Hi, I want to use DDPG in my project, so I set out to first get a working example. I’ve found this nice implementation in Keras, which works well, and decided to translate it 1:1 to PyTorch. Unfortunately, the model doesn’t learn. Since I have no friends to ask, could someone please have a look at my code and try to spot where I’ve made a mistake?

Do you have an idea which part of the code might be wrong, or where you were unsure how to implement a specific use case in PyTorch?

If not, you could try to check the model implementation first, e.g. by using a constant input (a tensor of all ones should work). Of course you would need to load the parameters from one model into the other, and I don’t know how hard this would be.
Once the models return the same output, you could debug the code step by step.
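A minimal sketch of that constant-input check for a single dense layer. The `keras_kernel` and `keras_bias` tensors are hypothetical stand-ins for weights exported from the Keras model; the only layout fact relied on is that a Keras Dense kernel has shape (in_features, out_features) while nn.Linear stores its weight as (out_features, in_features).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for weights exported from a Keras Dense layer.
# Keras keeps the kernel as (in_features, out_features).
keras_kernel = torch.randn(3, 4)
keras_bias = torch.randn(4)

layer = nn.Linear(3, 4)
with torch.no_grad():
    # nn.Linear keeps weight as (out_features, in_features), so transpose.
    layer.weight.copy_(keras_kernel.t())
    layer.bias.copy_(keras_bias)

layer.eval()
x = torch.ones(2, 3)  # constant input, batch of 2
with torch.no_grad():
    out = layer(x)

# What the Keras layer would compute for the same input: x @ kernel + bias.
expected = x @ keras_kernel + keras_bias
print(torch.allclose(out, expected, atol=1e-6))
```

Repeating this layer by layer should narrow down where the two models start to disagree.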

I found one problem: where is the terminal flag?

reward + discount * ~terminal * next_value

The other parts seem to be OK.
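As a rough sketch of what the masked target looks like in PyTorch (the batch values here are made up, and `terminal` is assumed to be a boolean tensor marking transitions that ended the episode):

```python
import torch

discount = 0.99

# Made-up batch of three transitions, shaped (batch, 1).
reward = torch.tensor([[1.0], [0.5], [2.0]])
next_value = torch.tensor([[10.0], [20.0], [30.0]])
terminal = torch.tensor([[False], [True], [False]])

# Convert the mask to float so terminal transitions contribute no bootstrap term.
not_terminal = (~terminal).float()
y = reward + discount * not_terminal * next_value
print(y)
```

For the second transition the target collapses to the reward alone, because there is no next state to bootstrap from.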

@ptrblck I was most unsure about how to define a network that processes its inputs in parallel and then concatenates the results. See the Critic class and its forward method.

@iffiX that’s an interesting point. I think you are referring to the definition of ‘y’. But in the original Keras code there is no terminal in that line either.

The Critic implementation looks correct to me.
Note that I’m not deeply familiar with Keras and assume that the Concatenate layer concatenates the input tensors in the “feature dimension”.
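For reference, a minimal sketch of a two-branch critic with concatenation along the feature dimension. The layer sizes and branch names are illustrative assumptions, not the ones from the code under discussion.

```python
import torch
import torch.nn as nn


class Critic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=32):
        super().__init__()
        # Each input gets its own branch (sizes are illustrative).
        self.state_branch = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.action_branch = nn.Sequential(nn.Linear(action_dim, hidden), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, state, action):
        s = self.state_branch(state)
        a = self.action_branch(action)
        # dim=1 is the feature dimension for (batch, features) tensors.
        merged = torch.cat([s, a], dim=1)
        return self.head(merged)


critic = Critic(state_dim=3, action_dim=1)
q = critic(torch.ones(4, 3), torch.ones(4, 1))
print(q.shape)
```

The output has one Q-value per batch row, which is what the critic loss expects.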

It was pointed out to me that I should call model.eval() and model.train() every time I want to call my models. That’s because I use batch normalization, and it operates differently depending on the train flag. Now the model learns correctly.
Is it a good idea to always wrap model calls with eval/train?

r = model(input)

Oh, that point was neglected, “eval” and “train” are not needed if you are not using something special like

Yes, I would recommend always calling model.train() before training and model.eval() before evaluating or testing the model. Even if your current model does not use any batchnorm or dropout layers, the model itself or custom layers might use the attribute to change their behavior.
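One way to wrap such calls, as a sketch: the `predict` helper below is a hypothetical name, not part of PyTorch. It switches to eval mode, runs without gradient tracking, and restores whatever mode the model was in before.

```python
import torch
import torch.nn as nn


def predict(model, x):
    # Hypothetical helper: run a forward pass in eval mode, then restore the mode.
    was_training = model.training
    model.eval()
    try:
        with torch.no_grad():
            return model(x)
    finally:
        model.train(was_training)


actor = nn.Sequential(
    nn.Linear(3, 8), nn.BatchNorm1d(8), nn.ReLU(), nn.Linear(8, 1), nn.Tanh()
)
actor.train()

# A batch of one is fine here because the batchnorm layer is in eval mode
# during the call and uses its running statistics.
action = predict(actor, torch.ones(1, 3))
print(action.shape, actor.training)
```

This is mainly useful for action selection, where a single state is passed through a model that is otherwise kept in training mode.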

Somehow this doesn’t seem right:


        target_actions = target_actor(next_state_batch)
        y = reward_batch + gamma * target_critic([next_state_batch, target_actions])
        critic_value = critic_model([state_batch, action_batch])
        critic_loss = torch.mean(torch.square(y - critic_value))


Didn’t I just exclude the batch norm parameters from training, effectively getting rid of normalization? I mean, this is part of the training routine, so it seems logical to have the model in train() mode. I have read the original Batch Norm paper and it seems to confirm that.
Now I’m really confused.
Just for testing I entirely commented out batch normalization and the model is training well without it.

So the conclusion is, there is something wrong with my batch normalization and I still don’t know what…

I missed that the previous post was related to the model implementation and not a general question.
By calling critic_model.eval(), you would use the running stats instead of the batch stats to normalize the data. The affine parameters of the batchnorm layers would still be trained.
I’m not familiar with the model, but why did you add this line of code and what is the Keras model doing?
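The point about eval mode can be checked directly. A small sketch with a standalone nn.BatchNorm1d layer and random data: in eval mode the running statistics stay untouched, while the affine weight and bias still receive gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(4)
bn.eval()
running_mean_before = bn.running_mean.clone()

x = torch.randn(8, 4)
loss = bn(x).pow(2).mean()
loss.backward()

# The affine parameters (gamma and beta) still get gradients in eval mode.
print(bn.weight.grad is not None, bn.bias.grad is not None)
# The running statistics are not updated in eval mode.
print(torch.equal(bn.running_mean, running_mean_before))
```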

I’ve added this line because I tried all combinations of eval and train, and learning works only when I put eval everywhere. But that probably means that batch norm is not used at all.
I also tried removing the batchnorm layers altogether, and that enables learning too.
The Keras model probably also has a slight bug, as it always keeps the batchnorm layer in evaluation mode. But surprisingly, when I put it in training mode, its learning ability is not affected.
I am thoroughly confused right now and I will probably go carefully through the implementations to debug what’s going on. Just like you suggested in your first post. I think I need to debug my autograd chain. Is there a way to see it as a graph?

Try using this:

import sys

from torchviz import make_dot


def visualize_graph(final_tensor, visualize_dir="", exit_after_vis=True):
    """
    Visualize a PyTorch flow graph.

    Args:
        final_tensor: The last output tensor of the flow graph.
        visualize_dir: Directory to place the visualized files.
        exit_after_vis: Whether to exit the whole program
            after visualization.
    """
    g = make_dot(final_tensor)
    g.render(directory=visualize_dir, view=False, quiet=True)
    if exit_after_vis:
        sys.exit(0)
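If installing torchviz and Graphviz is not an option, a rough text-only alternative is to walk the autograd graph through `grad_fn.next_functions`. The `walk_graph` helper below is a hypothetical name, and the node names it prints can differ between PyTorch versions.

```python
import torch
import torch.nn as nn


def walk_graph(fn, depth=0, lines=None):
    # Hypothetical helper: collect the autograd node names as an indented tree.
    if lines is None:
        lines = []
    if fn is None:  # inputs that do not require grad show up as None
        return lines
    lines.append("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk_graph(next_fn, depth + 1, lines)
    return lines


model = nn.Linear(3, 1)
loss = model(torch.ones(2, 3)).pow(2).mean()

graph_lines = walk_graph(loss.grad_fn)
print("\n".join(graph_lines))
```

The AccumulateGrad leaves correspond to the parameters that will receive gradients, so a parameter missing from the tree is not part of the loss.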

Thank you for your help. I’ve spent some more time with my models, and what I’ve found is that nn.BatchNorm1d (PyTorch) works differently from layers.BatchNormalization (Keras). There are some discussions about that on the internet, but none are conclusive. I can’t work out how to make them behave the same.

In the end I’ve replaced nn.BatchNorm1d with nn.LayerNorm and the agent learns flawlessly.

I would really like to know what’s wrong with nn.BatchNorm1d in PyTorch. Until I find out I’ve got to avoid it.
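A sketch of that swap, with illustrative layer sizes rather than the ones from my model. Since nn.LayerNorm normalizes each sample over its features and keeps no running statistics, it gives the same output in train and eval mode and works with a batch of one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative actor: nn.LayerNorm(16) sits where nn.BatchNorm1d(16) would have been.
actor = nn.Sequential(
    nn.Linear(3, 16), nn.LayerNorm(16), nn.ReLU(), nn.Linear(16, 1), nn.Tanh()
)

x = torch.ones(1, 3)  # a single state

actor.train()
out_train = actor(x)
actor.eval()
out_eval = actor(x)

# No dependence on the train/eval flag and no batch statistics involved.
print(torch.allclose(out_train, out_eval))
```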

Any update on this? I started using PyTorch for my RL model and I have also found that BatchNormalization in TensorFlow behaves differently from nn.BatchNorm1d in PyTorch. Have you just applied LayerNorm, paying attention to call .train() and .eval() in the right places?

I have found another thread where it is said that BatchNorm1d, when in model.train() mode, does not use the running stats to normalize the input of the layer, which seems confusing to me. Does that mean that the layer just computes the running stats and uses them to normalize the input only when evaluating the model? I would say that in this case the learning would be really wrong.

Actually, I think it turned out that the problem was in Keras… I have reported an issue there and they have removed batch normalization from the example in question, because it prevented the model from learning.
Please see this thread: DDPG example uses BatchNormalization incorrectly · Issue #198 · keras-team/keras-io · GitHub.

As for your question, batch norm normalizes differently during training and inference. During training, the stats of the current batch are used; during inference, the moving stats of all batches seen so far are used.

(batch - mean(batch)) / sqrt(var(batch) + epsilon) * gamma + beta
(batch - self.moving_mean) / sqrt(self.moving_var + epsilon) * gamma + beta
Source: BatchNormalization layer

In my view, PyTorch docs say the same, although maybe in a bit less concise form
"...Also by default, during training this layer keeps running estimates of its computed mean and variance, which are then used for normalization during evaluation..."
Source: BatchNorm1d — PyTorch 1.8.1 documentation
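Both modes can be reproduced by hand in PyTorch. A small sketch with random data and `affine=False` (so gamma and beta are left out and only the normalization is compared); note the division by the square root of the variance plus epsilon, and that the batch variance used in training mode is the biased one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(3, affine=False)  # no gamma/beta, to isolate normalization
x = torch.randn(16, 3) * 5 + 2

# Training mode: normalize with the statistics of the current batch.
bn.train()
out_train = bn(x)
manual_train = (x - x.mean(0)) / torch.sqrt(x.var(0, unbiased=False) + bn.eps)

# Eval mode: normalize with the running statistics accumulated so far.
bn.eval()
out_eval = bn(x)
manual_eval = (x - bn.running_mean) / torch.sqrt(bn.running_var + bn.eps)

print(torch.allclose(out_train, manual_train, atol=1e-5))
print(torch.allclose(out_eval, manual_eval, atol=1e-5))
# After a single training batch the running stats are still far from the
# batch stats, so the two modes give clearly different outputs.
print(torch.allclose(out_train, out_eval, atol=1e-2))
```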

In my project I also removed batch norm, and layer norm too. Layer norm worked well, but a pure MLP without any normalization worked too, so why complicate things? I’ve read somewhere that layer norm in general behaves better than batch norm.