Hi, I'm implementing Vanilla Policy Gradient (REINFORCE) with GAE for advantage estimation, using the Spinning Up implementation as a reference.
While training, I found a significant performance difference between the two: my implementation takes about 1255 seconds, while the Spinning Up one takes only 169 seconds.
Profiling details:

Spinning Up VPG:

```
43216335 function calls (40435708 primitive calls) in 169.783 seconds

   Ordered by: internal time

        ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       1200300   18.082    0.000   18.082    0.000 {method 'matmul' of 'torch._C._TensorBase' objects}
          4050   14.348    0.004   14.348    0.004 {method 'run_backward' of 'torch._C._EngineBase' objects}
        808500   13.275    0.000   13.275    0.000 {built-in method tanh}
        200150   12.876    0.000   12.876    0.000 {method 'logsumexp' of 'torch._C._TensorBase' objects}
       1212750    9.589    0.000   38.420    0.000 functional.py:1355(linear)
3033950/404250    9.217    0.000   74.797    0.000 module.py:531(__call__)
          2830    6.419    0.002    6.419    0.002 {method 'read' of '_io.FileIO' objects}
         12450    5.643    0.000    5.643    0.000 {built-in method addmm}
        200300    4.866    0.000    4.866    0.000 {built-in method as_tensor}
       1212750    4.790    0.000    4.790    0.000 {method 't' of 'torch._C._TensorBase' objects}
        200000    4.332    0.000    7.135    0.000 cartpole.py:91(step)
        200150    3.459    0.000   17.169    0.000 categorical.py:44(__init__)
        404250    3.391    0.000   69.865    0.000 container.py:90(forward)
       1212750    3.138    0.000   42.537    0.000 linear.py:86(forward)
        200050    2.890    0.000  104.369    0.001 core.py:126(step)
             1    2.766    2.766  157.027  157.027 vpg.py:89(vpg)
```
Mine:

```
39947968 function calls (37109426 primitive calls) in 1255.151 seconds

   Ordered by: internal time

        ncalls  tottime  percall  cumtime  percall filename:lineno(function)
          4050  757.922    0.187  757.922    0.187 {method 'run_backward' of 'torch._C._EngineBase' objects}
          4000  307.777    0.077  335.365    0.084 vpg.py:248(_compute_value_function_loss)
          4150   27.596    0.007   27.596    0.007 {method 'mean' of 'torch._C._TensorBase' objects}
             1   22.205   22.205 1253.417 1253.417 vpg.py:59(learn)
       1200150   20.096    0.000   20.096    0.000 {method 'matmul' of 'torch._C._TensorBase' objects}
        808200   13.866    0.000   13.866    0.000 {built-in method tanh}
        200050   13.662    0.000   13.662    0.000 {method 'logsumexp' of 'torch._C._TensorBase' objects}
3232800/404100   10.787    0.000  101.064    0.000 module.py:531(__call__)
       1212300   10.297    0.000   42.361    0.000 functional.py:1355(linear)
         12150    6.157    0.001    6.157    0.001 {built-in method addmm}
```
As you can see, backward propagation takes most of the execution time in my implementation: `run_backward` averages about 0.187 s per call in mine versus about 0.0035 s in Spinning Up, and `_compute_value_function_loss` alone accounts for a further 307 s.
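To narrow down where the backward time goes, one thing I could do is profile a single backward pass with PyTorch's built-in autograd profiler (a minimal sketch; `value_loss` stands for the loss tensor from the training loop shown below):

```python
import torch

# Profile one backward pass to see which autograd ops dominate its cost.
with torch.autograd.profiler.profile() as prof:
    value_loss.backward()
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```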
In both implementations, the value function is updated 80 times per epoch by default, like this:
```python
for _ in range(self.n_value_gradients):  # 80 iterations by default
    all_values = self.value_function(all_observations_tensor)
    value_loss = self._compute_value_function_loss(all_values, discounted_returns_tensor)
    self.value_function.optimizer.zero_grad()
    value_loss.backward()
    self.value_function.optimizer.step()
```
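Given that `_compute_value_function_loss` itself averages about 0.08 s per call, I suspect it builds the loss out of many small tensor operations (e.g. a per-element Python loop), which would also inflate the autograd graph that `run_backward` has to traverse. For comparison, Spinning Up computes its value loss as a single vectorized mean-squared error; a minimal sketch of that form, assuming MSE is what my loss is meant to compute:

```python
import torch

def compute_value_function_loss(values: torch.Tensor,
                                returns: torch.Tensor) -> torch.Tensor:
    """Vectorized MSE value loss (hypothetical replacement sketch)."""
    # Detach the regression targets so no autograd history flows through them;
    # a target with grad history makes every backward pass more expensive.
    returns = returns.detach()
    # squeeze(-1) aligns a [N, 1] value output with [N] returns, then the loss
    # is one fused element-wise op plus one reduction over the whole batch.
    return ((values.squeeze(-1) - returns) ** 2).mean()
```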
So far, I have confirmed that the following parameters are the same in both implementations:
- number of network parameters: policy: 4610, value_fn: 4545 (counted as in the snippet after this list)
- network architecture (two hidden layers with 64 units each)
- total environment interactions
- number of value function updates
- learning rate for both the policy and the value function
- gym environment: CartPole-v0
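For completeness, this is how I counted the parameters (a minimal sketch; `policy` and `value_function` stand for the two `torch.nn.Module`s):

```python
import torch.nn as nn

def count_parameters(module: nn.Module) -> int:
    # Sum the element counts of all trainable tensors in the module.
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Expected: policy -> 4610 and value_fn -> 4545 for two hidden layers of
# 64 units on CartPole-v0 (4 observation dims, 2 actions).
```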
Could you give me some advice on how to improve this?
Links:
- my implementation
- Spinning Up documentation
- Spinning Up code