What is the purpose of eps in the REINFORCE example?

jkking · September 10, 2022, 1:47am

PyTorch’s GitHub repository provides an example implementation of the REINFORCE algorithm. When standardizing the returns, it adds a term eps:

returns = (returns - returns.mean()) / (returns.std() + eps)

What is the purpose of this term? It can be found in the example here:

https://github.com/pytorch/examples/blob/main/reinforcement_learning/reinforce.py#L50


      
    def forward(self, x):
        x = self.affine1(x)
        x = self.dropout(x)
        x = F.relu(x)
        action_scores = self.affine2(x)
        return F.softmax(action_scores, dim=1)


policy = Policy()
optimizer = optim.Adam(policy.parameters(), lr=1e-2)
eps = np.finfo(np.float32).eps.item()


def select_action(state):
    state = torch.from_numpy(state).float().unsqueeze(0)
    probs = policy(state)
    m = Categorical(probs)
    action = m.sample()
    policy.saved_log_probs.append(m.log_prob(action))
    return action.item()

ptrblck · September 10, 2022, 2:26am

A small eps value is usually added to the denominator of a division to avoid dividing by zero, which would produce invalid outputs (inf or NaN) and invalid gradients. It is often set to a fixed value, e.g. eps = 1e-6.
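
As a rough illustration of the point above, here is a minimal sketch of the failure case: when every return in an episode is identical, the standard deviation is exactly zero and the standardization becomes 0 / 0. The helper name standardize and the sample values are hypothetical, and the sketch uses NumPy arrays rather than the torch tensors of the original example (NumPy's std defaults to the population formula, while torch's defaults to the sample formula, but both are zero for constant input).

```python
import numpy as np

# Same definition of eps as in the REINFORCE example.
eps = np.finfo(np.float32).eps.item()


def standardize(returns, eps=0.0):
    """Hypothetical helper mirroring the line from the example."""
    returns = np.asarray(returns, dtype=np.float32)
    # Silence the expected divide-by-zero warning so the NaN result is visible.
    with np.errstate(divide="ignore", invalid="ignore"):
        return (returns - returns.mean()) / (returns.std() + eps)


# Illustrative data: every return is the same, so std() is exactly 0.
constant_returns = [1.0, 1.0, 1.0]

without_eps = standardize(constant_returns)         # 0 / 0 in every entry
with_eps = standardize(constant_returns, eps=eps)   # 0 / eps in every entry

print(without_eps)
print(with_eps)
```

Without eps every entry is NaN, which would then propagate into the loss and the gradients. With eps the entries are simply zero.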

jkking · September 10, 2022, 5:12pm

Is there any intuition behind the very explicit definition given in the example? Setting it to something very small but nonzero makes sense, but such an explicit call (starting with np.finfo) seems prescriptive.

vmoens · September 10, 2022, 5:34pm

The call to np.finfo ties the value of eps to the dtype being used.
As the denominator gets closer to 0, the numerical instability you get for a given denominator value differs between single and double precision.
By using finfo you’re making sure that eps is as big as it should be for the data type you’re using.
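
To make the dtype dependence concrete, here is a small sketch of what np.finfo reports for single and double precision. The variable names eps32 and eps64 are only for illustration. The eps attribute is the machine epsilon, i.e. the gap between 1.0 and the next representable number in that dtype. The example hardcodes np.float32, which matches PyTorch's default tensor dtype.

```python
import numpy as np

# Machine epsilon for each precision: the gap between 1.0 and the next
# representable value in that dtype.
eps32 = np.finfo(np.float32).eps.item()  # 2**-23, about 1.19e-07
eps64 = np.finfo(np.float64).eps.item()  # 2**-52, about 2.22e-16

print(eps32, eps64)

one = np.float32(1.0)

# Adding a full float32 epsilon to 1.0 yields a different float32 value...
bumped = one + np.float32(eps32)

# ...while adding half of it is rounded away and leaves 1.0 unchanged.
unchanged = one + np.float32(eps32 / 2)

print(bumped != one, unchanged == one)
```

A value like the float64 epsilon would be lost entirely if it were added to a float32 quantity of order 1, which is one way to see why the constant is chosen per dtype.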