I am trying to solve a constrained optimization problem in offline RL / bandits (in a recommendation context).

The policy is h(y | x), where y is an action and x is the current input. The policy is parameterized by a simple MLP (x → h1 → h2). I have logged data from a logging policy h_0(y | x). The optimization formulation looks like:

max Utility(h(y | x))

such that, KL-divergence(h || h_0) < eps.

The constraint is helpful in the low-data setting because it makes the learned policy fall back to the logging policy.
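For concreteness, this is the per-context KL term I have in mind (a minimal sketch, assuming discrete actions and that both policies expose full probability vectors over the same action set; function and variable names are illustrative):

```python
import math

def kl_divergence(h_probs, h0_probs):
    """KL(h || h_0) for a single context x, over a discrete action set.

    h_probs:  probabilities h(y | x) for each action y
    h0_probs: probabilities h_0(y | x) for the same actions
    """
    # Terms with h(y|x) == 0 contribute 0 to the sum by convention.
    return sum(p * math.log(p / q) for p, q in zip(h_probs, h0_probs) if p > 0)

# Identical distributions give zero divergence; the constraint is trivially met.
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # → 0.0
```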

I am implementing it as an alternating optimization scheme, with the following pseudo-code:

```
optimizer = Adam(policy.parameters(), lr=lr, weight_decay=1e-5)
kld_weight = 10 / np.sqrt(total_data)

for i, batch in enumerate(dataloader1):
    action, context = batch['action'], batch['context']
    # Negated utility, so that minimizing the loss maximizes Utility(h).
    utility_loss = get_utility(action, context)
    optimizer.zero_grad()
    utility_loss.backward()
    optimizer.step()

    # Alternate: take a couple of KL steps after each utility step.
    for j, batch in enumerate(dataloader2):
        action, context = batch['action'], batch['context']
        kld_loss = kld_weight * get_kld(policy, action, context, logging_policy)
        optimizer.zero_grad()
        kld_loss.backward()
        optimizer.step()
        if j > 1:
            break
```

I am multiplying the kld-loss by 10/sqrt(total_data), so that when the total number of logged actions is very large, the kld-loss shrinks, since we can then have high confidence in the utility loss.
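As a sanity check on this scaling (illustrative only):

```python
import math

total_data = 10**9  # number of logged actions
kld_weight = 10 / math.sqrt(total_data)
print(kld_weight)  # ≈ 0.000316, matching the weight I observe
```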

The strange behavior I am observing is that even with 10^9 datapoints, performance with the kld-loss is better than without it, even though the weight is only ~0.00031. I have also manually set the weight to 0, but I see the same behavior. Any idea why this is happening?