I am trying to train a neural network which takes an input (`input_t0`) and an initial hidden state (call it `s_t0`) and produces a new hidden state (`s_t1`) by transforming the input through a series of transformations (neural network layers). At the next time step, a transformed input (`input_t1`) and the hidden state from the previous time step (`s_t1`) are passed to the same model. This process repeats for a number of steps.
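Concretely, the recurrence looks roughly like this (a minimal sketch, not my real model — `StateUpdater`, the layer sizes, and the step count are placeholders):

```python
import torch
import torch.nn as nn

class StateUpdater(nn.Module):
    """Maps (input_t, s_t) -> s_t1; the final ReLU keeps the state non-negative."""
    def __init__(self, input_dim, state_dim):
        super().__init__()
        self.fc = nn.Linear(input_dim + state_dim, state_dim)

    def forward(self, input_t, s_t):
        return torch.relu(self.fc(torch.cat([input_t, s_t], dim=1)))

model = StateUpdater(input_dim=8, state_dim=16)
s_t = torch.zeros(4, 16)          # initial hidden state s_t0, batch of 4
for t in range(3):                # the same model is reused at every step
    input_t = torch.randn(4, 8)   # stand-in for the transformed input at step t
    s_t = model(input_t, s_t)     # s_t1 becomes the next step's hidden state
```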

The goal of optimization is to keep the distance between `s_t0` and `s_t1` small through self-supervision, since `s_t1` is supposed to be a transformed version of `s_t0`. In other words, I want `s_t1` to only carry the new information in the new input. My intuition is that taking the norm of the weights and ensuring the norm does not go to zero (is this even possible?) would be one way to achieve this. However, I'm afraid that won't necessarily be the best thing to do, as it might not encourage the model to update the state vector with new information.

Currently I train the model by taking the absolute distance between `s_t0` and `s_t1` via `loss = torch.abs(s_t1 - s_t0).mean(dim=1)`. Then I call `loss.backward()` and `optimizer.step()`, which updates the weights. Note that the reason I use `abs()` is that the hidden states are produced after applying ReLU, so they only hold non-negative values. So what is the best way to achieve this and ensure the weights don't go to 0? Could I somehow use mutual information for this?
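Put together, a single training step looks roughly like this (a sketch under simplifying assumptions — a one-layer stand-in for my model; note I reduce the per-example distances to a scalar before calling `backward()`):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the model: one linear layer + ReLU,
# so hidden states only hold non-negative values.
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

s_t0 = torch.rand(4, 16)       # previous hidden state (non-negative), batch of 4
s_t1 = model(s_t0)             # new hidden state

# Per-example L1 distance, then reduced to a scalar so backward() can run.
loss = torch.abs(s_t1 - s_t0).mean(dim=1).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()               # nothing here stops the weights shrinking toward 0
```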

However, I noticed that optimization quickly finds the trivial solution by setting the weights to 0. This causes both `s_t0` and `s_t1` to get smaller and smaller until their difference is 0, which satisfies the constraint but does not yield the behavior I expect. Is there a way to ensure the weights do not go to zero during optimization?
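To make the failure mode concrete, here is a toy check (hypothetical one-layer model, weights forced to zero by hand) showing that the all-zero solution drives every state, and hence the loss, to exactly zero:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU())

# The trivial solution the optimizer converges to: all weights and biases at 0.
with torch.no_grad():
    for p in model.parameters():
        p.zero_()

s_t0 = model(torch.rand(4, 16))   # relu(0) = 0 everywhere
s_t1 = model(s_t0)                # also all zeros
loss = torch.abs(s_t1 - s_t0).mean()
print(loss.item())                # 0.0 -- constraint satisfied, nothing learned
```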