How do I ensure that optimization does not find the trivial solution by setting weights to 0?

I am trying to train a neural network which takes as input (input_t0) and an initial hidden state (call it s_t0) and produces a new hidden state (s_t1) by transforming the input via a series of transformations (neural network layers). At the next time step, a transformed input (input_t1) and the hidden state from the previous time step (s_t1) is passed to the same model. This process keeps repeating for a couple of steps.

The goal of optimization is to ensure the distance between s_t0 and s_t1 is small through self-supervision, as s_t1 is supposed to be an transformed version of s_t0. In other words, I want s_t1 to only carry new information in the new input. My intuition tells me taking the norm of the weights and ensuring the norm does not go to zero (is this even possible?) would be one way to achieve this. However, I’m afraid won’t be the best thing to do necessarily, as it might not encourage the model to update the state vector with new information.

Currently the way I train the model is by taking the absolute distance between s_t0 and s_t1 via loss = torch.abs(s_t1 - s_t0).mean(dim=1). Then I call loss.backward() and optimizer.step() which changes the weights. Note that the reason that I use abs() is that the hidden states are produced after applying ReLU, so the only hold positive values. So what is the best way to achieve this and ensure the weights don’t go to 0? Would I be able to somehow use mutual information for this?

However, I noticed that optimization quickly finds the trivial solution by setting weights to 0. This causes both s_t0 and s_t1 get smaller and smaller until their difference is 0, which satisfies the constraint but does not yield the behavior I expect. Is there a way to ensure weights do not go to zero during optimization?