I am trying to train a neural network that takes an input (`input_t0`) and an initial hidden state (call it `s_t0`) and produces a new hidden state (`s_t1`) by transforming the input through a series of transformations (neural network layers). At the next time step, the next input (`input_t1`) and the hidden state from the previous time step (`s_t1`) are passed to the same model. This process repeats for several steps.
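Concretely, the unrolled computation looks roughly like this (a minimal sketch, not my actual model; `StepModel`, the dimensions, and the step count are stand-ins):

```python
import torch
import torch.nn as nn

# Minimal sketch of the recurrent setup described above.
# The real model is more complex; this only shows the data flow.
class StepModel(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.fc = nn.Linear(input_dim + hidden_dim, hidden_dim)

    def forward(self, x, s):
        # Combine the current input with the previous hidden state and
        # produce the next hidden state (ReLU keeps it non-negative).
        return torch.relu(self.fc(torch.cat([x, s], dim=1)))

model = StepModel(input_dim=8, hidden_dim=16)
s = torch.zeros(4, 16)                           # s_t0: initial hidden state (batch of 4)
inputs = [torch.randn(4, 8) for _ in range(3)]   # input_t0, input_t1, input_t2

states = []
for x in inputs:
    s = model(x, s)    # s_t1 = model(input_t0, s_t0), s_t2 = model(input_t1, s_t1), ...
    states.append(s)
```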
The goal of optimization is to ensure, through self-supervision, that the distance between `s_t0` and `s_t1` is small, as `s_t1` is supposed to be a transformed version of `s_t0`. In other words, I want `s_t1` to carry only the new information present in the new input. My intuition tells me that taking the norm of the weights and ensuring the norm does not go to zero (is this even possible?) would be one way to achieve this. However, I'm afraid this won't necessarily be the best thing to do, as it might not encourage the model to update the state vector with new information.
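To make the weight-norm idea concrete, this is roughly what I had in mind (hypothetical; the hinge form and the `floor` value are just guesses on my part):

```python
import torch

def weight_norm_penalty(model, floor=1.0):
    """Penalize parameters whose norm drops below `floor`, to discourage
    the trivial all-zero solution. I'm not sure this is the right approach."""
    penalty = torch.tensor(0.0)
    for p in model.parameters():
        # relu(floor - ||p||) is zero once the norm is above the floor,
        # and grows as the norm shrinks toward zero.
        penalty = penalty + torch.relu(floor - p.norm())
    return penalty

# Hypothetical usage with a stand-in model:
model = torch.nn.Linear(4, 4)
pen = weight_norm_penalty(model, floor=10.0)
```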
Currently, the way I train the model is by taking the absolute distance between the two states: `loss = torch.abs(s_t1 - s_t0).mean(dim=1)`. Then I call `loss.backward()` and `optimizer.step()`, which updates the weights. Note that the reason I use `abs()` is that the hidden states are produced after applying ReLU, so they only hold positive values. So what is the best way to achieve this and ensure the weights don't go to 0? Would I be able to somehow use mutual information for this?
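For reference, my current training step looks roughly like this (simplified; the model and data here are stand-ins, and the per-sample loss is reduced to a scalar so `backward()` can be called on it):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU())  # stand-in for my model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

s_t0 = torch.rand(4, 16)   # previous hidden state (non-negative, as it came from a ReLU)
s_t1 = model(s_t0)         # new hidden state

# Absolute distance between the two states; the final .mean() reduces the
# per-sample losses to a scalar for backward().
loss = torch.abs(s_t1 - s_t0).mean(dim=1).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()           # updates the weights
```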
However, I noticed that optimization quickly finds the trivial solution by setting the weights to 0. This causes both `s_t0` and `s_t1` to get smaller and smaller until their difference is 0, which satisfies the constraint but does not yield the behavior I expect. Is there a way to ensure the weights do not go to zero during optimization?