I am trying to train a neural network which takes an input (`input_t0`) and an initial hidden state (call it `s_t0`) and produces a new hidden state (`s_t1`) by transforming the input through a series of transformations (neural network layers). At the next time step, a transformed input (`input_t1`) and the hidden state from the previous time step (`s_t1`) are passed to the same model. This process repeats for a number of steps.
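Concretely, the recurrence looks roughly like this (a minimal sketch, not my real model — `StateUpdater`, the layer sizes, and the step count are placeholders):

```python
import torch
import torch.nn as nn

class StateUpdater(nn.Module):
    """Maps (input_t, s_t) -> s_t1; the final ReLU keeps the state non-negative."""
    def __init__(self, input_dim, state_dim):
        super().__init__()
        self.fc = nn.Linear(input_dim + state_dim, state_dim)

    def forward(self, input_t, s_t):
        return torch.relu(self.fc(torch.cat([input_t, s_t], dim=1)))

model = StateUpdater(input_dim=8, state_dim=16)
s_t = torch.zeros(4, 16)          # initial hidden state s_t0, batch of 4
for t in range(3):                # the same model is reused at every step
    input_t = torch.randn(4, 8)   # stand-in for the transformed input at step t
    s_t = model(input_t, s_t)     # s_t1 becomes the next step's hidden state
```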

The goal of optimization is to keep the distance between `s_t0` and `s_t1` small through self-supervision, since `s_t1` is supposed to be a transformed version of `s_t0`. In other words, I want `s_t1` to only carry the new information in the new input. My intuition is that taking the norm of the weights and ensuring the norm does not go to zero (is this even possible?) would be one way to achieve this. However, I'm afraid that won't necessarily be the best thing to do, as it might not encourage the model to update the state vector with new information.

Currently I train the model by taking the absolute distance between `s_t0` and `s_t1` via `loss = torch.abs(s_t1 - s_t0).mean(dim=1)`. Then I call `loss.backward()` and `optimizer.step()`, which updates the weights. Note that the reason I use `abs()` is that the hidden states are produced after applying ReLU, so they only hold non-negative values. So what is the best way to achieve this and ensure the weights don't go to 0? Could I somehow use mutual information for this?
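Put together, a single training step looks roughly like this (a sketch under simplifying assumptions — a one-layer stand-in for my model; note I reduce the per-example distances to a scalar before calling `backward()`):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the model: one linear layer + ReLU,
# so hidden states only hold non-negative values.
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

s_t0 = torch.rand(4, 16)       # previous hidden state (non-negative), batch of 4
s_t1 = model(s_t0)             # new hidden state

# Per-example L1 distance, then reduced to a scalar so backward() can run.
loss = torch.abs(s_t1 - s_t0).mean(dim=1).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()               # nothing here stops the weights shrinking toward 0
```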

However, I noticed that optimization quickly finds the trivial solution by setting the weights to 0. This causes both `s_t0` and `s_t1` to get smaller and smaller until their difference is 0, which satisfies the constraint but does not yield the behavior I expect. Is there a way to ensure the weights do not go to zero during optimization?
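To make the failure mode concrete, here is a toy check (hypothetical one-layer model, weights forced to zero by hand) showing that the all-zero solution drives every state, and hence the loss, to exactly zero:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU())

# The trivial solution the optimizer converges to: all weights and biases at 0.
with torch.no_grad():
    for p in model.parameters():
        p.zero_()

s_t0 = model(torch.rand(4, 16))   # relu(0) = 0 everywhere
s_t1 = model(s_t0)                # also all zeros
loss = torch.abs(s_t1 - s_t0).mean()
print(loss.item())                # 0.0 -- constraint satisfied, nothing learned
```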