While experimenting with different activation functions I came across something weird. I hand-crafted this model, and when I manually assign values to the weights and biases of my network I can get pretty much 0% error.
Then I train the network for many epochs (2500) and ask it to predict again. Since the original weights are almost perfect, I expected training to just shave off the small remaining error, but instead the network diverges on the training data.
I'm fairly sure my training routine is not the problem, as this specific network with this specific input/output is the only example I've found with this behavior. Even with the same model, if I swap the hidden activations from ReLU to Mish or SiLU, the network trains without issue.
test(47,)..Submodel(
(layers): Sequential(
(0): Linear(in_features=6, out_features=2, bias=True)
(1): ReLU()
(2): Linear(in_features=2, out_features=3, bias=True)
(3): ReLU()
(4): Linear(in_features=3, out_features=6, bias=True)
(output_activation): Mish()
(output_scale): ScaleAndShift(*tensor([0.0200, 0.0700, 0.0700, 0.0200, 0.0200, 0.0200], dtype=torch.float64)+tensor([-1., -3., -3., -1., -1., -1.], dtype=torch.float64))
)
)
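For reference, here is a minimal sketch of how the model above can be reconstructed in plain PyTorch. `ScaleAndShift` is my own module; the version below is an illustrative stand-in that just computes `x * scale + shift` with fixed buffers:

```python
import torch
import torch.nn as nn

class ScaleAndShift(nn.Module):
    """Illustrative stand-in: multiply by a fixed scale, add a fixed shift."""
    def __init__(self, scale, shift):
        super().__init__()
        self.register_buffer("scale", torch.as_tensor(scale, dtype=torch.float64))
        self.register_buffer("shift", torch.as_tensor(shift, dtype=torch.float64))

    def forward(self, x):
        return x * self.scale + self.shift

model = nn.Sequential(
    nn.Linear(6, 2),
    nn.ReLU(),
    nn.Linear(2, 3),
    nn.ReLU(),
    nn.Linear(3, 6),
    nn.Mish(),
    ScaleAndShift(
        scale=[0.02, 0.07, 0.07, 0.02, 0.02, 0.02],
        shift=[-1.0, -3.0, -3.0, -1.0, -1.0, -1.0],
    ),
).double()

# The three training inputs: all-ones, all-zeros, all-minus-ones.
x = torch.tensor([[1.0] * 6, [0.0] * 6, [-1.0] * 6], dtype=torch.float64)
print(model(x).shape)  # torch.Size([3, 6])
```

Note the bottleneck: the hidden layers have only 2 and 3 units, so losing even one unit costs a large fraction of the network's capacity.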
input
[[[1.00000 1.00000 1.00000 1.00000 1.00000 1.00000]]
[[0.00000 0.00000 0.00000 0.00000 0.00000 0.00000]]
[[-1.00000 -1.00000 -1.00000 -1.00000 -1.00000 -1.00000]]]
prediction:
[[2.00000 2.00500 2.00500 2.00000 2.00000 2.00000]
[1.00000 1.02500 1.02500 1.00000 1.00000 1.00000]
[0.00000 0.03100 0.03100 0.00000 0.00000 0.00000]]
score
0.9998082146798993
done test 2500( 2) : 0.0005 -> 0.6260: v 11.2685;t 12.9918 51
prediction:
[[1.47839 1.55670 1.55706 1.47860 1.47862 1.47843]
[1.47839 1.55670 1.55706 1.47860 1.47862 1.47843]
[0.06020 -0.15821 -0.15911 0.05984 0.05971 0.06023]]
score
0.7444793404118504
Can anyone guess what’s going on here?
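In case it helps with diagnosis: since swapping ReLU for Mish/SiLU fixes training, I've been checking whether the ReLU units go dead. This is a sketch (names are illustrative, `net` stands in for any `nn.Sequential`): a unit whose pre-activation is ≤ 0 for every training input gets zero gradient through ReLU and cannot recover.

```python
import torch
import torch.nn as nn

def dead_relu_units(layers, x):
    """Walk a Sequential; return {layer_index: [dead unit indices]} for each ReLU.
    A unit is "dead" if its pre-activation is <= 0 on every input row."""
    report = {}
    for i, layer in enumerate(layers):
        if isinstance(layer, nn.ReLU):
            dead = (x <= 0).all(dim=0)  # True where the unit never fires
            report[i] = dead.nonzero().flatten().tolist()
        x = layer(x)
    return report

# Toy example with the first hidden unit forced dead (zero weights, bias -1):
net = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 1))
with torch.no_grad():
    net[0].weight.zero_()
    net[0].bias.copy_(torch.tensor([-1.0, 1.0]))

print(dead_relu_units(net, torch.randn(8, 2)))  # {1: [0]}
```

With hidden widths of only 2 and 3, a couple of dead units would explain why the trained network collapses to nearly identical outputs for different inputs.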