With PyTorch (1.13.x), I’ve been trying to implement some activation functions from scratch, such as Mish and ELU, as custom activation functions.
However, the loss becomes NaN after about 17 epochs when I train the model.
- dataset: official MNIST dataset from each framework
- model architecture: a simple dense network (25 layers with 500 neurons each; a rough sketch of the whole setup follows this list)
- lr: 1e-3 (I’d rather not change this)
- batch_size: 128
- optimizer: Adam
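For reference, here is a rough sketch of that setup (the layer count, width, optimizer, learning rate, and batch size are as listed above; the transforms, output layer, and other details are just assumptions on my part):

    import torch as t
    import torch.nn as nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    def build_model(act):
        # 25 hidden dense layers with 500 neurons each, the custom activation after each one
        layers = [nn.Flatten()]
        in_features = 28 * 28
        for _ in range(25):
            layers += [nn.Linear(in_features, 500), act()]
            in_features = 500
        layers.append(nn.Linear(in_features, 10))  # 10 MNIST classes
        return nn.Sequential(*layers)

    train_loader = DataLoader(
        datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor()),
        batch_size=128, shuffle=True,
    )
    model = build_model(Mish_Implementaion)  # the Mish module is defined below
    optimizer = t.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()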
Torch code:
class Mish_Implementaion(nn.Module):
    def __init__(self):
        super(Mish_Implementaion, self).__init__()
        self.__name__ = 'Mish'

    def forward(self, x):
        # clamp: ~0 for very negative inputs, identity for large inputs,
        # otherwise the full Mish formula x * tanh(ln(1 + exp(x)))
        return t.where(x < -7, 0, t.where(x > 30, x, x * t.tanh(t.log(1 + t.exp(x)))))
I used:
t.autograd.set_detect_anomaly(True)
and got this error message: Function 'ExpBackward0' returned nan values in its 0th output.
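For context, that call sits just before the training loop, roughly like this (the loop itself is a generic sketch, not my exact script):

    t.autograd.set_detect_anomaly(True)  # makes autograd report which op produced the NaN

    for epoch in range(30):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()  # the ExpBackward0 anomaly fires here after ~17 epochs
            optimizer.step()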
I guess it’s because the exp() function overflows. But that is exactly why I used torch.where in the first place: to stop exp() from returning too large a value.
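To show what I mean by overflow (just a float32 sanity check, not taken from my training run):

    print(t.exp(t.tensor(89.0)))              # tensor(inf) -- float32 exp overflows around x ≈ 88.7
    print(t.log(1 + t.exp(t.tensor(89.0))))   # tensor(inf) -- so log(1 + exp(x)) blows up too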
I eventually want to add some trainable parameters to this activation, so getting this basic version to work is important to me.
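For example, something along these lines is what I have in mind (purely a hypothetical sketch; the name ParametricMish and the learnable scale beta are placeholders, not a settled design):

    class ParametricMish(nn.Module):
        def __init__(self):
            super().__init__()
            self.beta = nn.Parameter(t.tensor(1.0))  # hypothetical trainable parameter

        def forward(self, x):
            # same clamping idea as above, with a learnable scale inside the exp term
            return t.where(x < -7, t.zeros_like(x),
                           t.where(x > 30, x,
                                   x * t.tanh(t.log(1 + t.exp(self.beta * x)))))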
Any advice is really appreciated. Thanks in advance.