PyTorch Math Overflow: Hidden Values Doubling Every Iteration

Hi

I am trying to implement recurrence over an arbitrary layer:

Here is the forward function:

def forward(self, sequence, hidden):
  
    i = 0
    for x in sequence:

        if(i==0):
            hidden[i] = x # Initially it is the input
            print("Hidden", hidden[i].data)
        else:
            hidden[i] = hidden[i-1].clone() # otherwise the previous time-step's output
        
        for layer in self.highway_layers: # This is the recurrence over layers
            hidden[i] = layer(hidden[i].clone())
            print("Hidden", hidden[i].data)
           
        i = i +1
             
    return hidden, hidden # since for rnn, hidden and output are the same, if you don't want to do softmax

The hidden values soon explode (they roughly double at each step), and finally I get a math overflow error in the loss function.

Hidden
  84.0122   89.9400   81.6280  ...    87.6645   70.6815   59.7525
  89.6357   59.4279  -87.3051  ...    30.1372  -84.1233  -96.8704
 -75.7449   75.4689  -73.7856  ...    40.0317   51.0543   38.5197
             ...                ⋱                ...
   6.2067   50.8187  -55.4566  ...    19.7606   39.6369   46.3537
  91.9438   -2.5618  -94.3562  ...   -24.0399  -38.2352  -58.3377
  97.2880  -29.3549   50.7392  ...   -39.0659  -86.5748  -47.3998
[torch.FloatTensor of size 20x200]

Hidden
 168.0244  179.8800  163.2560  ...   175.3291  141.3631  119.5050
 179.2715  118.8557 -174.6103  ...    60.2745 -168.2467 -193.7408
-151.4898  150.9377 -147.5712  ...    80.0634  102.1087   77.0395
             ...                ⋱                ...
  12.4135  101.6373 -110.9131  ...    39.5212   79.2738   92.7073
 183.8876   -5.1236 -188.7125  ...   -48.0799  -76.4703 -116.6755
 194.5760  -58.7099  101.4783  ...   -78.1319 -173.1495  -94.7997

How can I fix this issue? Thanks for your help.

I think the problem may come from self.highway_layers

Thanks for the reply. I am still stuck on this issue, and the loss keeps overflowing.

self.highway_layers is a list of highway layers which are implemented as:

class HighwayLayer(nn.Module):

    def __init__(self, input_size, bias=-1):
        super(HighwayLayer, self).__init__()
        self.plain_layer = nn.Linear(input_size, input_size)
        self.transform_layer = nn.Linear(input_size, input_size)
        self.transform_layer.bias.data.fill_(bias)

    def forward(self, x): # Has to get a hidden state? No.
        plain_layer_output = nn.functional.relu(self.plain_layer(x)) # Wanted variable got tensor
        transform_layer_output = nn.functional.softmax(self.transform_layer(x))
        transform_value = torch.mul(plain_layer_output, transform_layer_output)
        carry_value = torch.mul((1 - transform_layer_output), x)
        return torch.add(carry_value, carry_value) # This returns the same size as input.

Please take a look; surely I am doing something wrong. I created the layers, checked that the dimensions match, and added a hidden variable for the recurrence, so it looks fine to me… but it is not working. Thanks.

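Is this what you intended in the last line of forward, return torch.add(carry_value, carry_value)?

transform_value is calculated but never used.

If transform_layer_output is close to 0, then carry_value is roughly x, so adding carry_value to itself makes your model roughly double its input.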
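To make the doubling concrete, here is a minimal scalar sketch in plain Python (no PyTorch). The function names and the choice gate = 0.0 are illustrative assumptions; the starting value is the first entry of the printed hidden state.

```python
def buggy_highway(x, plain, gate):
    """Scalar analogue of the posted forward(): `plain` is never used."""
    carry = (1 - gate) * x
    return carry + carry  # mirrors torch.add(carry_value, carry_value)


def highway(x, plain, gate):
    """Usual highway combination: gate * H(x) + (1 - gate) * x."""
    return gate * plain + (1 - gate) * x


x = 84.0122  # first entry of the first printed hidden state

# One buggy step with a (nearly) closed gate doubles the input.
one_step = buggy_highway(x, plain=0.0, gate=0.0)

# Ten buggy steps multiply the input by 2 ** 10.
value = x
for _ in range(10):
    value = buggy_highway(value, plain=0.0, gate=0.0)

# With the usual combination and the same closed gate, x passes through unchanged.
carried = highway(x, plain=0.0, gate=0.0)

print(one_step, value, carried)
```

The one-step result matches the first entry of the second printed hidden state (168.0244), which is consistent with the gate being close to 0 there.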

Thanks! Indeed, I never used transform_value.
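For reference, a corrected layer might look like the sketch below. It is not the poster's final code. It assumes the usual highway formulation, gate * H(x) + (1 - gate) * x, and it swaps the softmax of the original post for a sigmoid gate, which is the common choice for highway layers. The tensor shapes are taken from the printed output (20x200).

```python
import torch
import torch.nn as nn


class HighwayLayer(nn.Module):
    def __init__(self, input_size, bias=-1.0):
        super().__init__()
        self.plain_layer = nn.Linear(input_size, input_size)
        self.transform_layer = nn.Linear(input_size, input_size)
        # Negative gate bias: the layer initially favours carrying x through.
        nn.init.constant_(self.transform_layer.bias, bias)

    def forward(self, x):
        plain_output = torch.relu(self.plain_layer(x))
        # Assumption: element-wise sigmoid gate instead of the post's softmax.
        gate = torch.sigmoid(self.transform_layer(x))
        # Use the transform term and add it to the carry term exactly once.
        return gate * plain_output + (1 - gate) * x


torch.manual_seed(0)
layer = HighwayLayer(200)
x = torch.randn(20, 200)
with torch.no_grad():
    y = layer(x)
print(y.shape)
```

Because the gate lies between 0 and 1, each output element is a weighted average of relu(plain_layer(x)) and x, so the layer can no longer double its input when the gate is closed.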

Another issue I spotted with my language model: when I print the model summary in the training loop, I get:

LanguageModel (
  (drop): Dropout (p = 0.2), weights=(), parameters=0
  (encoder): Embedding(10000, 200), weights=((10000, 200),), parameters=2000000
  (rnn): RecurrentHighwayNetwork (
    (highway_layers): ModuleList (
      (0): HighwayLayer (
        (plain_layer): Linear (200 -> 200)
        (transform_layer): Linear (200 -> 200)
      )
      (1): HighwayLayer (
        (plain_layer): Linear (200 -> 200)
        (transform_layer): Linear (200 -> 200)
      )
    )
  ), weights=((200, 200), (200,), (200, 200), (200,), (200, 200), (200,), (200, 200), (200,)), parameters=160800
  (decoder): Linear (200 -> 10000), weights=((10000, 200), (10000,)), parameters=2010000
)

It is taking too long to get through even one epoch… Does my model have too many parameters? I am on an i7 CPU.

…In total there are about 4.2 million parameters (2,000,000 + 160,800 + 2,010,000)!
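The per-module counts in the summary can be reproduced with a few lines of arithmetic. This is only a sketch; the variable names are illustrative, and the sizes are read off the printed summary.

```python
vocab_size = 10000
hidden_size = 200
num_highway_layers = 2


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


embedding = vocab_size * hidden_size  # Embedding(10000, 200)
# Each highway layer holds two Linear(200 -> 200) modules.
highway = num_highway_layers * 2 * linear_params(hidden_size, hidden_size)
decoder = linear_params(hidden_size, vocab_size)  # Linear(200 -> 10000)

total = embedding + highway + decoder
print(embedding, highway, decoder, total)  # 2000000 160800 2010000 4170800
```

Most of the parameters sit in the embedding and the decoder, which scale with the vocabulary size; the recurrent highway part contributes only 160,800.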

That is pretty huge.

The models I work with on CPU are small recurrent models with < 100,000 parameters, and ~500,000 data samples of ~50 features. I get epochs ranging from 50s to 10 mins depending on the size of my model. That said, I have an old i5-2410M, so nothing will run particularly fast.

Thanks for the reply. My model is recurring over a layer instead of an RNN/LSTM cell, so that explains the abnormal number of parameters, and even my batches are taking 15 s each!

Also, I see my training loss fluctuating… after around 200 batches it gets stuck at some value and keeps fluctuating around it.

Do you know how a more typical RNN/LSTM model would perform on the dataset?

Well, I haven’t worked on language modeling before, so I can’t say. One more thing: I read that perplexity = 2 ^ cross_entropy_loss. My loss is around 7, but the perplexity is showing ~700 (according to the code used in the PyTorch examples word-level language model). They are using e instead of 2; I guess this is the standard.
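The two formulas agree as long as the base of the exponent matches the base of the logarithm used in the loss. PyTorch's cross-entropy loss uses the natural logarithm, so e ^ loss is the matching choice there. A small sketch in plain Python, using 700 purely as an illustrative perplexity:

```python
import math

perplexity = 700.0  # illustrative value, not a measured result

loss_nats = math.log(perplexity)     # loss measured with the natural log
loss_bits = loss_nats / math.log(2)  # the same loss expressed in bits

ppl_from_nats = math.exp(loss_nats)  # e ** loss recovers the perplexity
ppl_from_bits = 2 ** loss_bits       # 2 ** loss works only with a loss in bits
mixed_bases = 2 ** loss_nats         # mixing bases gives a much smaller number

print(round(loss_nats, 2), round(loss_bits, 2))
print(ppl_from_nats, ppl_from_bits, mixed_bases)
```

A perplexity of about 700 therefore corresponds to a natural-log loss of about 6.55, which fits a loss of "around 7".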