I have a model composed of 3 submodules, plus 1 module that combines the outputs of those 3 submodules.
The inputs to the 3 submodules are 1 vector of metadata and 2 time series.
I also train the model with varying sequence lengths.
def __init__(
    self,
    columns,
):
    super().__init__()
    self.columns = columns
    layers_n = [64, 64, 32, 24]
    # Metadata submodule: a stack of Linear + ReLU layers ending in 24 features.
    self.target_coin_meta_module = torch.nn.Sequential(
        torch.nn.Linear(len(columns), layers_n[0]),
        torch.nn.ReLU(),
        *map(
            lambda a: torch.nn.Sequential(
                torch.nn.Linear(layers_n[a - 1], layers_n[a]),
                torch.nn.ReLU(),
            ),
            range(1, len(layers_n)),
        )
    )
    # Time-series submodules: one 3-layer LSTM each, hidden size 24.
    self.target_market_module = modules.CoinMarketModule(
        [
            torch.nn.LSTM(input_size=3, hidden_size=24, bidirectional=False,
                          num_layers=3, dropout=0, batch_first=True),
        ],
    )
    self.marker_market_module = modules.CoinMarketModule(
        [
            torch.nn.LSTM(input_size=3, hidden_size=24, bidirectional=False,
                          num_layers=3, dropout=0, batch_first=True),
        ],
    )
    # Combination module: concatenated 3 x 24 features -> (mean, logvar).
    self.combination_module = torch.nn.Sequential(
        torch.nn.Linear(3 * 24, 64),
        torch.nn.ReLU(),
        torch.nn.Linear(64, 48),
        torch.nn.ReLU(),
        torch.nn.Linear(48, 24),
        torch.nn.ReLU(),
        torch.nn.Linear(24, 2)
    )
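The forward pass is not shown here, but essentially it concatenates the three 24-dimensional submodule outputs and feeds them into the combination module. A minimal sketch, assuming each submodule reduces its input to a (batch, 24) tensor (argument names are illustrative):

def forward(self, meta, target_series, marker_series):
    # Each submodule is assumed to produce a (batch, 24) feature vector.
    meta_features = self.target_coin_meta_module(meta)
    target_features = self.target_market_module(target_series)
    marker_features = self.marker_market_module(marker_series)
    # Concatenate to (batch, 3*24) and map to (mean, logvar).
    return self.combination_module(
        torch.cat([meta_features, target_features, marker_features], dim=1)
    )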
The output is the mean and log-variance of a normal distribution, and the comparison target is the next value of the first of the two time series.
The loss is therefore the negative log-likelihood of the comparison target under the predicted distribution, averaged over the samples in the batch.
def criterion(y_, y):
    return meth.nll(y_[:, 0:1], y_[:, 1:2], y).mean()
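Here meth.nll is my own helper that computes the per-sample Gaussian negative log-likelihood from mean and log-variance; roughly equivalent to this sketch (a simplification, not the exact implementation):

import math
import torch

def nll(mean, logvar, target):
    # Per-sample Gaussian negative log-likelihood, parameterized by mean and log-variance.
    return 0.5 * (math.log(2.0 * math.pi) + logvar + (target - mean) ** 2 / torch.exp(logvar))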
In two scenarios, the gradient of only the first submodule, the metadata module (target_coin_meta_module), becomes zero, while the gradients of the remaining modules stay non-zero.
I save and load the model using
torch.save(model.state_dict(), path)
and model.load_state_dict(torch.load(path))
respectively.
For some models, when I load the state dict and try to do one more training step, the gradients of target_coin_meta_module (the first module) alone turn to 0, while the gradients of all the other modules are non-zero.
This behavior only occurs randomly, with some parameter combinations of the model; sometimes it does not happen at all.
Even without loading, sometimes when I initialize the model and start training, the gradients are fine for the first few batches, but eventually the gradients of the first module turn to 0.
See grad_individual_sums below, which is the sum of the gradient tensor for each entry in model.parameters().
The first 8 entries correspond to the weights and biases of the first module's 4 layers; by batch 2048 they have all turned to 0, while the overall gradient norm has grown much larger, so it is clipped at the clipping value of 5.
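These numbers are gathered roughly as in the sketch below, after loss.backward() and gradient clipping (a paraphrase, not the exact logging code):

# Per-parameter gradient sums and the global gradient norm.
grad_individual_sums = [p.grad.sum().item() for p in model.parameters()]
grad_norm = torch.norm(
    torch.stack([p.grad.norm() for p in model.parameters()])
).item()
print("grad_individual_sums:", grad_individual_sums)
print("grad_norm:", grad_norm)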
batch 0 seq_length: 50 loss: 0.9648630619049072
grad_individual_sums: [0.6213738918304443, -0.01167448703199625, -0.5094274282455444, -0.01605949178338051, -0.37891966104507446, -0.017566990107297897, -0.3544050455093384, -0.07019680738449097, -0.000197781904716976, -0.0004778598668053746, -0.0031790072098374367, -0.0031790072098374367, 0.0005944988806731999, 0.0012100754538550973, 0.0053168125450611115, 0.0053168125450611115, -0.0025589149445295334, -0.010053854435682297, -0.01283634640276432, -0.01283634640276432, 2.7745198167394847e-05, -4.8533282097196206e-05, -0.0003413408121559769, -0.0003413408121559769, -0.0004314566031098366, 0.0022087236866354942, -0.0037272844929248095, -0.0037272844929248095, -4.305329639464617e-05, 0.000992199289612472, -0.002092389389872551, -0.002092389389872551, -0.7942986488342285, -0.1766081154346466, -1.192203402519226, -0.29307132959365845, 0.48015737533569336, 0.4495493173599243, 1.6458654403686523, 2.112175464630127]
grad_norm: 2.2348684465130813
real:
tensor([[-1.52052746216e-06],
[ 2.48936476055e-06],
[ 1.51747406926e-03],
[ 5.66016365722e-11]], device='cuda:0')
predicted:
tensor([[0.04864747077, 0.04633797705],
[0.04855943471, 0.04586257786],
[0.04877693951, 0.04655547068],
[0.04876956716, 0.04827462509]], device='cuda:0',
grad_fn=<SliceBackward0>)
batch 2048 seq_length: 11 loss: -6.313509941101074
grad_individual_sums: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -6.71602610964328e-05, -0.000827555253636092, -0.0003659526410046965, -0.0003659526410046965, -8.661030733492225e-05, 8.195849659387022e-06, -6.375402881531045e-05, -6.375402881531045e-05, 0.00018928945064544678, -0.00019270165648777038, -0.00013690421474166214, -0.00013690421474166214, -1.1378756964441905e-10, -3.2832090823831095e-08, -7.914705335565486e-09, -7.914705335565486e-09, -1.4030726447344932e-07, 1.0826558138887776e-07, -3.545755333789202e-08, -3.545755333789202e-08, -6.845851021353155e-05, -4.480106144910678e-05, 2.294111982337199e-05, 2.294111982337199e-05, 0.04194265231490135, 0.01245787926018238, 0.21846213936805725, 0.006534038111567497, 0.22305536270141602, 0.006229810416698456, -6.864626884460449, -0.43354177474975586]
grad_norm: 4.999999038149247
real:
tensor([[-6.69499931973e-05],
[-2.98962518573e-02],
[-3.94897535443e-03],
[ 4.08112823050e-12]], device='cuda:0')
predicted:
tensor([[-0.00891025737, -8.88826942444],
[-0.00890567526, -8.88757991791],
[-0.00889842585, -8.88649272919],
[ 0.02337865904, -4.03638601303]], device='cuda:0',
grad_fn=<SliceBackward0>)
I can’t come up with a good explanation for why this occurs.
Even when I turn off gradient clipping, the zeros still persist.
One thing to note is that the parameter values in the 2 LSTM modules seem to be almost an order of magnitude larger than the parameter values in the first module.
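(That comparison is based on a quick inspection along the lines of the sketch below; the exact check I ran was approximate.)

# Mean absolute value of each named parameter, to compare the metadata
# module against the two LSTM modules.
for name, p in model.named_parameters():
    print(name, p.detach().abs().mean().item())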