Two models with similar parameters/architectures but very different training speeds?

Hi

Consider these two models.

Model 1

RNNModel (
  (drop): Dropout (p = 0.2), weights=(), parameters=0
  (encoder): Embedding(10000, 200), weights=((10000, 200),), parameters=2000000
  (rnn): LSTM(200, 200, num_layers=2, dropout=0.2), weights=((800, 200), (800, 200), (800,), (800,), (800, 200), (800, 200), (800,), (800,)), parameters=643200
  (decoder): Linear (200 -> 10000), weights=((10000, 200), (10000,)), parameters=2010000
)

Its initial training log is:

10it [00:05, 1.69it/s]| epoch 1 | 10/ 1327 batches | lr 20.00 | ms/batch 654.76 | loss 9.57 | ppl 14373.84
20it [00:11, 1.75it/s]| epoch 1 | 20/ 1327 batches | lr 20.00 | ms/batch 574.20 | loss 7.41 | ppl 1653.78
30it [00:17, 1.67it/s]| epoch 1 | 30/ 1327 batches | lr 20.00 | ms/batch 598.62 | loss 7.23 | ppl 1374.03
40it [00:23, 1.79it/s]| epoch 1 | 40/ 1327 batches | lr 20.00 | ms/batch 557.17 | loss 7.18 | ppl 1312.40
50it [00:29, 1.72it/s]| epoch 1 | 50/ 1327 batches | lr 20.00 | ms/batch 580.40 | loss 6.94 | ppl 1029.27
60it [00:35, 1.67it/s]| epoch 1 | 60/ 1327 batches | lr 20.00 | ms/batch 615.28 | loss 7.00 | ppl 1097.51
70it [00:41, 1.74it/s]| epoch 1 | 70/ 1327 batches | lr 20.00 | ms/batch 584.67 | loss 6.83 | ppl 924.26

And now model 2:

LanguageModel (
  (drop): Dropout (p = 0.2), weights=(), parameters=0
  (encoder): Embedding(10000, 200), weights=((10000, 200),), parameters=2000000
  (rnn): RecurrentHighwayNetwork (
    (highway_layers): ModuleList (
      (0): HighwayLayer (
        (plain_layer): Linear (200 -> 200)
        (transform_layer): Linear (200 -> 200)
      )
      (1): HighwayLayer (
        (plain_layer): Linear (200 -> 200)
        (transform_layer): Linear (200 -> 200)
      )
    )
  ), weights=((200, 200), (200,), (200, 200), (200,), (200, 200), (200,), (200, 200), (200,)), parameters=160800
  (decoder): Linear (200 -> 10000), weights=((10000, 200), (10000,)), parameters=2010000
)

whose training log is:

10it [00:08, 1.02it/s]| epoch 1 | 10/ 1327 batches | lr 20.00 | ms/batch 956.93 | loss 8.98 | ppl 7922.88
20it [00:23, 1.66s/it]| epoch 1 | 20/ 1327 batches | lr 20.00 | ms/batch 1598.08 | loss 7.35 | ppl 1549.42
30it [00:45, 2.38s/it]| epoch 1 | 30/ 1327 batches | lr 20.00 | ms/batch 2300.54 | loss 7.12 | ppl 1237.13
40it [01:15, 3.05s/it]| epoch 1 | 40/ 1327 batches | lr 20.00 | ms/batch 2978.19 | loss 7.10 | ppl 1216.49
50it [01:51, 3.81s/it]| epoch 1 | 50/ 1327 batches | lr 20.00 | ms/batch 3743.11 | loss 6.95 | ppl 1046.57
60it [02:35, 4.53s/it]| epoch 1 | 60/ 1327 batches | lr 20.00 | ms/batch 4426.49 | loss 6.92 | ppl 1010.45

  1. Why is there such a big difference in ms/batch? The two models have (almost) the same number of parameters.

  2. Why is ms/batch increasing in the second model? The operations are the same in every batch, so why does the time per batch grow?

Model 2 is training far too slowly despite being just an extension of model 1.

Are you sure you are resetting the timer? I would also check the input's size and the model's parameter count after each batch, to be sure nothing is growing.
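
Roughly what I mean, as a minimal sketch assuming a training loop like the one in the word_language_model example (`train_batches`, `log_interval`, and `model` are placeholders here, not your actual code):

```python
import time

start = time.time()
for batch, (data, targets) in enumerate(train_batches):  # placeholder iterable
    # ... forward / backward / optimizer step ...

    if batch % log_interval == 0 and batch > 0:
        elapsed = time.time() - start
        print('ms/batch {:8.2f}'.format(elapsed * 1000 / log_interval))
        start = time.time()  # reset the timer here, otherwise ms/batch grows by construction

        # Sanity checks: none of these should change from batch to batch.
        print('input size:', data.size())
        print('parameter count:', sum(p.numel() for p in model.parameters()))
```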

My guess is that the RNNModel hidden state is detached, repackaged, or reinitialised in between batches, whereas the RecurrentHighwayNetwork’s hidden state is simply kept from one batch to the next, in which case you are backpropagating all the way back to the beginning of the first batch.
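
For reference, the word_language_model example truncates backprop with a small helper along these lines (just a sketch; in pre-0.4 PyTorch the same effect is achieved by wrapping `h.data` in a fresh Variable):

```python
import torch

def repackage_hidden(h):
    """Detach the hidden state from the graph it was produced in,
    so that backprop stops at the current batch boundary."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    # LSTM states are (h, c) tuples: every element has to be detached.
    return tuple(repackage_hidden(v) for v in h)

# Called once per batch, before the forward pass:
# hidden = repackage_hidden(hidden)
```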

No, in my model too I create a new hidden variable from the previous hidden state's data on every batch.

But just look at my training logs:

| epoch 1 | 10/ 1327 batches | lr 20.00 | ms/batch 4014.78 | loss 8.95 | ppl 7732.07
| epoch 1 | 20/ 1327 batches | lr 20.00 | ms/batch 3434.41 | loss 7.36 | ppl 1565.54
| epoch 1 | 30/ 1327 batches | lr 20.00 | ms/batch 3776.96 | loss 7.18 | ppl 1307.52
| epoch 1 | 40/ 1327 batches | lr 20.00 | ms/batch 3311.54 | loss 7.08 | ppl 1182.57
| epoch 1 | 50/ 1327 batches | lr 20.00 | ms/batch 3227.49 | loss 6.95 | ppl 1046.28
| epoch 1 | 60/ 1327 batches | lr 20.00 | ms/batch 3226.90 | loss 6.95 | ppl 1042.19
| epoch 1 | 70/ 1327 batches | lr 20.00 | ms/batch 2971.59 | loss 6.81 | ppl 904.40
| epoch 1 | 80/ 1327 batches | lr 20.00 | ms/batch 3106.82 | loss 6.87 | ppl 962.97
| epoch 1 | 90/ 1327 batches | lr 20.00 | ms/batch 3352.58 | loss 6.78 | ppl 883.79
| epoch 1 | 100/ 1327 batches | lr 20.00 | ms/batch 3032.48 | loss 6.77 | ppl 870.15
| epoch 1 | 110/ 1327 batches | lr 20.00 | ms/batch 2901.16 | loss 6.78 | ppl 878.80
| epoch 1 | 120/ 1327 batches | lr 20.00 | ms/batch 3002.09 | loss 6.72 | ppl 831.85
| epoch 1 | 130/ 1327 batches | lr 20.00 | ms/batch 2947.38 | loss 6.74 | ppl 849.17
| epoch 1 | 140/ 1327 batches | lr 20.00 | ms/batch 2962.97 | loss 6.77 | ppl 873.62
| epoch 1 | 150/ 1327 batches | lr 20.00 | ms/batch 3008.20 | loss 6.74 | ppl 844.83
| epoch 1 | 160/ 1327 batches | lr 20.00 | ms/batch 2940.90 | loss 6.71 | ppl 823.43
| epoch 1 | 170/ 1327 batches | lr 20.00 | ms/batch 3062.02 | loss 6.77 | ppl 868.79
| epoch 1 | 180/ 1327 batches | lr 20.00 | ms/batch 3022.68 | loss 6.78 | ppl 881.30
| epoch 1 | 190/ 1327 batches | lr 20.00 | ms/batch 2979.49 | loss 6.65 | ppl 776.37
| epoch 1 | 200/ 1327 batches | lr 20.00 | ms/batch 2969.84 | loss 6.66 | ppl 782.79
| epoch 1 | 210/ 1327 batches | lr 20.00 | ms/batch 2922.02 | loss 6.73 | ppl 840.52
| epoch 1 | 220/ 1327 batches | lr 20.00 | ms/batch 3018.08 | loss 6.64 | ppl 766.08
| epoch 1 | 230/ 1327 batches | lr 20.00 | ms/batch 3340.55 | loss 6.66 | ppl 776.91
| epoch 1 | 240/ 1327 batches | lr 20.00 | ms/batch 2847.63 | loss 6.65 | ppl 773.44
| epoch 1 | 250/ 1327 batches | lr 20.00 | ms/batch 2953.34 | loss 6.71 | ppl 818.55
| epoch 1 | 260/ 1327 batches | lr 20.00 | ms/batch 2964.03 | loss 6.68 | ppl 799.16
| epoch 1 | 270/ 1327 batches | lr 20.00 | ms/batch 2933.35 | loss 6.70 | ppl 813.85
| epoch 1 | 280/ 1327 batches | lr 20.00 | ms/batch 2907.14 | loss 6.68 | ppl 794.25
| epoch 1 | 290/ 1327 batches | lr 20.00 | ms/batch 3008.06 | loss 6.67 | ppl 790.62
| epoch 1 | 300/ 1327 batches | lr 20.00 | ms/batch 2966.33 | loss 6.59 | ppl 725.52
| epoch 1 | 310/ 1327 batches | lr 20.00 | ms/batch 2951.73 | loss 6.57 | ppl 712.75
| epoch 1 | 320/ 1327 batches | lr 20.00 | ms/batch 2856.64 | loss 6.51 | ppl 672.87
| epoch 1 | 330/ 1327 batches | lr 20.00 | ms/batch 3065.96 | loss 6.63 | ppl 760.79
| epoch 1 | 340/ 1327 batches | lr 20.00 | ms/batch 3290.83 | loss 6.70 | ppl 813.69
| epoch 1 | 350/ 1327 batches | lr 20.00 | ms/batch 3535.03 | loss 6.76 | ppl 859.84
| epoch 1 | 360/ 1327 batches | lr 20.00 | ms/batch 4087.89 | loss 6.65 | ppl 774.28
| epoch 1 | 370/ 1327 batches | lr 20.00 | ms/batch 5108.70 | loss 6.66 | ppl 778.76
| epoch 1 | 380/ 1327 batches | lr 20.00 | ms/batch 6547.46 | loss 6.65 | ppl 773.92
| epoch 1 | 390/ 1327 batches | lr 20.00 | ms/batch 7921.22 | loss 6.65 | ppl 774.29
| epoch 1 | 400/ 1327 batches | lr 20.00 | ms/batch 9174.89 | loss 6.63 | ppl 754.19
| epoch 1 | 410/ 1327 batches | lr 20.00 | ms/batch 10519.90 | loss 6.66 | ppl 777.76
| epoch 1 | 420/ 1327 batches | lr 20.00 | ms/batch 11718.43 | loss 6.66 | ppl 781.59
| epoch 1 | 430/ 1327 batches | lr 20.00 | ms/batch 12073.58 | loss 6.58 | ppl 721.32
| epoch 1 | 440/ 1327 batches | lr 20.00 | ms/batch 12529.14 | loss 6.64 | ppl 766.29
| epoch 1 | 450/ 1327 batches | lr 20.00 | ms/batch 14175.94 | loss 6.63 | ppl 755.54
| epoch 1 | 460/ 1327 batches | lr 20.00 | ms/batch 14835.17 | loss 6.63 | ppl 758.95
| epoch 1 | 470/ 1327 batches | lr 20.00 | ms/batch 16415.23 | loss 6.63 | ppl 757.16
| epoch 1 | 480/ 1327 batches | lr 20.00 | ms/batch 17147.11 | loss 6.55 | ppl 700.03

ms/batch just keeps increasing and increasing…!!

The model is the same, same parameters, same everything. Really slow training.

Nothing is increasing… I had checked that earlier. Same input size, same output size… same hidden variable size.

A 4-fold increase in ms/batch… despite repackaging the hidden variable.

Can you post a minimal code example?
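
In the meantime, one way to double-check that the graph really is being cut at each batch (a sketch only; `repackage_hidden`, `model`, `criterion`, `data`, `targets`, and `ntokens` stand in for whatever your code uses):

```python
import time

# 1. After re-wrapping, the hidden state should carry no history.
hidden = repackage_hidden(hidden)  # or however you re-create it from .data
states = hidden if isinstance(hidden, tuple) else (hidden,)
assert all(h.grad_fn is None for h in states), 'hidden state still holds a graph'

# 2. Time forward and backward separately: if the graph is growing across
#    batches, it is backward() that gets slower and slower.
t0 = time.time()
output, hidden = model(data, hidden)
loss = criterion(output.view(-1, ntokens), targets)
t_fwd = time.time() - t0

t0 = time.time()
loss.backward()
t_bwd = time.time() - t0
print('forward {:.1f} ms | backward {:.1f} ms'.format(t_fwd * 1000, t_bwd * 1000))
```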