Hi,
Consider the following two models.
Model 1:
RNNModel (
(drop): Dropout (p = 0.2), weights=(), parameters=0
(encoder): Embedding(10000, 200), weights=((10000, 200),), parameters=2000000
(rnn): LSTM(200, 200, num_layers=2, dropout=0.2), weights=((800, 200), (800, 200), (800,), (800,), (800, 200), (800, 200), (800,), (800,)), parameters=643200
(decoder): Linear (200 -> 10000), weights=((10000, 200), (10000,)), parameters=2010000
)
Its initial training log is:
10it [00:05, 1.69it/s]| epoch 1 | 10/ 1327 batches | lr 20.00 | ms/batch 654.76 | loss 9.57 | ppl 14373.84
20it [00:11, 1.75it/s]| epoch 1 | 20/ 1327 batches | lr 20.00 | ms/batch 574.20 | loss 7.41 | ppl 1653.78
30it [00:17, 1.67it/s]| epoch 1 | 30/ 1327 batches | lr 20.00 | ms/batch 598.62 | loss 7.23 | ppl 1374.03
40it [00:23, 1.79it/s]| epoch 1 | 40/ 1327 batches | lr 20.00 | ms/batch 557.17 | loss 7.18 | ppl 1312.40
50it [00:29, 1.72it/s]| epoch 1 | 50/ 1327 batches | lr 20.00 | ms/batch 580.40 | loss 6.94 | ppl 1029.27
60it [00:35, 1.67it/s]| epoch 1 | 60/ 1327 batches | lr 20.00 | ms/batch 615.28 | loss 7.00 | ppl 1097.51
70it [00:41, 1.74it/s]| epoch 1 | 70/ 1327 batches | lr 20.00 | ms/batch 584.67 | loss 6.83 | ppl 924.26
And now model 2:
LanguageModel (
(drop): Dropout (p = 0.2), weights=(), parameters=0
(encoder): Embedding(10000, 200), weights=((10000, 200),), parameters=2000000
(rnn): RecurrentHighwayNetwork (
(highway_layers): ModuleList (
(0): HighwayLayer (
(plain_layer): Linear (200 -> 200)
(transform_layer): Linear (200 -> 200)
)
(1): HighwayLayer (
(plain_layer): Linear (200 -> 200)
(transform_layer): Linear (200 -> 200)
)
)
), weights=((200, 200), (200,), (200, 200), (200,), (200, 200), (200,), (200, 200), (200,)), parameters=160800
(decoder): Linear (200 -> 10000), weights=((10000, 200), (10000,)), parameters=2010000
)
whose training log is:
10it [00:08, 1.02it/s]| epoch 1 | 10/ 1327 batches | lr 20.00 | ms/batch 956.93 | loss 8.98 | ppl 7922.88
20it [00:23, 1.66s/it]| epoch 1 | 20/ 1327 batches | lr 20.00 | ms/batch 1598.08 | loss 7.35 | ppl 1549.42
30it [00:45, 2.38s/it]| epoch 1 | 30/ 1327 batches | lr 20.00 | ms/batch 2300.54 | loss 7.12 | ppl 1237.13
40it [01:15, 3.05s/it]| epoch 1 | 40/ 1327 batches | lr 20.00 | ms/batch 2978.19 | loss 7.10 | ppl 1216.49
50it [01:51, 3.81s/it]| epoch 1 | 50/ 1327 batches | lr 20.00 | ms/batch 3743.11 | loss 6.95 | ppl 1046.57
60it [02:35, 4.53s/it]| epoch 1 | 60/ 1327 batches | lr 20.00 | ms/batch 4426.49 | loss 6.92 | ppl 1010.45
-
Why is there such a large difference in milliseconds per batch? The two models have almost the same total number of parameters (about 4.65M for model 1 vs. 4.17M for model 2).
-
Why does ms/batch keep increasing for the second model? The operations performed per batch are identical from one batch to the next, so I would expect the time per batch to stay roughly constant.
Model 2 trains far too slowly despite being just an extension of model 1.
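For context, here is roughly what each HighwayLayer computes, reconstructed as a minimal sketch from the module names in the printout above (the tanh/sigmoid nonlinearities are the standard highway-network choice and are my assumption; my actual layer may differ in details such as dropout or gate bias initialization):

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """One highway layer: y = T(x) * H(x) + (1 - T(x)) * x,
    using the plain_layer / transform_layer names shown in the printout."""
    def __init__(self, size):
        super().__init__()
        self.plain_layer = nn.Linear(size, size)      # H(x): candidate transform
        self.transform_layer = nn.Linear(size, size)  # T(x): gate

    def forward(self, x):
        h = torch.tanh(self.plain_layer(x))           # candidate activation
        t = torch.sigmoid(self.transform_layer(x))    # gate in (0, 1)
        return t * h + (1 - t) * x                    # gated mix with the input
```

The recurrent wrapper applies the two stacked layers at every time step, so per step this is four 200x200 matmuls versus the LSTM's fused cuDNN kernel.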