I am trying to train a ResNet-18 on ImageNet, using the example provided here.
When running:
python imagenet_torch.py \
/path/to/imagenet \
-j 4 \
-b 128 \
-p 1
I get:
=> creating model 'resnet18'
Epoch: [0][ 1/10010] Time 5.000 ( 5.000) Data 2.659 ( 2.659) Loss 6.9674e+00 (6.9674e+00) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.00)
Epoch: [0][ 2/10010] Time 0.133 ( 2.566) Data 0.083 ( 1.371) Loss 7.0575e+00 (7.0125e+00) Acc@1 0.00 ( 0.00) Acc@5 0.78 ( 0.39)
Epoch: [0][ 3/10010] Time 0.138 ( 1.757) Data 0.090 ( 0.944) Loss 7.0604e+00 (7.0285e+00) Acc@1 0.00 ( 0.00) Acc@5 0.78 ( 0.52)
Epoch: [0][ 4/10010] Time 0.132 ( 1.351) Data 0.085 ( 0.729) Loss 7.1079e+00 (7.0483e+00) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.39)
Epoch: [0][ 5/10010] Time 0.129 ( 1.106) Data 0.081 ( 0.600) Loss 7.1516e+00 (7.0690e+00) Acc@1 0.00 ( 0.00) Acc@5 0.78 ( 0.47)
Epoch: [0][ 6/10010] Time 0.128 ( 0.943) Data 0.081 ( 0.513) Loss 7.2192e+00 (7.0940e+00) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.39)
Epoch: [0][ 7/10010] Time 0.128 ( 0.827) Data 0.081 ( 0.451) Loss 7.1660e+00 (7.1043e+00) Acc@1 0.00 ( 0.00) Acc@5 0.78 ( 0.45)
Epoch: [0][ 8/10010] Time 0.128 ( 0.740) Data 0.081 ( 0.405) Loss 7.1355e+00 (7.1082e+00) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.39)
Epoch: [0][ 9/10010] Time 0.292 ( 0.690) Data 0.244 ( 0.387) Loss 7.2154e+00 (7.1201e+00) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.35)
Epoch: [0][ 10/10010] Time 0.799 ( 0.701) Data 0.752 ( 0.424) Loss 7.2732e+00 (7.1354e+00) Acc@1 0.00 ( 0.00) Acc@5 1.56 ( 0.47)
Epoch: [0][ 11/10010] Time 0.128 ( 0.649) Data 0.081 ( 0.392) Loss 7.2337e+00 (7.1443e+00) Acc@1 0.00 ( 0.00) Acc@5 0.78 ( 0.50)
Epoch: [0][ 12/10010] Time 0.200 ( 0.611) Data 0.153 ( 0.372) Loss 7.4204e+00 (7.1673e+00) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.46)
Epoch: [0][ 13/10010] Time 0.770 ( 0.623) Data 0.723 ( 0.399) Loss 7.2316e+00 (7.1723e+00) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.42)
Epoch: [0][ 14/10010] Time 0.918 ( 0.644) Data 0.871 ( 0.433) Loss 7.3226e+00 (7.1830e+00) Acc@1 0.78 ( 0.06) Acc@5 2.34 ( 0.56)
Epoch: [0][ 15/10010] Time 0.138 ( 0.611) Data 0.090 ( 0.410) Loss 7.3067e+00 (7.1913e+00) Acc@1 0.78 ( 0.10) Acc@5 1.56 ( 0.63)
Epoch: [0][ 16/10010] Time 0.128 ( 0.581) Data 0.081 ( 0.390) Loss 7.2077e+00 (7.1923e+00) Acc@1 0.00 ( 0.10) Acc@5 1.56 ( 0.68)
Epoch: [0][ 17/10010] Time 0.870 ( 0.598) Data 0.823 ( 0.415) Loss 7.2863e+00 (7.1978e+00) Acc@1 0.00 ( 0.09) Acc@5 0.78 ( 0.69)
Epoch: [0][ 18/10010] Time 1.377 ( 0.641) Data 1.329 ( 0.466) Loss 7.2275e+00 (7.1995e+00) Acc@1 0.00 ( 0.09) Acc@5 0.00 ( 0.65)
Epoch: [0][ 19/10010] Time 0.128 ( 0.614) Data 0.081 ( 0.446) Loss 7.2623e+00 (7.2028e+00) Acc@1 0.78 ( 0.12) Acc@5 1.56 ( 0.70)
Epoch: [0][ 20/10010] Time 0.129 ( 0.590) Data 0.081 ( 0.427) Loss 7.2357e+00 (7.2044e+00) Acc@1 0.78 ( 0.16) Acc@5 1.56 ( 0.74)
Epoch: [0][ 21/10010] Time 0.244 ( 0.573) Data 0.197 ( 0.416) Loss 7.1773e+00 (7.2031e+00) Acc@1 0.78 ( 0.19) Acc@5 1.56 ( 0.78)
So it takes at least 0.12 s to run a batch of 128 (therefore at least 20 minutes for a single epoch, and roughly 30 hours to train the model), and a large part of that time is spent waiting for the next batch. Increasing the number of workers from 4 to any higher value does not change the problem, despite my machine having multiple CPUs available.
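For reference, this is roughly how I understand the example builds the training loader (if I am reading it correctly, -j maps to num_workers and -b to batch_size; the path below is a placeholder). Changing num_workers here is the knob that makes no difference for me:

import torch
import torchvision.datasets as datasets
import torchvision.transforms as transforms

# Standard ImageNet preprocessing, as used in the example (as far as I can tell).
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
train_dataset = datasets.ImageFolder(
    "/path/to/imagenet/train",  # placeholder path
    transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        normalize,
    ]))
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=128, shuffle=True,
    num_workers=4, pin_memory=True)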
I have noticed this behaviour on 2 different clusters.
While 30 hours does not necessarily sound like a lot, it is still a long time for such a small model.
Moreover, most of it comes from being data-bound, so surely something must be wrong (see the sketch at the end of this post).
It is also way less than what this other user observed, or at least I think so.
Does anyone know if this is normal?
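In case it helps narrow things down, this is the kind of check I have in mind to confirm the loader is the bottleneck: iterate over the train_loader from the snippet above without any model at all and time it (the 100-batch cutoff is arbitrary, just enough to average over).

import time

# Pull batches from the data pipeline alone, no forward/backward pass,
# to see how fast the loader can actually produce data.
start = time.time()
for i, (images, target) in enumerate(train_loader):
    if i == 100:
        break
print(f"loader alone: {(time.time() - start) / 100:.3f} s per batch")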