Precision doesn't improve when training on custom dataset

I'm using the PyTorch ImageNet example on a custom dataset like this:
python main.py --arch=alexnet dataset/
My dataset has nearly 300 categories and about 12,000 images in total, organized into train and val directories.
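If it matters, main.py reads these directories with torchvision.datasets.ImageFolder, so there is one subfolder per class under train/ and val/. A rough sketch of that loading step (simplified, without the crop/flip/normalize transforms the real script adds):

import os
import torchvision.datasets as datasets

# dataset/train/<class_name>/<image>.jpg and dataset/val/<class_name>/<image>.jpg
traindir = os.path.join('dataset', 'train')
valdir = os.path.join('dataset', 'val')

train_dataset = datasets.ImageFolder(traindir)
val_dataset = datasets.ImageFolder(valdir)
print(len(train_dataset.classes), len(train_dataset) + len(val_dataset))  # ~300 classes, ~12000 images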
Portions of the training output are shown below. You can see that the top-1 and top-5 precision barely change or improve, staying at roughly Prec@1 0.333 and Prec@5 1.667.
I'm wondering why this happens.

=> creating model 'alexnet'
Epoch: [0][0/29]	Time 11.987 (11.987)	Data 10.121 (10.121)	Loss 6.9067 (6.9067)	Prec@1 0.391 (0.391)	Prec@5 0.781 (0.781)
Epoch: [0][10/29]	Time 0.336 (2.764)	Data 0.266 (2.488)	Loss 6.8902 (6.9003)	Prec@1 0.000 (0.178)	Prec@5 1.172 (1.598)
Epoch: [0][20/29]	Time 7.898 (2.771)	Data 7.827 (2.578)	Loss 6.8422 (6.8640)	Prec@1 0.000 (0.186)	Prec@5 1.953 (1.581)
Test: [0/10]	Time 4.700 (4.700)	Loss 6.7994 (6.7994)	Prec@1 0.000 (0.000)	Prec@5 3.125 (3.125)
 * Prec@1 0.333 Prec@5 1.625
Epoch: [1][0/29]	Time 2.927 (2.927)	Data 2.847 (2.847)	Loss 6.8089 (6.8089)	Prec@1 0.000 (0.000)	Prec@5 2.734 (2.734)
Epoch: [1][10/29]	Time 0.192 (0.822)	Data 0.025 (0.681)	Loss 6.7899 (6.7945)	Prec@1 0.391 (0.320)	Prec@5 1.953 (1.776)
Epoch: [1][20/29]	Time 2.253 (0.824)	Data 2.183 (0.689)	Loss 6.4336 (6.7144)	Prec@1 0.391 (0.316)	Prec@5 3.516 (1.730)
Test: [0/10]	Time 3.146 (3.146)	Loss 6.0892 (6.0892)	Prec@1 0.000 (0.000)	Prec@5 0.000 (0.000)
 * Prec@1 0.333 Prec@5 1.667
Epoch: [2][0/29]	Time 3.009 (3.009)	Data 2.920 (2.920)	Loss 6.0913 (6.0913)	Prec@1 0.391 (0.391)	Prec@5 1.953 (1.953)
Epoch: [2][10/29]	Time 0.189 (0.836)	Data 0.000 (0.681)	Loss 6.0209 (6.0952)	Prec@1 0.391 (0.320)	Prec@5 0.391 (1.562)
Epoch: [2][20/29]	Time 2.251 (0.822)	Data 2.181 (0.680)	Loss 5.9183 (6.0205)	Prec@1 0.000 (0.223)	Prec@5 0.781 (1.302)
Test: [0/10]	Time 3.046 (3.046)	Loss 5.9031 (5.9031)	Prec@1 0.000 (0.000)	Prec@5 0.000 (0.000)
 * Prec@1 0.333 Prec@5 1.667
...
Epoch: [46][0/29]	Time 2.996 (2.996)	Data 2.915 (2.915)	Loss 5.7088 (5.7088)	Prec@1 0.000 (0.000)	Prec@5 0.781 (0.781)
Epoch: [46][10/29]	Time 0.188 (0.844)	Data 0.000 (0.696)	Loss 5.7168 (5.7085)	Prec@1 0.000 (0.178)	Prec@5 1.562 (1.705)
Epoch: [46][20/29]	Time 2.090 (0.828)	Data 2.011 (0.685)	Loss 5.7267 (5.7122)	Prec@1 0.000 (0.205)	Prec@5 0.781 (1.562)
Test: [0/10]	Time 3.080 (3.080)	Loss 5.7117 (5.7117)	Prec@1 0.000 (0.000)	Prec@5 0.000 (0.000)
 * Prec@1 0.333 Prec@5 1.667
Epoch: [47][0/29]	Time 2.943 (2.943)	Data 2.852 (2.852)	Loss 5.7018 (5.7018)	Prec@1 0.781 (0.781)	Prec@5 3.125 (3.125)
Epoch: [47][10/29]	Time 0.196 (0.852)	Data 0.000 (0.701)	Loss 5.7113 (5.7091)	Prec@1 0.781 (0.355)	Prec@5 1.953 (1.953)
Epoch: [47][20/29]	Time 2.221 (0.838)	Data 2.136 (0.695)	Loss 5.7153 (5.7120)	Prec@1 0.000 (0.260)	Prec@5 1.562 (1.656)
Test: [0/10]	Time 3.071 (3.071)	Loss 5.7107 (5.7107)	Prec@1 0.000 (0.000)	Prec@5 0.000 (0.000)
 * Prec@1 0.333 Prec@5 1.667
Epoch: [48][0/29]	Time 3.054 (3.054)	Data 2.978 (2.978)	Loss 5.7045 (5.7045)	Prec@1 0.391 (0.391)	Prec@5 2.344 (2.344)
Epoch: [48][10/29]	Time 0.182 (0.837)	Data 0.000 (0.689)	Loss 5.7104 (5.7084)	Prec@1 0.391 (0.249)	Prec@5 2.734 (1.847)
Epoch: [48][20/29]	Time 1.824 (0.819)	Data 1.753 (0.700)	Loss 5.7171 (5.7120)	Prec@1 0.391 (0.242)	Prec@5 0.781 (1.488)
Test: [0/10]	Time 3.084 (3.084)	Loss 5.7100 (5.7100)	Prec@1 0.000 (0.000)	Prec@5 3.125 (3.125)
 * Prec@1 0.333 Prec@5 1.667
Epoch: [49][0/29]	Time 3.213 (3.213)	Data 3.137 (3.137)	Loss 5.7120 (5.7120)	Prec@1 0.781 (0.781)	Prec@5 1.172 (1.172)
Epoch: [49][10/29]	Time 0.182 (0.869)	Data 0.000 (0.713)	Loss 5.7154 (5.7094)	Prec@1 0.000 (0.426)	Prec@5 0.781 (1.456)
Epoch: [49][20/29]	Time 2.013 (0.829)	Data 1.931 (0.696)	Loss 5.7096 (5.7113)	Prec@1 0.781 (0.316)	Prec@5 2.734 (1.376)
Test: [0/10]	Time 3.072 (3.072)	Loss 5.7060 (5.7060)	Prec@1 0.000 (0.000)	Prec@5 0.000 (0.000)
 * Prec@1 0.333 Prec@5 1.667

Lack of convergence can be caused by a lot of things. The optimization isn't guaranteed to succeed, and it's honestly a bit of a miracle that neural networks work as well as they do. We can't help much with that part, but I suspect there is a problem with your dataset or with the transforms applied to it.
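One quick check (just a sketch; the path and batch size are guesses) is to push a single batch through the same kind of pipeline the example uses and look at the label range and pixel statistics:

import torch
import torchvision.datasets as datasets
import torchvision.transforms as transforms

# Standard ImageNet-style preprocessing, as in the example script.
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
train_dataset = datasets.ImageFolder(
    'dataset/train',
    transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        normalize,
    ]))
loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)

images, targets = next(iter(loader))
print(images.shape, images.mean().item(), images.std().item())                 # expect roughly zero mean, ~unit std
print(targets.min().item(), targets.max().item(), len(train_dataset.classes))  # labels must lie in [0, num_classes)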

I can train this dataset with resnet18:
python main.py --arch=resnet18 --batch-size=128 dataset
The output after 90 epochs:

Epoch: [89][0/57]	Time 1.645 (1.645)	Data 1.525 (1.525)	Loss 0.3686 (0.3686)	Prec@1 92.969 (92.969)	Prec@5 97.656 (97.656)
Epoch: [89][10/57]	Time 0.364 (0.528)	Data 0.000 (0.244)	Loss 0.5915 (0.5153)	Prec@1 87.500 (87.713)	Prec@5 93.750 (95.455)
Epoch: [89][20/57]	Time 0.510 (0.471)	Data 0.379 (0.190)	Loss 0.6496 (0.5262)	Prec@1 85.938 (87.240)	Prec@5 92.969 (95.126)
Epoch: [89][30/57]	Time 0.355 (0.453)	Data 0.000 (0.171)	Loss 0.4592 (0.5180)	Prec@1 90.625 (87.903)	Prec@5 95.312 (95.186)
Epoch: [89][40/57]	Time 0.536 (0.446)	Data 0.413 (0.165)	Loss 0.3770 (0.5072)	Prec@1 92.188 (88.529)	Prec@5 97.656 (95.332)
Epoch: [89][50/57]	Time 0.369 (0.440)	Data 0.000 (0.166)	Loss 0.4453 (0.5025)	Prec@1 89.844 (88.664)	Prec@5 95.312 (95.374)
Test: [0/19]	Time 1.668 (1.668)	Loss 0.8600 (0.8600)	Prec@1 81.250 (81.250)	Prec@5 94.531 (94.531)
Test: [10/19]	Time 0.104 (0.463)	Loss 1.5666 (1.5452)	Prec@1 67.188 (67.827)	Prec@5 84.375 (84.659)
 * Prec@1 67.375 Prec@5 84.208

However, if I train this dataset using alexnet with a learning rate of 0.01:
python main.py --arch=alexnet --lr=0.01 dataset
The output after 90 epochs:

Epoch: [89][0/29]	Time 3.110 (3.110)	Data 3.040 (3.040)	Loss 4.7523 (4.7523)	Prec@1 5.469 (5.469)	Prec@5 19.922 (19.922)
Epoch: [89][10/29]	Time 0.189 (0.831)	Data 0.070 (0.700)	Loss 4.7577 (4.8041)	Prec@1 6.250 (5.611)	Prec@5 19.141 (17.685)
Epoch: [89][20/29]	Time 2.163 (0.831)	Data 2.079 (0.705)	Loss 4.8331 (4.8019)	Prec@1 4.688 (5.673)	Prec@5 19.531 (17.839)
Test: [0/10]	Time 3.048 (3.048)	Loss 4.6815 (4.6815)	Prec@1 8.203 (8.203)	Prec@5 23.047 (23.047)
 * Prec@1 7.458 Prec@5 22.833

You can see that with alexnet, after 90 epochs, Prec@1 is 7.458, while with resnet18 it is 67.375.

So it is a bit strange to see that the precision of alexnet grows so slowly.
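For what it's worth, the learning rate isn't constant over those 90 epochs. If I read main.py correctly, it decays the initial lr by 10x every 30 epochs, roughly like this (paraphrased, not copied verbatim):

def adjust_learning_rate(optimizer, epoch, initial_lr):
    # --lr=0.01 therefore means 0.01 for epochs 0-29, 0.001 for 30-59, 0.0001 for 60-89
    lr = initial_lr * (0.1 ** (epoch // 30))
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr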

It's probably because AlexNet doesn't use BatchNorm, which is what stabilizes training in ResNet. I'd also look into data normalization; you might be doing something wrong there.
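One way to see the BatchNorm difference (a small sketch of mine, not something from main.py) is to count the BatchNorm layers in the two architectures:

import torch.nn as nn
import torchvision.models as models

def count_bn(model):
    return sum(isinstance(m, nn.BatchNorm2d) for m in model.modules())

print(count_bn(models.alexnet()))   # 0  -> very sensitive to the learning rate and to input scaling
print(count_bn(models.resnet18()))  # 20 -> BN after every conv makes training far more forgiving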

I faced the same problem.
In your training loop, make sure you call optimizer.zero_grad(), and then, after computing the loss, call loss.backward().
Hope this helps.
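For reference, the usual order inside the loop is shown below. The imagenet example's train() already does all of this, so treat it as a checklist rather than a fix (the tiny model here is just a placeholder):

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 3)                  # placeholder model, just to show the step order
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

images = torch.randn(4, 10)
target = torch.tensor([0, 1, 2, 0])

output = model(images)                    # forward pass
loss = criterion(output, target)
optimizer.zero_grad()                     # clear gradients from the previous step
loss.backward()                           # backpropagate
optimizer.step()                          # update the weights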