Let’s assume that we want to develop a model to count the number of words in an audio sample.
How would one approach the last layer and the loss? I have thought of the following ideas, but perhaps there is a better approach.

Use a linear layer with N dimensions, a softmax activation, and cross-entropy loss. This makes each count a class. The problem is that cross-entropy only looks at the probability assigned to the true class, so when the truth is 4 words, predicting 3 words or predicting 20 words is equally wrong.
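To make that concern concrete, here is a minimal pure-Python sketch. The 21 classes (counts 0 to 20), the probability values, and the helper names are all made up for illustration; it only shows that with a one-hot target the loss ignores where the rest of the probability mass sits.

```python
import math

NUM_CLASSES = 21   # hypothetical: one class per count, 0..20
TRUE_COUNT = 4

def cross_entropy(probs, true_index):
    # With a one-hot target, only the true class's probability enters the loss.
    return -math.log(probs[true_index])

def peaked_prediction(peak, peak_mass=0.9, true_mass=0.1):
    # Illustrative softmax output: most mass on `peak`, a little on the true count.
    probs = [0.0] * NUM_CLASSES
    probs[peak] = peak_mass
    probs[TRUE_COUNT] += true_mass
    return probs

loss_near = cross_entropy(peaked_prediction(3), TRUE_COUNT)   # model favours 3 words
loss_far = cross_entropy(peaked_prediction(20), TRUE_COUNT)   # model favours 20 words
print(loss_near, loss_far)  # the two losses are identical
```

Both predictions put the same 0.1 on the true count, so they are penalized equally even though one is off by 1 word and the other by 16.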

Use a linear layer with a single output, a ReLU activation, and MSE loss. This seems like the most direct approach, and perhaps it really is this simple.
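For comparison, a small sketch of how the regression option scores the same two mistakes. The raw outputs below are invented numbers, and rounding the output to get an integer count at inference time is my assumption, not something tested.

```python
def relu(x):
    # ReLU keeps the predicted count non-negative.
    return max(0.0, x)

def squared_error(pred, target):
    return (pred - target) ** 2

true_count = 4.0
err_near = squared_error(relu(3.0), true_count)    # off by 1 word
err_far = squared_error(relu(20.0), true_count)    # off by 16 words
print(err_near, err_far)

# Assumed inference step: round the real-valued output to the nearest integer.
predicted_count = round(relu(3.6))
print(predicted_count)
```

Here the penalty grows with the squared distance from the true count, which is exactly the ordering the classification option lacks.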

Use a linear layer with N dimensions, a softmax activation, and KL divergence loss. In this case, we can label the data so that, instead of putting all the probability on one count, some of it is distributed to the neighboring counts. For example, with a maximum count of 10, the count 5 can be labelled as
[0, 0, 0, 0.2, 0.6, 0.2, 0, 0, 0, 0]
instead of
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0].
In this case, predicting a 4 or a 6 is less wrong than predicting other values.
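A rough sketch of this labelling scheme, in pure Python. It assumes classes are the counts 1 to 10 (index = count - 1), uses the 0.6 / 0.2 split from the example, and renormalizes at the edges; the helper names and the "confident prediction" distributions are hypothetical.

```python
import math

def soft_label(count, num_classes=10, center=0.6, neighbor=0.2):
    # Spread the label over the true count and its immediate neighbours.
    target = [0.0] * num_classes
    idx = count - 1
    target[idx] = center
    for j in (idx - 1, idx + 1):
        if 0 <= j < num_classes:
            target[j] = neighbor
    total = sum(target)  # below 1 at the edges, so renormalize
    return [t / total for t in target]

def kl_divergence(target, pred, eps=1e-12):
    # KL(target || pred); terms with zero target mass contribute nothing.
    return sum(t * math.log(t / max(p, eps)) for t, p in zip(target, pred) if t > 0)

def confident_prediction(count, num_classes=10, peak=0.91):
    # Illustrative softmax output: most mass on one count, the rest spread evenly.
    rest = (1.0 - peak) / (num_classes - 1)
    probs = [rest] * num_classes
    probs[count - 1] = peak
    return probs

target = soft_label(5)
kl_exact = kl_divergence(target, confident_prediction(5))
kl_near = kl_divergence(target, confident_prediction(4))
kl_far = kl_divergence(target, confident_prediction(9))
print(kl_exact, kl_near, kl_far)  # increasing in this order
```

With these made-up predictions, being confident about 4 costs less than being confident about 9, though the advantage only reaches the immediate neighbors: a prediction of 7 and a prediction of 9 would still be penalized the same.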
One has to experiment to see what works best, but I was wondering whether anyone has found one of these, or another approach, to work best.