Hi Sally!
So just to confirm: Regardless of any intermediate steps taken in your
overall workflow (e.g., producing a fine-grained histogram or identifying
each individual gravel particle), your final desired goal is to infer a
five-fixed-bin probability distribution. Correct?
This strikes me as not the best way to go.
Your ground-truth is a (discrete) probability distribution and you wish
to predict a matching probability distribution. Cross-entropy is the
natural loss function for such a task. It is a natural measure of how
much two probability distributions differ, and root-mean-squared-error,
while not unreasonable, just isn’t as natural (and the lore is that it
doesn’t work as well for this kind of use case).
Note that pytorch’s CrossEntropyLoss was designed with classification
in mind, where you have an integer class label as your target, so it
doesn’t apply to your use case. But cross-entropy makes perfect sense
with a probabilistic target (sometimes called “soft labels”).
You can easily write your own probabilistic cross-entropy, as outlined
in this post:
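For reference, here is a minimal sketch of what such a soft-label
cross-entropy could look like (the name soft_cross_entropy is my own,
and this is not necessarily identical to the version in the linked post):

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits, target_probs):
    """Cross-entropy with probabilistic ("soft") targets.

    logits:       raw network outputs, shape [nBatch, 5]
    target_probs: ground-truth distributions, shape [nBatch, 5],
                  with each row summing to one
    """
    # log_softmax() (rather than log (softmax())) for numerical stability
    log_probs = F.log_softmax(logits, dim=1)
    # -sum_i p_i * log(q_i), averaged over the batch
    return -(target_probs * log_probs).sum(dim=1).mean()
```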
If you go this route, using Linear(32, 5) as your final layer is the
right thing to do, but you will not want to follow it with a softmax()
layer. (Both pytorch’s CrossEntropyLoss and the probabilistic version
I linked to have, for reasons of numerical stability, softmax() built
into them in the form of log_softmax().) That is, you want your
network to output the logits that correspond to your predicted
probabilities, rather than the probabilities themselves.
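Concretely (the layer sizes other than the final Linear(32, 5), and
the dummy data, are made up for illustration; soft_cross_entropy() is
from the sketch above):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(64, 32),   # placeholder for your earlier layers
    torch.nn.ReLU(),
    torch.nn.Linear(32, 5),    # final layer -- no softmax() after it
)

x = torch.randn(8, 64)                             # dummy input batch
target = torch.softmax(torch.randn(8, 5), dim=1)   # dummy five-bin distributions

logits = model(x)                          # raw logits, not probabilities
loss = soft_cross_entropy(logits, target)
loss.backward()
```

(If you do want actual probabilities at inference time, apply
softmax() to the logits then, outside of the loss computation.)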
I can’t say that changing to a cross-entropy loss will necessarily work
great, but it is the first thing I would try.
Training a network that generalizes to inputs that are not entirely of
the same character as the training set is an admirable goal, but is harder,
and not always achievable. Regularization techniques (e.g., weight
decay or dropout) do help reduce overfitting, and can help with
generalization to fundamentally different inputs, but are not a panacea.
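Both are cheap to try in pytorch (the specific values here are just
placeholders to tune):

```python
import torch

model = torch.nn.Linear(32, 5)   # stand-in for your actual network

# weight decay is just an argument to the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# dropout is a layer you insert between your existing layers
drop = torch.nn.Dropout(p=0.5)
```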
I would recommend increasing the diversity of your training set to
more completely represent the character of the real-world images
that you will want to analyze “in the wild.”
(More, and better, and more fully representative data, although
potentially expensive, is usually the best first step towards improving
the practical performance of a network.)
Just to confirm, because your network works reasonably well on your
validation images, this is not an issue of overfitting, per se, but one
of generalization.
This is a reasonable number, but more data might well help. (Again,
the size of your images may help in the sense that different regions
of a given image might be sufficiently independent so as to count,
roughly speaking, as separate data samples.)
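If you want to exploit this, taking random crops as an augmentation is
an easy way to do so (a sketch using torchvision; the crop size is a
placeholder you would tune to the scale of your grains):

```python
import torchvision.transforms as T

# each epoch then sees a different region of each large image,
# effectively multiplying the number of roughly independent samples
train_transform = T.Compose([
    T.RandomCrop(512),   # placeholder crop size
    T.ToTensor(),
])
```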
How much computing power do you have for this project? Can you
reasonably do a lot of long training runs on large models with large
datasets?
Are you downsampling because of hardware limitations, or are you
able to train with the full 1500x1500 images (or more modestly
downsampled images)? Picturing an image of gravel, I could well
believe that downsampling to 224x224 would lose lots of relevant
information.
Well, if you’ve already tried non-neural-network “particle analysis” and
it doesn’t work, I don’t have much to offer.
In this vein, however, if a person looks at one of your images, are
the individual grains of gravel readily apparent? Do most pixels in
your images belong to a grain of gravel, or do you have a significant
fraction of “background” pixels (e.g., maybe a piece of paper that
your gravel was spread out on)?
How often in your images is a grain of gravel partially occluded by
another grain?
One approach to “particle analysis” – to identify individual grains
of gravel – would be to train (or fine-tune) a segmentation network.
This could either be semantic segmentation, where you label each
pixel as, for example, gravel body vs. gravel boundary and/or
background, or instance segmentation, where the network further
infers that one set of “gravel pixels” belongs to instance-A of a
gravel grain, and some other set belongs to instance-B.
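As a sketch of the semantic-segmentation variant (this uses
torchvision’s fcn_resnet50; the three classes follow the labeling
scheme above, and the data here are dummies; in practice you would
likely start from pretrained weights and fine-tune):

```python
import torch
import torchvision

# three classes: background, gravel body, gravel boundary
model = torchvision.models.segmentation.fcn_resnet50(num_classes=3)

images = torch.randn(2, 3, 512, 512)          # dummy image batch
masks = torch.randint(0, 3, (2, 512, 512))    # dummy per-pixel class labels

out = model(images)["out"]                    # per-pixel logits, [2, 3, 512, 512]
loss = torch.nn.functional.cross_entropy(out, masks)
loss.backward()
```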
The significant disadvantage of segmentation is that your training
images would need to be annotated with pixel-by-pixel labels, a
tedious and potentially expensive task.
Would it be possible for you to post a representative image from your
training set? If the image is too large, posting a cropped, rather
than downsampled, image would be preferable.
Best.
K. Frank