I have created a GCN-CNN regression model; the main aim is to run it on a drug database and identify the drugs with the lowest binding affinity to a protein sequence. The model achieved a reasonably good MSE loss during training and made pretty good predictions on the test data.
When I used the model on a drug dataset to make predictions, the problem is that some of its predictions are negative. I think it is perhaps fine for a regression model to output negative predictions, and there are not many of them (just about 50-60 out of over 100,000 predictions in the drug database). But since my objective is to identify the drugs with the lowest binding values, do these negative values mean the predicted binding affinities are very low and actually among the lowest? They appear at the top when I sort the prediction list, so should I select them, or are they essentially random predictions that I should ignore in favor of the lowest positive values?
I tried getting rid of them using ReLU, but that obviously just turned these values into 0. Another issue is that if I use this data in a publication, it might look odd to cite negative values predicted by our model, or even 0.
Hi Uzair!
Short story: Consider training your regression model to predict the log of your binding affinities. Binding affinities that are (very) close to zero will map to log-affinities that are (large) negative numbers, so the fact that your regression can predict negative values will now be logically consistent.
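As a rough sketch of what I mean (the names `affinities`, `model`, and `inputs` here are just placeholders, not anything from your actual code): take the log of your targets before training, and exponentiate the model's output when you need affinities back:

```python
import torch

# Placeholder targets: positive binding affinities from your training set.
affinities = torch.tensor([0.10, 0.09, 0.001, 0.000001])

# Train against log-affinities instead of the raw values.
log_targets = torch.log(affinities)

criterion = torch.nn.MSELoss()
# loss = criterion(model(inputs), log_targets)  # your usual training step

# At inference time, map a predicted log-affinity back to an affinity.
# torch.exp() is strictly positive, so no negative (or zero) affinities
# can appear in your reported results.
predicted_log_affinity = torch.tensor([-13.8])
predicted_affinity = torch.exp(predicted_log_affinity)
```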
Yes, it is perfectly reasonable for a regression model to predict a negative value (that is close to zero) when the correct value would have been a positive value (that is close to zero).
Do the negative values mean the predicted affinities are actually among the lowest? Probably yes. (Some other problem could be throwing things off, but the mere fact that a predicted value is negative is not a red flag.) Should you select them rather than ignore them? Again, probably yes.
However, think about what a binding affinity means. You might consider a binding affinity of 0.09 to be only modestly weaker than a binding affinity of 0.10 (the difference being 0.01). But you might consider a binding affinity of 0.000001 to be a lot weaker than a binding affinity of 0.001, even though it’s “only” about 0.001 smaller.
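A quick numeric check makes this concrete: on a log scale, the second pair differs far more than the first, matching the intuition about how much weaker the binding is:

```python
import math

# The two comparisons from the example above.
for strong, weak in [(0.10, 0.09), (0.001, 0.000001)]:
    raw_diff = strong - weak
    log_diff = math.log(strong) - math.log(weak)
    print(f"{strong} vs {weak}: raw difference {raw_diff:.6f}, "
          f"log difference {log_diff:.2f}")

# 0.1 vs 0.09:    raw difference 0.010000, log difference 0.11
# 0.001 vs 1e-06: raw difference 0.000999, log difference 6.91
```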
If you look at things this way, then performing a regression on log-affinities (using a mean-squared-error loss) is logically the natural thing to do.
Best.
K. Frank