Some issues when I train a model to classify DOG and Not DOG

Suppose I have a dataset that contains ten kinds of animals, such as dog, cat, etc., each with 1000 pictures. It is a good dataset, and each class has the same number of samples. However, I am not going to do a ten-class animal classification task, but rather a two-class classification task. For example, I want to train a model to classify DOG and Not DOG. I put the pictures of DOG into one class and the pictures of Not DOG into another class. This becomes a data-imbalance problem. My question is: is it necessary for me to up-sample the DOG pictures to make the samples balanced?

Hi Zzuczy!

Yes, almost certainly. With a nine-to-one Not-DOG / DOG ratio your
model could achieve 90% accuracy just by classifying all samples as
Not-DOG.

So (assuming 10% DOG samples in your training set) you would
want to sample any given DOG sample 9 times as often as any
given Not-DOG sample. (If you’re working with batches it would be
helpful to have each batch be approximately half-and-half DOG and
Not-DOG samples.)
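A minimal sketch of this kind of re-weighted sampling using PyTorch's `WeightedRandomSampler` (the dataset here is a random stand-in, just to illustrate the weighting):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy stand-in dataset: 1000 DOG (label 1) and 9000 Not-DOG (label 0) samples.
labels = torch.cat([torch.ones(1000), torch.zeros(9000)]).long()
data = torch.randn(10000, 3)  # placeholder features
dataset = TensorDataset(data, labels)

# Give each DOG sample 9x the weight of each Not-DOG sample, so a batch
# is roughly half DOG / half Not-DOG in expectation.
weights = labels.float() * 8.0 + 1.0  # DOG -> 9.0, Not-DOG -> 1.0
sampler = WeightedRandomSampler(weights, num_samples=len(dataset),
                                replacement=True)
loader = DataLoader(dataset, batch_size=100, sampler=sampler)

x, y = next(iter(loader))
print(y.float().mean())  # close to 0.5 in expectation
```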

You could also use BCEWithLogitsLoss's pos_weight constructor
argument (set equal to 9.0, with DOG being your “positive” class),
but I would only recommend this if the size of your batches and your
training dataset is such that up-sampling causes a typical batch to
contain multiple duplicate DOG samples (which I do not think will be
true for you).
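For concreteness, the `pos_weight` alternative would look something like this (the logits and targets are made-up placeholders):

```python
import torch
import torch.nn as nn

# pos_weight scales the loss contribution of positive (DOG) samples.
# With a 9:1 Not-DOG / DOG ratio, pos_weight = 9.0 rebalances the loss.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(9.0))

logits = torch.randn(16)                       # hypothetical raw model outputs
targets = torch.randint(0, 2, (16,)).float()   # 1 = DOG, 0 = Not-DOG
loss = criterion(logits, targets)
```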

Best.

K. Frank


Thanks for your valuable reply, KFrank.
Actually, if this were a common two-class classification task, such as DOG/CAT classification, I would have no doubt about up-sampling the DOG pictures to make the numbers of DOG and CAT samples the same. Coming back to the DOG/Not DOG task: intuitively, I up-sampled the DOG pictures 9 times, and, to the contrary, the DOG class was optimized faster than the Not-DOG class. It’s confusing and interesting! Then I up-sampled the DOG class only 4 or 5 times, and it seemed that DOG and Not-DOG were optimized synchronously.
I’ll try to explain this experimental result. In the DOG/CAT classification task, intra-class samples have similar distributions, and they push the decision boundary toward the other class in almost the same direction. However, I think the DOG/Not DOG task is a little different. In that task, CAT and the other ANIMALS are treated together as the negative class, but they still have different data distributions. They each push the decision boundary toward the positive class, but only some of their forces combine into a resultant force, not all of them. So when the DOG pictures are up-sampled 9 times, the optimization of the DOG class becomes more dominant, because the samples in the DOG class all push in the same direction.
I’d like to hear what you think of the problem. Thanks again for your reply. I hope to continue the discussion with you!

BTW, why did you say

but I would only recommend this if the size of your batches and your
training dataset is such that up-sampling causes a typical batch to
contain multiple duplicate DOG samples

Hi Zzuczy!

Consider a batch of ten samples consisting of five copies of sample
DOG-1 and one copy each of samples Not-DOG-1, Not-DOG-2, …,
Not-DOG-5, with each of the ten samples weighted equally. This
batch contains (of course) duplicate DOG samples.

Now consider a batch of six samples, DOG-1, Not-DOG-1, Not-DOG-2,
…, Not-DOG-5, with a weight of 5.0 for the DOG-1 sample, and a
weight of 1.0 for each of the Not-DOG samples. Such a weighting
can be achieved by using pos_weight = 5.0.
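The comparison can be checked numerically with `BCEWithLogitsLoss`. Using `reduction="sum"` keeps the two batches directly comparable; the logit values below are made up for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical logits, one per sample.
dog_logit = torch.tensor([2.0])   # DOG-1
not_dog_logits = torch.randn(5)   # Not-DOG-1 ... Not-DOG-5

# Batch 1: five duplicate copies of DOG-1 plus the five Not-DOG samples.
logits_1 = torch.cat([dog_logit.repeat(5), not_dog_logits])
targets_1 = torch.cat([torch.ones(5), torch.zeros(5)])
loss_1 = nn.BCEWithLogitsLoss(reduction="sum")(logits_1, targets_1)

# Batch 2: one copy of DOG-1, weighted 5x via pos_weight, plus the Not-DOGs.
logits_2 = torch.cat([dog_logit, not_dog_logits])
targets_2 = torch.cat([torch.ones(1), torch.zeros(5)])
loss_2 = nn.BCEWithLogitsLoss(reduction="sum",
                              pos_weight=torch.tensor(5.0))(logits_2, targets_2)

print(torch.allclose(loss_1, loss_2))  # True: the summed losses match
```

Note that with the default `reduction="mean"` the two losses would differ by a constant factor, since the mean divides by the number of elements (10 vs. 6) rather than by the total weight.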

Two questions: What will be the relationship between the two losses
computed for these two batches? And which batch is likely to be
cheaper to process (both forward and backward)?

Best.

K. Frank


It seems that the two losses computed for these two batches are equal, and that the second batch is much cheaper to process because it doesn’t need to compute the duplicate DOG samples. Am I right? So how do you suggest dealing with this problem? It seems that you don’t like either of these two batches.