Splitting data with percentage information

pastelDiplo · May 22, 2020, 10:00am

Hello, everyone!

The data I use to train my model do not fall cleanly into classes. For example, an image belongs 15% to folder A and 85% to folder B, so there is no clear distinction. How can I split my data in this case? How should I create the train and val folders so that I can use this percentage information?

Thank you!

Kushaj · May 22, 2020, 5:43pm

Suppose you have two classes, 0 and 1. If an image belongs 15% to class 0 and 85% to class 1, you can interpolate the target as 0*0.15 + 1*0.85 = 0.85, which is the same as using the probability of the image belonging to class 1 as its label.
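As a rough sketch of that interpolation in plain Python: `soft_label` and `binary_cross_entropy` are hypothetical helper names, and using binary cross-entropy as the loss is an assumption, not something stated in this thread. It is shown only because that loss accepts any target in [0, 1], not just 0 or 1.

```python
import math

def soft_label(p_class0: float, p_class1: float) -> float:
    """Interpolate class indices 0 and 1 by their membership fractions."""
    return 0 * p_class0 + 1 * p_class1

def binary_cross_entropy(pred: float, target: float) -> float:
    """Binary cross-entropy for one sample; target may be any value in [0, 1]."""
    eps = 1e-12  # clamp to avoid log(0)
    pred = min(max(pred, eps), 1 - eps)
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))

# Image that belongs 15% to class 0 and 85% to class 1.
label = soft_label(0.15, 0.85)

# The loss is lower when the prediction matches the soft label
# than when the model is overconfident about class 1.
loss_at_target = binary_cross_entropy(0.85, label)
loss_overconfident = binary_cross_entropy(0.99, label)
```

With a soft target like this, the model is pushed toward predicting 0.85 for that image instead of a hard 1.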

pastelDiplo · May 22, 2020, 7:24pm

Thank you for your reply. One part is still unclear to me. Let’s say I have two classes, A and B. Should I name the folders to encode the percentages (e.g. first folder: A15B85, second folder: A30B70), or simply name them “A” and “B”?

Kushaj · May 22, 2020, 9:00pm

For this task you cannot use a folder-per-class dataset layout. Instead, store all the images in a single folder. How you record that an image belongs 85% to B is up to you.

You can encode this information in the image file names, or create a separate CSV file that maps each file name to its probability.
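Both options could look roughly like the sketch below. The column names `filename` and `prob_b`, the `_b85` file-name suffix, and the helper names are all made up for illustration; any consistent scheme would work.

```python
import csv
import io
import re

# Hypothetical CSV contents; in practice this would be a file on disk.
CSV_TEXT = """filename,prob_b
img_001.png,0.85
img_002.png,0.70
"""

def load_soft_labels(csv_file) -> dict:
    """Read a CSV with 'filename' and 'prob_b' columns into a dict."""
    reader = csv.DictReader(csv_file)
    return {row["filename"]: float(row["prob_b"]) for row in reader}

def prob_from_filename(name: str) -> float:
    """Parse a percentage from a name like 'img_001_b85.png' (assumed scheme)."""
    match = re.search(r"_b(\d{1,3})\.", name)
    if match is None:
        raise ValueError(f"no probability encoded in {name!r}")
    return int(match.group(1)) / 100

# Option 1: labels stored in a separate CSV file.
labels = load_soft_labels(io.StringIO(CSV_TEXT))

# Option 2: label stored in the file name itself.
prob_b = prob_from_filename("img_001_b85.png")
```

A CSV is usually easier to correct later, since changing a label does not require renaming image files.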

Then you would have to write a custom dataset class that reads these labels and returns each image together with its probability.
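A minimal sketch of such a class, written without any framework so it runs as-is. `SoftLabelDataset`, `split_labels`, the `images` folder and the file names are hypothetical. The class only implements `__len__` and `__getitem__`, which is the interface a map-style PyTorch `Dataset` needs, so it could be adapted by subclassing `torch.utils.data.Dataset` and passing a real image loader. The train/val split is done on the list of file names rather than with separate folders.

```python
import random
from pathlib import Path

class SoftLabelDataset:
    """Map-style dataset returning (image, probability of class B) pairs."""

    def __init__(self, image_dir, labels, load_image=None):
        self.image_dir = Path(image_dir)
        self.items = sorted(labels.items())  # list of (filename, prob_b)
        # Placeholder loader returns the path; swap in a real image loader.
        self.load_image = load_image or (lambda path: path)

    def __len__(self):
        return len(self.items)

    def __getitem__(self, index):
        filename, prob_b = self.items[index]
        return self.load_image(self.image_dir / filename), prob_b

def split_labels(labels, val_fraction=0.2, seed=0):
    """Shuffle file names reproducibly and split them into train and val."""
    names = sorted(labels)
    random.Random(seed).shuffle(names)
    n_val = int(len(names) * val_fraction)
    val = {name: labels[name] for name in names[:n_val]}
    train = {name: labels[name] for name in names[n_val:]}
    return train, val

# Made-up labels: file name -> probability of belonging to class B.
labels = {f"img_{i:03d}.png": p
          for i, p in enumerate([0.85, 0.70, 0.10, 0.55, 0.95])}

train_labels, val_labels = split_labels(labels, val_fraction=0.2)
train_ds = SoftLabelDataset("images", train_labels)
val_ds = SoftLabelDataset("images", val_labels)
```

Each item from `train_ds[i]` is a pair whose second element is the soft label to train against.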