Extending existing dataset

nhaH-luaP · April 16, 2022, 2:00pm

Hello everyone,
(It’s my first post on this website so please correct me if I’m doing anything wrong).
I am currently working on an active learning cycle that classifies MNIST pictures. To keep track of the data, I use index lists and create a new Subset for each cycle to train my NN. Now I want to improve performance by using pseudo-labeled pictures to increase the size of my training pool. Getting those samples and their pseudo-labels shouldn’t be a problem, but I’m missing the idea of how to add them to my training procedure. Is there an easy way to add them to the MNIST training dataset manually? Or should I create a new dataset and combine the two? What do you think is best?

Greetings,
luaP

nivek · April 18, 2022, 2:18pm

If you have a separate Dataset of pseudo-labeled pictures, you should be able to combine it with the existing MNIST Dataset by using ConcatDataset. Let me know if I am not understanding you correctly.
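
As a rough sketch of that idea (not tested against your actual setup): the snippet below uses random tensors as a stand-in for MNIST so it runs without a download, and the index lists and the pseudo-labels are made-up placeholders. It assumes both datasets return samples in the same format (same image shape and label type), which you would need to check for the real torchvision MNIST dataset and its transforms.

```python
import torch
from torch.utils.data import ConcatDataset, Subset, TensorDataset

# Stand-in for MNIST (assumption): 100 random 1x28x28 "images" with labels 0-9.
images = torch.rand(100, 1, 28, 28)
labels = torch.randint(0, 10, (100,))
mnist_like = TensorDataset(images, labels)

# Labeled pool tracked by an index list, as described in the question.
labeled_indices = [0, 1, 2, 3, 4]
labeled_subset = Subset(mnist_like, labeled_indices)

# Pseudo-labeled samples: images from the unlabeled pool paired with
# predicted labels. The predictions here are hard-coded placeholders.
pseudo_indices = [10, 11, 12]
pseudo_images = images[pseudo_indices]
pseudo_labels = torch.tensor([7, 1, 3])
pseudo_dataset = TensorDataset(pseudo_images, pseudo_labels)

# One dataset that serves the labeled subset first, then the pseudo-labeled samples.
train_dataset = ConcatDataset([labeled_subset, pseudo_dataset])
```

ConcatDataset only stores references to the datasets it wraps, so rebuilding it at the start of each cycle should be cheap compared to training.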

nhaH-luaP · April 19, 2022, 9:19am

That sounds like a valid plan, thank you for your reply. A follow-up question: is that the most efficient way, even if I’m creating the pseudo-labeled pictures during a cycle? Is it possible, and if so more efficient, to add the samples directly to the existing dataset instead of creating a new one just to use ConcatDataset?

nivek · April 19, 2022, 3:18pm

I am not sure how you are creating your pseudo-labeled pictures, but it is possible to create additional samples during iteration, specifically in __getitem__ or __iter__ of your Dataset, similar to how a transform is applied.

You can return both the original sample and the pseudo-labeled sample from __getitem__ if you wish, assuming the rest of your code handles that behavior properly. You may have to change your collate_fn, etc.
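
Here is a minimal sketch of what that could look like, under some assumptions: PairDataset and flatten_pairs are hypothetical names, the data is random tensors instead of MNIST, and the "extra" sample is just a flipped copy that reuses the original label as a placeholder for however you actually generate your pseudo-labeled pictures.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class PairDataset(Dataset):
    """Hypothetical dataset returning the original sample plus one extra sample."""

    def __init__(self, images, labels):
        self.images = images
        self.labels = labels

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image, label = self.images[idx], self.labels[idx]
        # Placeholder generation step (assumption): a horizontally flipped
        # copy that reuses the label. Swap in your own pseudo-labeling logic.
        extra_image = torch.flip(image, dims=[-1])
        return (image, label), (extra_image, label)


def flatten_pairs(batch):
    """Custom collate_fn: unpack each (original, extra) pair into one flat batch."""
    images, labels = [], []
    for original, extra in batch:
        for image, label in (original, extra):
            images.append(image)
            labels.append(label)
    return torch.stack(images), torch.stack(labels)


# Random stand-in data (assumption), 8 samples of shape 1x28x28.
images = torch.rand(8, 1, 28, 28)
labels = torch.randint(0, 10, (8,))

loader = DataLoader(PairDataset(images, labels), batch_size=4, collate_fn=flatten_pairs)
batch_images, batch_labels = next(iter(loader))
```

With this collate function, a batch_size of 4 yields 8 images per batch, because every index contributes two samples. The training loop itself can stay unchanged since it still receives one image tensor and one label tensor.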