Handling data loading in active learning

Hi, I am trying to implement a (pool-based) active learning scenario. I start with L, a small set of labeled samples (images, let’s say), and a large pool of unlabeled ones. At each iteration I train the model on L, rank the unlabeled samples by their “usefulness”, label the top N most useful ones, add them to L, and repeat. My question: what is the best way to handle loading the training data, given that L keeps growing throughout training?

I can see two options:

  1. Have all the data (labeled and unlabeled) in the same DataLoader (with placeholder labels for the initially unlabeled samples). Keep track of what gets labeled, and mask out the unlabeled samples during training. This sounds incredibly wasteful when n_unlabeled >> n_labeled.

  2. Create a new labeled DataLoader at the end of each iteration, once the selected N samples have been labeled and added to L. That’s what I am leaning towards, but… is there a better way? What if I were doing stream-based active learning? I would not want to recreate the DataLoader every time I label a new sample :-/

Thanks!


I would go for the second approach.
Note that creating a DataLoader instance is cheap. If you also load the data lazily (i.e. only when a sample is actually requested), creating the Dataset should be of no concern either.
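Here is a minimal sketch of the second approach, just to illustrate the idea. `PoolDataset`, `make_labeled_loader`, the fake image tensors, the `i % 2` "oracle" and the "take the next 10 indices" selection are all placeholders I made up; only `Dataset`, `Subset` and `DataLoader` come from `torch.utils.data`.

```python
import torch
from torch.utils.data import DataLoader, Dataset, Subset


class PoolDataset(Dataset):
    """Hypothetical pool holding every sample; items are produced lazily."""

    def __init__(self, n_samples):
        self.n_samples = n_samples
        self.labels = {}  # index -> label, filled in as samples get annotated

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        # Stand-in for lazily reading one image from disk.
        x = torch.full((3, 8, 8), float(idx))
        y = self.labels.get(idx, -1)  # -1 marks "not labeled yet"
        return x, y


def make_labeled_loader(pool, labeled_indices, batch_size=4):
    # Subset only keeps a reference to the pool plus a list of indices,
    # so rebuilding it (and the DataLoader) every round copies no data.
    subset = Subset(pool, sorted(labeled_indices))
    return DataLoader(subset, batch_size=batch_size, shuffle=True)


pool = PoolDataset(n_samples=100)
labeled = set(range(5))  # the initial labeled set L
for i in labeled:
    pool.labels[i] = i % 2  # placeholder for a real annotation

sizes = []
for round_idx in range(3):
    loader = make_labeled_loader(pool, labeled)  # fresh loader each round
    for x, y in loader:
        pass  # the training step would go here
    sizes.append(len(loader.dataset))

    # Placeholder acquisition step: a real setup would rank the unlabeled
    # samples with the model and pick the top N.
    unlabeled = [i for i in range(len(pool)) if i not in labeled]
    chosen = unlabeled[:10]
    for i in chosen:
        pool.labels[i] = i % 2
    labeled.update(chosen)
```

Since the loader only wraps an index list over the shared pool, the unlabeled samples are never touched during training, and calling `make_labeled_loader` again after each newly labeled sample (the stream-based case) should be just as cheap.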
