Pseudo Labelling for self-training

Hi,

I am trying to implement pseudo labelling for self-training.

I have a trained model which I use to run inference on unlabeled data. After getting the logits, I softmax them, which gives me a tensor of shape [7, 368, 640].

However, when I use torch.save() to store these predictions for later training on both the labeled and unlabeled datasets, I realise that the file is 76 MB for just one image.

This is too large for a dataset >100k images.
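For reference, the per-image step I run looks roughly like this (the model and loader below are just stand-ins for my actual trained network and unlabeled DataLoader):

import os
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for my real trained segmentation network and unlabeled DataLoader.
model = nn.Conv2d(3, 7, kernel_size=1)
unlabeled_loader = [torch.randn(1, 3, 368, 640) for _ in range(2)]

os.makedirs('pseudo_labels', exist_ok=True)
model.eval()
with torch.no_grad():
    for idx, image in enumerate(unlabeled_loader):
        logits = model(image)                # [1, 7, 368, 640]
        probs = F.softmax(logits, dim=1)[0]  # [7, 368, 640]
        torch.save(probs, os.path.join('pseudo_labels', f'{idx}.pt'))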

Is there another way I should implement pseudo labelling?

Each of these tensors should only take approx. 6.3 MB, as seen here:

import os
import torch

# A random tensor with the same shape as one prediction.
x = torch.randn(7, 368, 640)
print('expected size ', x.nelement() * 4 / 1024**2, 'MB')
> expected size  6.2890625 MB

torch.save(x, 'tmp.pt')
print('file size ', os.stat('tmp.pt').st_size / 1024**2, 'MB')
> file size  6.289786338806152 MB

Even if you could reduce the size by zipping the data, you would still need to unzip it again later when you process it.
Based on the shape (7 channels), it doesn’t seem you’ll be able to compress the outputs using a standard image format.
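If you still wanted to try the zip route, a rough sketch of the round trip would look like this (how much it actually shrinks depends entirely on your data; random floats barely compress at all):

import gzip
import io
import torch

x = torch.softmax(torch.randn(7, 368, 640), dim=0)

# Serialize to an in-memory buffer, then write it out gzip-compressed.
buffer = io.BytesIO()
torch.save(x, buffer)
with gzip.open('tmp.pt.gz', 'wb') as f:
    f.write(buffer.getvalue())

# Every time you want to train on it, you pay for the decompression again.
with gzip.open('tmp.pt.gz', 'rb') as f:
    x_back = torch.load(io.BytesIO(f.read()))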