Beginner: Multiple Outputs / Partial Dataset


I’m trying to build a model with PyTorch + pytorch_transformers, using BERT as the pre-trained base. The issue is that I have several datasets, all in English, whose labels only partially overlap.

Is it possible to create a model that uses pre-trained BERT (or any other model) and is fed data from multiple datasets to predict multiple outputs?

For example, I have 4 text datasets:
Dataset A contains [ ValueA, ValueB, ValueC ]
Dataset B contains [ ValueA, ValueB, ValueC, ValueD, ValueE, ValueF ]
Dataset C contains [ ValueA, ValueB ]
Dataset D contains [ ValueD, ValueE, ValueF ]

Since all of them are in English, I hope to use BERT to take advantage of the similarity between the datasets.

Approaches I have thought of:

  • Create a general y, and fill in 0. for the fields a given dataset doesn’t have. In this case, my prediction would be [ ValueA, ValueB, ValueC, ValueD, ValueE, ValueF ]

I’m not an NLP expert, so take this with a grain of salt. :wink:

The use case sounds like multi-label classification in general, i.e. multiple classes can be active/inactive in the target.
One approach would be to use a linear layer as your model output, returning [batch_size, nb_classes], and pass it, along with a target of zeros and ones of the same shape, to nn.BCEWithLogitsLoss.
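
A minimal sketch of that idea, to make the shapes concrete. The names LABELS, make_target and MultiLabelHead are made up for illustration, the hidden size of 768 is an assumption (the usual BERT-base feature size), and random features stand in for the BERT output so the snippet runs without downloading a model:

```python
import torch
import torch.nn as nn

# Hypothetical unified label order covering all four datasets.
LABELS = ["ValueA", "ValueB", "ValueC", "ValueD", "ValueE", "ValueF"]


def make_target(active_labels):
    """Build a multi-hot target over LABELS; labels not listed stay at 0."""
    target = torch.zeros(len(LABELS))
    for name in active_labels:
        target[LABELS.index(name)] = 1.0
    return target


class MultiLabelHead(nn.Module):
    """Linear layer mapping encoder features to one logit per label."""

    def __init__(self, hidden_size, nb_classes):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, nb_classes)

    def forward(self, features):
        # Output shape: [batch_size, nb_classes]
        return self.classifier(features)


torch.manual_seed(0)
batch_size, hidden_size = 4, 768  # 768 is an assumption (BERT-base)

# Stand-in for the pooled BERT output; a real model would produce this.
features = torch.randn(batch_size, hidden_size)

head = MultiLabelHead(hidden_size, len(LABELS))
logits = head(features)

# Targets have the same shape as the logits and contain zeros and ones.
targets = torch.stack([
    make_target(["ValueA", "ValueC"]),
    make_target(["ValueD"]),
    make_target([]),
    make_target(["ValueB", "ValueE", "ValueF"]),
])

# BCEWithLogitsLoss applies the sigmoid internally, so pass raw logits.
criterion = nn.BCEWithLogitsLoss()
loss = criterion(logits, targets)
loss.backward()

# At inference time, apply the sigmoid and threshold each label separately.
preds = (torch.sigmoid(logits) > 0.5).float()
```

Note that with this setup a 0 in the target is treated as a real negative, so a label that a dataset simply does not annotate is trained as "inactive" for those samples.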

Let me know if this would work or if I’m thinking too naively about your use case.
