RNN model embeddings in preprocessing or in model

Hi,

I’m trying to make a basic RNN using GloVe for the embeddings. Currently I’m thinking I might be able to do this in a preprocessing stage, like so:

from functools import partial
from torch.utils.data import DataLoader
from torchtext import vocab
from torchtext.datasets import IMDB

embeddings = vocab.GloVe(name='6B', dim=50)

def embed_text(embeddings, x):
    # Look up the GloVe vector for each token; pass the label through unchanged
    return embeddings.get_vecs_by_tokens(x[0]), x[1]

...
train_datapipe, test_datapipe = IMDB(split=('train', 'test'))
train_datapipe = train_datapipe.map(partial(process_labels, labels))
train_datapipe = train_datapipe.map(partial(tokenize_text, tokenize))
train_datapipe = train_datapipe.map(partial(embed_text, embeddings))
train_datapipe = train_datapipe.batch(batch_size=batch_size)
train_datapipe = train_datapipe.rows2columnar(["text", "label"])
train_dataloader = DataLoader(train_datapipe, batch_size=None)

This seems to make sense to me. However, the models I’m seeing online often look like this, with the embedding as a layer of the model:

class VanillaRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(VanillaRNN, self).__init__()
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.rnn = nn.RNN(hidden_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        embedded = self.embedding(x)  # (batch, seq_len) -> (batch, seq_len, hidden)
        output, _ = self.rnn(embedded)
        out = output[:, -1, :]        # output at the last time step
        return self.fc(out)

Does the model need to have the embedding layer if I’m already doing the embedding in the preprocessing stage? (I’m thinking yes, for backpropagation purposes.) If I need it in the model, is there any point in doing it during preprocessing? And finally, it seems like the DataLoader spits out batches in a list format, whereas embedding() requires a single tensor… how can I make these play nice?

As long as you don’t have the intention to further train the embedding layer, I see no reason why you couldn’t consider it part of the data preparation. Given your method

    def forward(self, x):
        ...

the only difference would be what x looks like. With an embedding layer as part of the network, the shape would be (batch_size, seq_len), containing token indices. Without that layer, it would need to be (batch_size, seq_len, embed_size), containing the precomputed embedding vectors.
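For illustration, here’s a minimal sketch of what the model could look like once the embedding lookup happens in preprocessing (the class name PreEmbeddedRNN and the embed_size argument are placeholders of mine, not from your code):

import torch.nn as nn

class PreEmbeddedRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, output_size):
        super().__init__()
        # No nn.Embedding here: the input is already embedded
        self.rnn = nn.RNN(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x: (batch_size, seq_len, embed_size) of precomputed GloVe vectors
        output, _ = self.rnn(x)
        out = output[:, -1, :]  # output at the last time step
        return self.fc(out)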

This of course also means that you have to ensure that all sequences in a batch have the same length, for example by padding all sequences shorter than the longest one. If you can ensure this, you should be able to convert the output of the data loader into a tensor of shape (batch_size, seq_len, embed_size).
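As a rough sketch of that conversion (assuming each "text" entry in a batch from your rows2columnar pipeline is a (seq_len, embed_size) tensor, as get_vecs_by_tokens would produce; the helper name collate_embedded is mine):

import torch
from torch.nn.utils.rnn import pad_sequence

def collate_embedded(batch):
    # batch["text"]: list of (seq_len_i, embed_size) tensors of varying lengths.
    # pad_sequence zero-pads them to the longest sequence and stacks them
    # into a single (batch_size, max_seq_len, embed_size) tensor.
    texts = pad_sequence(batch["text"], batch_first=True)
    labels = torch.tensor(batch["label"])
    return texts, labels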

Is there any particular reason you want to do this?