Hi,
I’m trying to build a basic RNN that uses GloVe for its embeddings. Currently I’m thinking I can apply the embeddings in a preprocessing stage, like so:
from functools import partial

from torch.utils.data import DataLoader
from torchtext import vocab
from torchtext.datasets import IMDB

embeddings = vocab.GloVe(name='6B', dim=50)

def embed_text(embeddings, x):
    # x is a (tokens, label) pair; look up one GloVe vector per token
    return embeddings.get_vecs_by_tokens(x[0]), x[1]

...

train_datapipe, test_datapipe = IMDB(split=('train', 'test'))
train_datapipe = train_datapipe.map(partial(process_labels, labels))
train_datapipe = train_datapipe.map(partial(tokenize_text, tokenize))
train_datapipe = train_datapipe.map(partial(embed_text, embeddings))
train_datapipe = train_datapipe.batch(batch_size=batch_size)
train_datapipe = train_datapipe.rows2columnar(["text", "label"])
train_dataloader = DataLoader(train_datapipe, batch_size=None)
This seems to make sense to me. However, the models I’m seeing online often look like this, with the embedding layer as a learnable parameter of the model:
class VanillaRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(VanillaRNN, self).__init__()
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.rnn = nn.RNN(hidden_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        embedded = self.embedding(x)   # x: (batch, seq_len) of token indices
        output, _ = self.rnn(embedded)
        out = output[:, -1, :]         # hidden output at the last time step
        return self.fc(out)
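(For what it’s worth, I’ve also seen snippets that load the pretrained GloVe matrix directly into the embedding layer instead of embedding during preprocessing, something like the line below, though I’m not sure whether that’s the intended pattern:)

import torch.nn as nn

# copy the GloVe matrix into the layer; freeze=True keeps the vectors fixed
embedding = nn.Embedding.from_pretrained(embeddings.vectors, freeze=True)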
Does the model need to have the embedding param if I’m already doing it in the preprocessing stage? (I’m thinking yes, for backpropagation purposes.) If it does need to be in the model, is there any point in also doing it during preprocessing? And finally, the DataLoader seems to spit out batches as lists, whereas embedding() requires a single tensor… how can I make these play nicely together?
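For that last question, here’s a rough sketch of the kind of collate step I imagine would be needed (collate_batch is just my own name, and I’m assuming each batch comes out of rows2columnar as a dict with "text" holding variable-length (seq_len, 50) tensors and "label" holding ints):

import torch
from torch.nn.utils.rnn import pad_sequence

def collate_batch(batch):
    # batch: {"text": [(seq_len_i, 50) float tensors], "label": [ints]} -- my assumption
    # pad every sequence to the longest one in the batch -> (batch_size, max_len, 50)
    padded = pad_sequence(batch["text"], batch_first=True)
    labels = torch.tensor(batch["label"])
    return padded, labels

Is something along those lines the usual way to do it, or am I overcomplicating this?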