Hi!
The actual task at hand is a regular text classification problem where I already achieve 87% accuracy with a linear SVM. After some research, I decided to try PyTorch to push that above 90% if possible.
So I started with the word embeddings tutorial at https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html to create embeddings for my training set, using all the constants from the tutorial.
However, I noticed that the training is extremely slow. See the code below for details.
len(trigrams) = 3,018,172
len(vocab) = 758,019
This works out to about 148.6 seconds per 1000 trigrams.
What I have done so far to improve performance:
- reduce EMBEDDING_DIM
- reduce the vocab size with a custom tokenizer
- remove duplicate trigrams
- map all trigrams to vocabulary indices before training
What else can I do?
- Would it help to pre-generate the tensors that are currently created in the inner loop? (I sketched what I mean right below.)
- Moreover, is it possible to run the training in parallel / in batches (I remember TensorFlow having this kind of mini-batching), so that I could push through e.g. 100 trigrams at once? If so, how? (There is a rough sketch of what I imagine after the code.)
- Or could I even push all trigrams through the net at once (batch size = len(trigrams))?
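Regarding the first point, what I mean by pre-generating is roughly this (untested sketch; it just builds one big index tensor from the trigrams list in the code below, instead of calling torch.tensor() millions of times inside the loop):

trigram_tensor = torch.tensor(trigrams, dtype=torch.long)  # shape (num_trigrams, 3)
contexts = trigram_tensor[:, :2]  # first two indices of each trigram
targets = trigram_tensor[:, 2]    # third index is the prediction target

The inner loop would then only slice into these tensors instead of constructing new ones every iteration.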
Thanks in advance.
CODE:
from collections import defaultdict

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from sklearn.feature_extraction.text import CountVectorizer

trigrams = []
vec = CountVectorizer(tokenizer=tokenizer)
# vocab maps every previously unseen token to the next free integer index
vocab = defaultdict()
vocab.default_factory = vocab.__len__
analyze = vec.build_analyzer()
for doc in train_data:
    feature_idxs = [vocab[feature] for feature in analyze(doc)]
    trigrams.extend(zip(feature_idxs, feature_idxs[1:], feature_idxs[2:]))
trigrams = list(set(trigrams))  # remove duplicate trigrams
CONTEXT_SIZE = 2
EMBEDDING_DIM = 8

class NGramLanguageModeler(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 64)
        self.linear2 = nn.Linear(64, vocab_size)

    def forward(self, inputs):
        # .view((1, -1)) flattens the context embeddings into a single row,
        # i.e. the model currently only accepts one context at a time
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs
losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(30):
    total_loss = 0
    for trigram in trigrams:  # ONE ITERATION HERE IS REALLY SLOW
        model.zero_grad()
        # first two indices form the context, the third is the target
        context_idxs = torch.tensor(trigram[:-1], dtype=torch.long)
        log_probs = model(context_idxs)
        loss = loss_function(log_probs, torch.tensor([trigram[-1]], dtype=torch.long))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    losses.append(total_loss)
    print(total_loss)
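And this is roughly how I imagine the mini-batch version (untested sketch; the batch size of 100 is arbitrary, and I assume the .view((1, -1)) in forward() would need to become something like .view((inputs.shape[0], -1)) so the model accepts a whole batch of contexts):

BATCH_SIZE = 100  # arbitrary, just for illustration

trigram_tensor = torch.tensor(trigrams, dtype=torch.long)
contexts = trigram_tensor[:, :2]
targets = trigram_tensor[:, 2]

for epoch in range(30):
    total_loss = 0
    perm = torch.randperm(contexts.shape[0])  # shuffle once per epoch
    for start in range(0, contexts.shape[0], BATCH_SIZE):
        idx = perm[start:start + BATCH_SIZE]
        model.zero_grad()
        log_probs = model(contexts[idx])               # (batch, vocab_size)
        loss = loss_function(log_probs, targets[idx])  # targets[idx]: (batch,)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(total_loss)

Is that the right direction, or is there a more idiomatic way (DataLoader, etc.)?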