Performance: slow training of embeddings


The task at hand is ordinary text classification, where I already achieve 87% accuracy with a linear SVM. After some research, I turned to PyTorch to push that above 90%, if possible.

So, I started here to create embeddings for my training set using all the constants from the tutorial.

However, I noticed that training is extremely slow. See the code below for details.

len(trigrams) = 3,018,172
len(vocab) = 758,019

Results in 148.6 sec / 1000 trigrams

What I did to improve the performance so far:

  1. reduce EMBEDDING_DIM
  2. reduce vocab size by using a custom tokenizer
  3. remove duplicate trigrams
  4. map all trigram tokens to vocabulary indices before training

What else can I do?

  • Would it help to pre-generate the tensors that are currently created in the inner loop?

  • Moreover, is it possible to run the training in parallel or in batches (I remember TensorFlow having this sort of mini-batch)? Then I could push 100 trigrams through at once. If so, how?

  • Or even push all the trigrams through the net at once (batch size = len(trigrams))?

Thanks in advance.


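from collections import defaultdict

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from sklearn.feature_extraction.text import CountVectorizer

# train_data, tokenizer, EMBEDDING_DIM and CONTEXT_SIZE are not shown here.
trigrams = []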
vec = CountVectorizer(tokenizer=tokenizer)
vocab = defaultdict()
vocab.default_factory = vocab.__len__
analyze = vec.build_analyzer()
for doc in train_data:
    feature_idxs = [vocab[feature] for feature in analyze(doc)]
    trigrams.extend(zip(feature_idxs, feature_idxs[1:], feature_idxs[2:]))
trigrams = list(set(trigrams))


class NGramLanguageModeler(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 64)
        self.linear2 = nn.Linear(64, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

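for epoch in range(30):
    total_loss = 0
    for trigram in trigrams:  # ONE ITERATION HERE IS REALLY SLOW
        context_idxs = torch.tensor(trigram[:-1], dtype=torch.long)
        model.zero_grad()  # clear gradients left over from the previous step
        log_probs = model(context_idxs)
        loss = loss_function(log_probs, torch.tensor([trigram[-1]], dtype=torch.long))
        loss.backward()  # backpropagate
        optimizer.step()  # update the parameters
        total_loss += loss.item()
    losses.append(total_loss)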
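
On the batching question, here is a minimal sketch rather than a tuned solution. It assumes the trigrams are tuples of indices as built above; the toy trigram list, the class name BatchedNGramLanguageModeler, and BATCH_SIZE are made up for illustration. The two changes are that forward reshapes to (batch, -1) instead of (1, -1), and that all trigrams are converted to one tensor before the loop instead of one tensor per iteration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class BatchedNGramLanguageModeler(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 64)
        self.linear2 = nn.Linear(64, vocab_size)

    def forward(self, inputs):
        # inputs: (batch, context_size)
        # embeds: (batch, context_size * embedding_dim)
        embeds = self.embeddings(inputs).view(inputs.size(0), -1)
        out = F.relu(self.linear1(embeds))
        return F.log_softmax(self.linear2(out), dim=1)


# Toy stand-in for the real trigram list (hypothetical data).
trigrams = [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 0), (4, 0, 1)]
VOCAB_SIZE, EMBEDDING_DIM, CONTEXT_SIZE, BATCH_SIZE = 5, 10, 2, 2

# Build the tensors once, outside the training loop.
data = torch.tensor(trigrams, dtype=torch.long)
contexts, targets = data[:, :-1], data[:, -1]

torch.manual_seed(0)
model = BatchedNGramLanguageModeler(VOCAB_SIZE, EMBEDDING_DIM, CONTEXT_SIZE)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(20):
    perm = torch.randperm(len(data))  # reshuffle every epoch
    total_loss = 0.0
    for start in range(0, len(data), BATCH_SIZE):
        idx = perm[start:start + BATCH_SIZE]
        optimizer.zero_grad()
        log_probs = model(contexts[idx])           # (batch, vocab_size)
        loss = loss_function(log_probs, targets[idx])
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * len(idx)       # undo the batch mean
    epoch_losses.append(total_loss / len(data))
```

Setting BATCH_SIZE = len(trigrams) would correspond to the "all at once" variant; whether that fits in memory depends on the vocabulary size, because the output of linear2 has shape (batch, vocab_size).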

Have you tried using sparse gradients (with sparse=True in the embedding constructor)?

That gives me roughly 10% improvement:

# from
100 145.22 sec/1000  ->  7305.26 min total
200 140.84 sec/1000  ->  7084.80 min total
300 139.62 sec/1000  ->  7023.46 min total

# to
100 129.39 sec/1000  ->  6508.96 min total
200 129.43 sec/1000  ->  6510.97 min total
300 130.56 sec/1000  ->  6567.64 min total
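
One possible explanation for the modest gain (an assumption, not something measured here): sparse=True only makes the gradient of the embedding table sparse. linear2 still maps 64 features to the whole vocabulary, so its gradient and the softmax stay dense on every step. The sketch below counts parameters per layer; EMBEDDING_DIM = 10 and CONTEXT_SIZE = 2 are assumed values, since the question does not state them.

```python
import torch.nn as nn

VOCAB_SIZE = 758_019   # len(vocab) reported in the question
EMBEDDING_DIM = 10     # assumption: the question does not give the value
CONTEXT_SIZE = 2       # assumption: a trigram is 2 context words + 1 target

layers = {
    "embeddings": nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM, sparse=True),
    "linear1": nn.Linear(CONTEXT_SIZE * EMBEDDING_DIM, 64),
    "linear2": nn.Linear(64, VOCAB_SIZE),
}

# Number of trainable values in each layer (weights plus biases).
counts = {
    name: sum(p.numel() for p in layer.parameters())
    for name, layer in layers.items()
}
for name, n in counts.items():
    print(f"{name:10s} {n:>12,d} parameters")
```

Under these assumptions linear2 holds several times more parameters than the embedding table, and all of them receive a dense update for every trigram.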

I ran into a very similar problem. Have you tried using a GPU?

Here is my code:

import torch.optim as optim
import torch
import torch.nn as nn
import numpy as np
import os
import time
if __name__ == '__main__':
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'
    syn0 = torch.randn((2829,100),requires_grad=True,device='cuda')
    syn1 = torch.randn((2829,100),requires_grad=True,device='cuda')
    optimizer = optim.SGD([syn0, syn1], lr=0.025)
    Lossfunc = nn.BCELoss(reduction='sum').cuda()
    start1 = time.time()
    for index in range(40000):
        word_in = np.random.randint(low=0, high=2829, size=32)
        word_out = np.random.randint(low=0, high=2829, size=32)
        if index%10000 == 0:
            print(('%d of 40000 (%.2f%%)')%(index,index / 400.0))
        word_in1 = torch.cuda.LongTensor(word_in)
        word_out1 = torch.cuda.LongTensor(word_out)
        label = torch.cuda.FloatTensor([1]+ [0]*31)
        emb_u = nn.functional.embedding(word_in1,syn0,sparse=True)
        emb_v = nn.functional.embedding(word_out1,syn1,sparse=True)
        outs = torch.sigmoid(torch.sum(torch.mul(emb_u, emb_v), dim=-1))
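        loss = Lossfunc(outs,label)
        # Backpropagate and update; without these steps the optimizer is never used.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()  # wait for queued GPU work before reading the clock
    print (time.time()-start1)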

I don’t know how to speed it up.

I haven’t tried a GPU yet. I thought a CPU would be enough for my relatively small problem.
Besides setting sparse=True, I also moved the tensor creation out of the tight loop.

However, it’s still not significantly faster. As of now, I see two options I can still pursue:

A) using batches instead of feeding trigrams one by one. Does somebody know how to do this?
B) using a GPU
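
Both options can be combined. The following is a sketch under stated assumptions, not a drop-in replacement: the random index data, the sizes, and the nn.Sequential model are placeholders for the real trigrams and model. TensorDataset and DataLoader take care of batching and shuffling, and the device line falls back to the CPU when no GPU is available.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Option B: use the GPU if there is one, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder data: 1000 random trigrams over a 50-word vocabulary.
VOCAB_SIZE, EMBEDDING_DIM, CONTEXT_SIZE = 50, 8, 2
torch.manual_seed(0)
data = torch.randint(0, VOCAB_SIZE, (1000, CONTEXT_SIZE + 1))

# Option A: let DataLoader hand out mini-batches of 100 trigrams.
dataset = TensorDataset(data[:, :-1], data[:, -1])
loader = DataLoader(dataset, batch_size=100, shuffle=True)

model = nn.Sequential(
    nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM),
    nn.Flatten(),  # (batch, context, dim) -> (batch, context * dim)
    nn.Linear(CONTEXT_SIZE * EMBEDDING_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, VOCAB_SIZE),
).to(device)

# CrossEntropyLoss combines log_softmax and NLLLoss.
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

batches_seen = 0
for contexts, targets in loader:  # one epoch
    contexts, targets = contexts.to(device), targets.to(device)
    optimizer.zero_grad()
    loss = loss_function(model(contexts), targets)
    loss.backward()
    optimizer.step()
    batches_seen += 1
```

The batch size of 100 mirrors the number mentioned in the question; the best value depends on the hardware and has to be found by measurement.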

Hi @SimonW, to avoid any misunderstanding about sparse=True: if I set it to False, does that mean that during training the gradient matrix has the same shape as the Embedding weights, even though most of its entries will be zeros?

If so, for a very large Embedding, sparse=False should make training very slow, since the gradient will be a huge dense matrix, right?
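
The shape question can be checked directly. This is a small sketch with a made-up 1000 x 4 embedding and three looked-up indices; it only inspects the gradients and says nothing about speed.

```python
import torch
import torch.nn as nn

idx = torch.tensor([1, 5, 7])  # three distinct rows are looked up

# sparse=False (the default): the gradient is a dense tensor.
dense_emb = nn.Embedding(1000, 4, sparse=False)
dense_emb(idx).sum().backward()
dense_grad = dense_emb.weight.grad
nonzero_rows = int((dense_grad.abs().sum(dim=1) > 0).sum())
print(dense_grad.shape, dense_grad.is_sparse, nonzero_rows)

# sparse=True: the gradient is a sparse tensor that stores only the touched rows.
sparse_emb = nn.Embedding(1000, 4, sparse=True)
sparse_emb(idx).sum().backward()
sparse_grad = sparse_emb.weight.grad.coalesce()
stored_rows = sparse_grad.indices().shape[1]
print(sparse_grad.shape, sparse_grad.is_sparse, stored_rows)
```

In the dense case the gradient has the full shape of the weight matrix with only the looked-up rows non-zero; in the sparse case the logical shape is the same, but only those rows are stored. Note that not every optimizer accepts sparse gradients; optim.SGD without momentum, optim.Adagrad and optim.SparseAdam do.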