NLP in Pytorch Tutorial

Hi, I have been working on a tutorial as a fast introduction to deep learning for NLP with PyTorch. I feel that the current tutorials focus mostly on CV. There are some NLP examples out there, but I didn’t find anything for beginners (which is what I am looking for, since we are using PyTorch for an NLP class I am TA’ing). So I wrote a tutorial. It assumes NLP knowledge and familiarity with neural nets, but not with deep learning programming.

I wanted to post the tutorial here to get feedback and also because I figure it may be helpful to some people. There are some quick explanations and a lot of code, with a few working examples (nothing state of the art, just things to get an idea). I still need to add a BiLSTM-CRF tagger example for NER, which will be the most complicated one. Here’s the link.

If you look at it, I’m happy to get any feedback. I want it to be useful to the students in my class.


Also check out https://github.com/spro/practical-pytorch, which has some NLP tutorials.


Hey, nice work and thanks for sharing!

I have some minor suggestions:

  1. In make_bow_vector (the cell evaluated as 91), create vec using torch.zeros(len(word_to_idx)).
  2. I’d mention that NLLLoss expects log-probabilities, but that you could also use CrossEntropyLoss if you removed the log_softmax (see the sketch just after this list).
  3. I’d split the log_probs line in cell 101 into a few more lines; it’s not very readable with that indentation.
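
For point 2, here is a minimal sketch of that equivalence (the scores and targets are made up just for illustration):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up scores for a batch of 2 examples over a 5-word vocab
scores = torch.randn(2, 5)
targets = torch.tensor([1, 3])

# NLLLoss expects log-probabilities, so it is paired with log_softmax ...
loss_nll = nn.NLLLoss()(F.log_softmax(scores, dim=1), targets)
# ... while CrossEntropyLoss applies log_softmax internally on raw scores
loss_ce = nn.CrossEntropyLoss()(scores, targets)

print(torch.allclose(loss_nll, loss_ce))  # True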

Also, I’d recommend using tensor indexing to create the BoW vectors, as that will likely be faster than iterating over a list in the tensor constructor.
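
For example, something along these lines (the vocabulary and sentence here are made up, not from the tutorial):

import torch

word_to_ix = {"the": 0, "cat": 1, "sat": 2, "mat": 3}  # toy vocab for illustration
sentence = ["the", "cat", "sat", "the"]

def make_bow_vector(sentence, word_to_ix):
    vec = torch.zeros(len(word_to_ix))                          # start from zeros
    idxs = torch.LongTensor([word_to_ix[w] for w in sentence])
    # Accumulate counts with tensor indexing instead of filling a Python list;
    # index_add_ handles repeated words correctly.
    vec.index_add_(0, idxs, torch.ones(len(idxs)))
    return vec.view(1, -1)

print(make_bow_vector(sentence, word_to_ix))  # tensor([[2., 1., 1., 0.]])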


Hi, thanks for the comments! I will update it when I get the chance.


@rguthrie3, thanks for the amazing tutorial!

By any chance, did you write down a solution for the pretrained embeddings exercise?
Best,
D

I put an implementation of CBOW here. Please try to finish it yourself before checking others’ solutions!


Hi zhutaoi, it is very nice of you to post your implementation, but I have some questions about it.

In your code, you defined your CBOW model the same way as the author’s NGramLanguageModeler and changed the context size during training. I believe this can work, but I don’t think it matches the definition of the CBOW model, which is A * sum(q_w) + b. I think you can drop the context size and sum the context embeddings before feeding them into the linear layer.
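
If I read the exercise right, the objective the tutorial gives for CBOW is roughly

-\log p(w_i \mid C) = -\log \text{Softmax}\Big( A \big( \sum_{w \in C} q_w \big) + b \Big)

so the context embeddings are summed into a single vector before the linear layer, and the context size itself never appears in the model’s parameters.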

I am not working in NLP, so if I misunderstood the model or made a mistake, please point it out; I am happy to discuss it with you.

Thanks!

I think it should be along these lines:

import torch
import torch.nn as nn
import torch.nn.functional as F


class CBOW(nn.Module):

    def __init__(self, vocab_size, embedding_dim):
        super(CBOW, self).__init__()
        self.embedding = nn.Embedding(num_embeddings=vocab_size,
                                      embedding_dim=embedding_dim)
        self.linear = nn.Linear(in_features=embedding_dim,
                                out_features=vocab_size)

    def forward(self, x):
        # embed the 4 context words into, say, 10 dimensions, then sum them
        # along dim=0 and reshape to a 1 x 10 batch of a single example
        embedding = self.embedding(x).sum(dim=0).view(1, -1)
        out = self.linear(embedding)
        out = F.log_softmax(out, dim=1)
        return out

But I’m getting RuntimeError: index out of range at /py/conda-bld/pytorch_1493674854206/work/torch/lib/TH/generic/THTensorMath.c:273 and I don’t know why.

That’s because of the way you generate the word_to_ix dict. In the code the author provided, he generated the dict as:

word_to_ix = {word: i for i, word in enumerate(raw_text)}

Note that here he enumerates over raw_text, not vocab. I guess that is why you get an index-out-of-range error. You can either change raw_text to vocab, or set vocab_size to the length of raw_text. Hope this addresses your issue.
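
Concretely, the first option would look something like this (keeping the rest of the snippet the same):

vocab = set(raw_text)
vocab_size = len(vocab)
word_to_ix = {word: i for i, word in enumerate(vocab)}  # enumerate vocab, not raw_text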


Ah, right! Of course, it makes more sense to enumerate over vocab for the embedding later.

Thank you very much; this is really good for starters. However, as I am new to PyTorch, I am looking for a tutorial that covers sparse operations, since I am dealing with one-hot vectors. Please point me to such tutorials if you know of any.

Sincerely,

I am new to PyTorch and learning NLP/deep learning.
I was going through the CBOW model mentioned here and the explanation on the tutorial page/exercise (here). In the former, two matrices are learned, while in the latter we only learn the word embeddings and the A and B parameters. I think both are saying the same thing, but I couldn’t understand how.

I implemented the CBOW exercise (my code is below). Please let me know if it looks okay.


import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.autograd as autograd

CONTEXT_SIZE = 2
EMBED_SIZE = 10
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

vocab = set(raw_text)

word_to_idx = {word: i for i, word in enumerate(vocab)}
idx_to_word = {i: word for i, word in enumerate(vocab)}
print(word_to_idx)
context_target = [([raw_text[i - 2], raw_text[i - 1], raw_text[i + 1], raw_text[i + 2]], raw_text[i])
                  for i in range(2, len(raw_text) - 2)]


class CBOWClassifier(nn.Module):

    def __init__(self, vocab_size, embed_size, context_size):
        super(CBOWClassifier, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_size)
        self.linear1 = nn.Linear(embed_size, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embed = self.embeddings(inputs)               # (context_size, embed_size)
        embed = torch.sum(embed, dim=0).view(1, -1)   # sum the context embeddings -> (1, embed_size)
        out = self.linear1(embed)
        out = F.relu(out)
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


VOCAB_SIZE = len(word_to_idx)

model = CBOWClassifier(VOCAB_SIZE, EMBED_SIZE, 2 * CONTEXT_SIZE)
losses = []
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(100):
    total_loss = torch.Tensor([0])
    for context, target in context_target:
        context_idx = [word_to_idx[w] for w in context]
        context_var = autograd.Variable(torch.LongTensor(context_idx))
        model.zero_grad()
        log_probs = model(context_var)
        target_idx = word_to_idx[target]
        loss = loss_function(log_probs, autograd.Variable(torch.LongTensor([target_idx])))
        loss.backward()
        optimizer.step()
        total_loss = total_loss + loss.data
    losses.append(total_loss)

print(losses)


Hello,

I also implemented the CBOW model as follows:

The loss is decreasing, but how many epochs are needed to get the output for the CBOW exercise?

I’ve been reading this tutorial and would like to ask why use this line:
hello_embed = embeds(autograd.Variable(lookup_tensor))

instead of
hello_embed = embeds(Variable(lookup_tensor))

In other words, why qualify Variable with the autograd module?

Because type(hello_embed) for the two lines produces the same result (<class 'torch.autograd.variable.Variable'>).
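
For what it’s worth, here is a quick check (assuming both import torch.autograd as autograd and from torch.autograd import Variable are in scope):

import torch
import torch.autograd as autograd
from torch.autograd import Variable

lookup_tensor = torch.LongTensor([0])

# Both spellings refer to the exact same class; only the qualification differs
print(autograd.Variable is Variable)                           # True
print(isinstance(Variable(lookup_tensor), autograd.Variable))  # True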

Thanks!

@rguthrie3 Hi, I saw you don’t have a language model example… I am working on a clean implementation of a word-level language model… I guess you might know a bit about this question: Why is Hidden Variable out of Network Class in Pytorch examples Language Model? Please take a look! Thanks for the tutorial, by the way; they are usually very helpful :slight_smile:

Does anyone see any issues with my implementation?

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

EMBEDDING_DIM = 10

raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])


class CBOW(nn.Module):

    def __init__(self, vocab_size, embedding_dim):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)        
              

    def forward(self, inputs):  
        embeds = self.embeddings(inputs).sum(0).view((1,-1))
        out = self.linear1(F.relu(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs
    
losses = []
loss_function = nn.NLLLoss()
model = CBOW(len(vocab), EMBEDDING_DIM)
optimizer = optim.SGD(model.parameters(), lr=0.001)

# create your model and train.  here are some functions to help you make
# the data ready for use by your module


def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


for epoch in range(10): # train a bit
    total_loss = torch.Tensor([0])
    for context, target in data:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in variables)
        context_idxs = make_context_vector(context, word_to_ix)  # already a LongTensor; no need to re-wrap

        
        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a variable)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the gradient
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)

# keep training
epoch_count = 10
print("Training until loss is less than 1..")
while losses[-1] >= 1: # go until seriously overfitting :)
    total_loss = torch.Tensor([0])
    for context, target in data:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in variables)
        context_idxs = make_context_vector(context, word_to_ix)  # already a LongTensor; no need to re-wrap

        
        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a variable)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the gradient
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
    epoch_count += 1
    
print("Final loss of %0.4f in %d epochs" % (float(losses[-1]), epoch_count))

# Test
correct = 0
for context, target in data:
    context_idxs = make_context_vector(context, word_to_ix)  # already a LongTensor; no need to re-wrap
    log_probs = model(context_idxs)
    _, ix = torch.max(log_probs, 1)
    prediction = next(key for key, value in word_to_ix.items() if value == int(ix))
    correct += target == prediction
    
accuracy = correct / len(data)
print("Average accuracy:", accuracy)
Training until loss is less than 1..
Final loss of 0.9996 in 1436 epochs
Average accuracy: 1.0

What concerned me was that the template defines CONTEXT_SIZE rather than EMBEDDING_DIM, but the equation looks like it sums over the 2 * CONTEXT_SIZE context words, so this parameter should not actually be required by the model. Maybe it’s just a typo?
