NLP in Pytorch Tutorial

Hi, I have been working on a tutorial as a fast introduction to deep learning for NLP with PyTorch. I feel that the current tutorials focus mostly on CV. There are some NLP examples out there, but I didn’t find anything for beginners (which is what I am looking for, since we are using PyTorch for an NLP class I am TA’ing). So I wrote a tutorial. It assumes NLP knowledge and familiarity with neural nets, but not with deep learning programming.

I wanted to post the tutorial here to get feedback and also because I figure it may be helpful to some people. There are some quick explanations and a lot of code, with a few working examples (nothing state of the art, just things to get an idea). I still need to add a BiLSTM-CRF tagger example for NER, which will be the most complicated one. Here’s the link.

If you look at it, I’m happy to get any feedback. I want it to be useful to the students in my class.


Also check out https://github.com/spro/practical-pytorch, which has some NLP tutorials.


Hey, nice work and thanks for sharing!

I have some minor suggestions:

  1. In make_bow_vector (the cell evaluated as 91), create vec using torch.zeros(len(word_to_idx)).
  2. I’d mention that NLLLoss expects log-probabilities, but that you could also use CrossEntropyLoss if you removed the log_softmax (see the sketch just after this list).
  3. I’d split the log_probs line in cell 101 into a few more lines; it’s not very readable with that indentation.
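
For point 2, here is a minimal sketch of that equivalence (the scores and targets are made up just for illustration):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up scores for a batch of 2 examples over a 5-word vocab
scores = torch.randn(2, 5)
targets = torch.tensor([1, 3])

# NLLLoss expects log-probabilities, so it is paired with log_softmax ...
loss_nll = nn.NLLLoss()(F.log_softmax(scores, dim=1), targets)
# ... while CrossEntropyLoss applies log_softmax internally on raw scores
loss_ce = nn.CrossEntropyLoss()(scores, targets)

print(torch.allclose(loss_nll, loss_ce))  # True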

Also, I’d recommend using tensor indexing to create the BoW vectors, as that will likely be faster than iterating over a list in the tensor constructor.
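
For example, something along these lines (the vocabulary and sentence here are made up, not from the tutorial):

import torch

word_to_ix = {"the": 0, "cat": 1, "sat": 2, "mat": 3}  # toy vocab for illustration
sentence = ["the", "cat", "sat", "the"]

def make_bow_vector(sentence, word_to_ix):
    vec = torch.zeros(len(word_to_ix))                          # start from zeros
    idxs = torch.LongTensor([word_to_ix[w] for w in sentence])
    # Accumulate counts with tensor indexing instead of filling a Python list;
    # index_add_ handles repeated words correctly.
    vec.index_add_(0, idxs, torch.ones(len(idxs)))
    return vec.view(1, -1)

print(make_bow_vector(sentence, word_to_ix))  # tensor([[2., 1., 1., 0.]])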


Hi, thanks for the comments! I will update it when I get the chance.


@rguthrie3, thanks for the amazing tutorial!

By any chance, did you write down a solution for the pretrained embeddings exercise?
Best,
D

I put an implementation of CBOW here. Please try to finish it yourself before checking others’ solutions!


Hi zhutaoi, it is very nice of you to post your implementation, but I have some questions about it.

In your code, you defined your CBOW model the same way as the author’s NGramLanguageModeler and changed the context size during training. I believe this can work, but I don’t think it matches the definition of the CBOW model, which is A * sum(q_w) + b. I think you can drop the context size and sum the context embeddings before feeding them into the linear layer.
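
If I read the exercise right, the objective the tutorial gives for CBOW is roughly

-\log p(w_i \mid C) = -\log \text{Softmax}\Big( A \big( \sum_{w \in C} q_w \big) + b \Big)

so the context embeddings are summed into a single vector before the linear layer, and the context size itself never appears in the model’s parameters.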

I am not working in NLP, so if I misunderstood the model or made a mistake, please point it out; I am happy to discuss it with you.

Thanks!

I think it should be along these lines:

import torch
import torch.nn as nn
import torch.nn.functional as F


class CBOW(nn.Module):

    def __init__(self, vocab_size, embedding_dim):
        super(CBOW, self).__init__()
        self.embedding = nn.Embedding(num_embeddings=vocab_size,
                                      embedding_dim=embedding_dim)
        self.linear = nn.Linear(in_features=embedding_dim,
                                out_features=vocab_size)

    def forward(self, x):
        # embed the 4 context words into, say, 10 dimensions, then sum them
        # along dim=0 and reshape to a 1 x 10 batch of a single example
        embedding = self.embedding(x).sum(dim=0).view(1, -1)
        out = self.linear(embedding)
        out = F.log_softmax(out, dim=1)
        return out

But I’m getting RuntimeError: index out of range at /py/conda-bld/pytorch_1493674854206/work/torch/lib/TH/generic/THTensorMath.c:273 and I don’t know why.

That’s because of the way you generate the word_to_ix dict. In the code the author provided, he generated the dict as:

word_to_ix = {word: i for i, word in enumerate(raw_text)}

Note that here he enumerates over raw_text, not vocab. I guess that is why you get an index-out-of-range error. You can either change raw_text to vocab, or set vocab_size to the length of raw_text. Hope this addresses your issue.
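
Concretely, the first option would look something like this (keeping the rest of the snippet the same):

vocab = set(raw_text)
vocab_size = len(vocab)
word_to_ix = {word: i for i, word in enumerate(vocab)}  # enumerate vocab, not raw_text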


Ah, right! Of course, it makes more sense to enumerate over vocab for the embedding later.

Thank you very much; this is really good for starters. However, as I am new to PyTorch, I am looking for a tutorial that covers sparse operations, since I am dealing with one-hot vectors. Please point me to such tutorials if you know of any.

Sincerely,

I am new to PyTorch and learning NLP/deep learning.
I was going through the CBOW model mentioned here and the explanation on the tutorial page/exercise (here). In the former, two matrices are learned, while in the latter we only learn the word embeddings and the A and B parameters. I think both are saying the same thing, but I couldn’t understand how.

I implemented the CBOW exercise (my code is below). Please let me know if it looks okay.


import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.autograd as autograd

CONTEXT_SIZE = 2
EMBED_SIZE = 10
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

vocab = set(raw_text)

word_to_idx = {word: i for i, word in enumerate(vocab)}
idx_to_word = {i: word for i, word in enumerate(vocab)}
print(word_to_idx)
context_target = [([raw_text[i - 2], raw_text[i - 1], raw_text[i + 1], raw_text[i + 2]], raw_text[i])
                  for i in range(2, len(raw_text) - 2)]


class CBOWClassifier(nn.Module):

    def __init__(self, vocab_size, embed_size, context_size):
        super(CBOWClassifier, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_size)
        self.linear1 = nn.Linear(embed_size, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embed = self.embeddings(inputs)               # (context_size, embed_size)
        embed = torch.sum(embed, dim=0).view(1, -1)   # sum the context embeddings -> (1, embed_size)
        out = self.linear1(embed)
        out = F.relu(out)
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


VOCAB_SIZE = len(word_to_idx)

model = CBOWClassifier(VOCAB_SIZE, EMBED_SIZE, 2 * CONTEXT_SIZE)
losses = []
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(100):
    total_loss = torch.Tensor([0])
    for context, target in context_target:
        context_idx = [word_to_idx[w] for w in context]
        context_var = autograd.Variable(torch.LongTensor(context_idx))
        model.zero_grad()
        log_probs = model(context_var)
        target_idx = word_to_idx[target]
        loss = loss_function(log_probs, autograd.Variable(torch.LongTensor([target_idx])))
        loss.backward()
        optimizer.step()
        total_loss = total_loss + loss.data
    losses.append(total_loss)

print(losses)


Hello,

I also implemented the CBOW model as follows:

The loss is decreasing, but how many epochs are needed to get the output for the CBOW exercise?

I’ve been reading this tutorial and would like to ask why use this line:
hello_embed = embeds(autograd.Variable(lookup_tensor))

instead of
hello_embed = embeds(Variable(lookup_tensor))

In other words, why qualify Variable with the autograd module?

Because type(hello_embed) for the two lines produces the same result (<class 'torch.autograd.variable.Variable'>).
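
For what it’s worth, here is a quick check (assuming both import torch.autograd as autograd and from torch.autograd import Variable are in scope):

import torch
import torch.autograd as autograd
from torch.autograd import Variable

lookup_tensor = torch.LongTensor([0])

# Both spellings refer to the exact same class; only the qualification differs
print(autograd.Variable is Variable)                           # True
print(isinstance(Variable(lookup_tensor), autograd.Variable))  # True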

Thanks!

@rguthrie3 Hi, I saw you don’t have a language model example… I am working on a clean implementation of a word-level language model… I guess you might know a bit about this question: Why is Hidden Variable out of Network Class in Pytorch examples Language Model? Please take a look! Thanks for the tutorial, by the way; they are usually very helpful :slight_smile:

Does anyone see any issues with my implementation?

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

EMBEDDING_DIM = 10

raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])


class CBOW(nn.Module):

    def __init__(self, vocab_size, embedding_dim):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)        
              

    def forward(self, inputs):  
        embeds = self.embeddings(inputs).sum(0).view((1,-1))
        out = self.linear1(F.relu(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs
    
losses = []
loss_function = nn.NLLLoss()
model = CBOW(len(vocab), EMBEDDING_DIM)
optimizer = optim.SGD(model.parameters(), lr=0.001)

# create your model and train.  here are some functions to help you make
# the data ready for use by your module


def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


for epoch in range(10): # train a bit
    total_loss = torch.Tensor([0])
    for context, target in data:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in variables)
        context_idxs = make_context_vector(context, word_to_ix)  # already a LongTensor; no need to re-wrap

        
        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a variable)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the gradient
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)

# keep training
epoch_count = 10
print("Training until loss is less than 1..")
while losses[-1] >= 1: # go until seriously overfitting :)
    total_loss = torch.Tensor([0])
    for context, target in data:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in variables)
        context_idxs = make_context_vector(context, word_to_ix)  # already a LongTensor; no need to re-wrap

        
        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a variable)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the gradient
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
    epoch_count += 1
    
print("Final loss of %0.4f in %d epochs" % (float(losses[-1]), epoch_count))

# Test
correct = 0
for context, target in data:
    context_idxs = make_context_vector(context, word_to_ix)  # already a LongTensor; no need to re-wrap
    log_probs = model(context_idxs)
    _, ix = torch.max(log_probs, 1)
    prediction = next(key for key, value in word_to_ix.items() if value == int(ix))
    correct += target == prediction
    
accuracy = correct / len(data)
print("Average accuracy:", accuracy)
Training until loss is less than 1..
Final loss of 0.9996 in 1436 epochs
Average accuracy: 1.0

What concerned me was that the template defines CONTEXT_SIZE rather than EMBEDDING_DIM, but the equation looks like it sums over the 2 * CONTEXT_SIZE context words, so this parameter should not actually be required by the model. Maybe it’s just a typo?
