Hello, I’m trying to implement a DataLoader for batch processing in a CNN-LSTM NLP model.
This is my first time using PyTorch, so I’m feeling quite lost and would appreciate some direction.
Suppose my training loop looks like this:
for epoch in range(n_epochs):
    epoch_loss = 0.0
    interval_loss = 0.0
    for batch_idx, train_data in enumerate(train_loader):  # renamed from iter to avoid shadowing the built-in
        words, labels, chars, word_lengths = train.encode(train_data, cuda=cuda)
        optimizer.zero_grad()
        preds = model(words, chars, word_lengths)
        loss = F.cross_entropy(preds, labels)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        interval_loss += loss.item()
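For what it’s worth, here is my guess at how this inner loop would change once the DataLoader yields already-encoded, padded batches (the device transfer would move out of encode and into the loop; see the Dataset/collate_fn sketch at the end of this post). I’m not sure this is idiomatic:

for batch_idx, (words, labels, chars, word_lengths) in enumerate(train_loader):
    if cuda:  # move the whole padded batch to the GPU in one go
        words, labels, chars = words.cuda(), labels.cuda(), chars.cuda()
    optimizer.zero_grad()
    preds = model(words, chars, word_lengths)
    # assuming the model returns (batch, seq_len, n_tags): flatten both so
    # cross_entropy sees (N, C) vs (N,); padding positions are not yet masked
    loss = F.cross_entropy(preds.view(-1, preds.size(-1)), labels.view(-1))
    loss.backward()
    optimizer.step()
    epoch_loss += loss.item()
    interval_loss += loss.item()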
Originally, the train.encode function looks like this (it takes in only one sentence at a time):
def encode(self, i, cuda=True):
    # Encode the sentence at index i.
    sentence = self.corpus_original[i]
    sentence_lowered = self.corpus_lowered[i]
    sentence_tags = self.tags[i]
    words = torch.zeros(len(sentence), dtype=torch.long)
    labels = torch.zeros(len(sentence), dtype=torch.long)
    chars = torch.zeros(len(sentence), self.max_word_len, dtype=torch.long)
    word_lengths = np.zeros(len(sentence), dtype=int)  # change to a tensor if needed
    # TODO: could also feed the non-lowered words to the RNN for training,
    # i.e. keep both words_lowered and words_original.
    for j in range(len(sentence)):
        labels[j] = self.tag2idx[sentence_tags[j]]
        words[j] = self.word2idx[sentence_lowered[j]]
        word_lengths[j] = len(sentence[j])
        for k, c in enumerate(sentence[j]):  # for each character
            chars[j, k] = self.char2idx[c]
    if cuda:
        words = words.cuda()
        labels = labels.cuda()
        chars = chars.cuda()
    return words, labels, chars, word_lengths
My goal now is to take in many sentences at once, for example 32. My question is: will the DataLoader output a tensor with 32 entries along a new batch dimension, one per sentence? If so, how should I write the encode function so that I am utilising multiple cores (i.e. num_workers)?
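From my reading of the documentation, I suspect the answer involves a Dataset whose __getitem__ encodes a single sentence, plus a custom collate_fn that pads the variable-length sentences into one batch, so that num_workers can parallelise the encoding across processes. Here is a rough sketch of what I have in mind; SentenceDataset and pad_collate are names I made up, and I’m not at all sure the padding is correct:

import torch
from torch.utils.data import Dataset, DataLoader

class SentenceDataset(Dataset):
    def __init__(self, train):
        self.train = train  # the object that owns encode()

    def __len__(self):
        return len(self.train.corpus_original)

    def __getitem__(self, i):
        # Encode one sentence; this runs in a worker process, so no .cuda()
        # here (the device transfer would happen in the training loop instead).
        return self.train.encode(i, cuda=False)

def pad_collate(batch):
    # batch is a list of (words, labels, chars, word_lengths) tuples,
    # one per sentence, each with a different sentence length.
    words, labels, chars, word_lengths = zip(*batch)
    batch_size = len(batch)
    max_len = max(w.size(0) for w in words)
    max_word_len = chars[0].size(1)
    # zero-padded batch tensors; index 0 is assumed to be the padding index
    words_pad = torch.zeros(batch_size, max_len, dtype=torch.long)
    labels_pad = torch.zeros(batch_size, max_len, dtype=torch.long)
    chars_pad = torch.zeros(batch_size, max_len, max_word_len, dtype=torch.long)
    lengths_pad = torch.zeros(batch_size, max_len, dtype=torch.long)
    for b, (w, lab, c, wl) in enumerate(zip(words, labels, chars, word_lengths)):
        n = w.size(0)
        words_pad[b, :n] = w
        labels_pad[b, :n] = lab
        chars_pad[b, :n] = c
        lengths_pad[b, :n] = torch.as_tensor(wl)
    return words_pad, labels_pad, chars_pad, lengths_pad

train_loader = DataLoader(SentenceDataset(train), batch_size=32, shuffle=True,
                          num_workers=4, collate_fn=pad_collate)

Is this roughly the right direction, or is there a more idiomatic way to do it?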
I apologise if my questions are very amateurish; I’ve tried my best to read other implementations and the documentation, but I’m still feeling hopelessly lost.