How to predict a single sample on a trained LSTM model

Hello there,
I was reading an interesting blog post on parsing addresses by training a recurrent neural network with PyTorch.

The blog post references a Google Colab Jupyter notebook (https://colab.research.google.com/github/dot-layer/blog/blob/master/content/blog/2020-08-19-train-a-sequence-model-with-poutyne/article_notebook_colab.ipynb).

I want to predict the classes for a single address with the model trained in the notebook, but I don’t know how. Help is very much appreciated!

@roy
The following code does it:

full_network.to(device)
full_network.eval()

tags_set = {
    "StreetNumber": 0,
    "StreetName": 1,
    "Unit": 2,
    "Municipality": 3,
    "Province": 4,
    "PostalCode": 5,
    "Orientation": 6,
    "GeneralDelivery": 7
}
idx_to_tag = {idx: tag for tag, idx in tags_set.items()}  # reverse mapping: index -> tag name

test_sent = '35 r de percé gatineau qc j8r 2e6'
test_sent_vec = embedding_vectorizer(test_sent)  # list of 300-d word vectors
test_sent_vectorizer = torch.tensor([test_sent_vec], dtype=torch.float32).to(device)  # (1, seq_len, 300)
test_sent_vectorizer_len = torch.tensor([test_sent_vectorizer.size()[1]], dtype=torch.long).to(device)  # (1,)

with torch.no_grad():
  test_sent_res = full_network(test_sent_vectorizer, test_sent_vectorizer_len)
  out = test_sent_res.cpu()[0]    # shape (8, seq_len): one score per tag for each word position
  out = torch.argmax(out, dim=0)  # most probable tag index for each word
  res = [idx_to_tag[c.item()] for c in out]

  print(f'Predicted: {res}')

P.S. I had to run and train the notebook code first and then work on the snippet above :sweat:

Great, this is exactly what I meant.
It works! I know what you mean about running and training the code :smile: and it is very much appreciated.
Thanks for the great effort!

Did you train with PyTorch or Poutyne? I did it with Poutyne. What would the training loop look like with PyTorch? I tried the following function, which does not work…

def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        loss, current = loss.item(), batch * len(X)
        print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

I use PyTorch Lightning (which I would personally recommend over Poutyne etc.)

You need to be more specific when you say 'does not work': what is the error? Share a Colab/notebook link and I will be able to debug.

Thanks for the tip. I will also try PyTorch Lightning. I’ve added the Colab notebook code from the article below. The PyTorch training loop code I added is from the "Optimizing Model Parameters" page of the PyTorch Tutorials (1.9.0+cu102) documentation.

You can see the error message I got, just before the Poutyne experiment:

TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>()
      5 for t in range(epochs):
      6     print(f"Epoch {t+1}\n-------------------------------")
----> 7     train_loop(train_loader, full_network, loss_fn, optimizer)
      8     test_loop(test_loader, full_network, loss_fn)
      9 print("Done!")

<ipython-input> in train_loop(dataloader, model, loss_fn, optimizer)
      3     for batch, (X, y) in enumerate(dataloader):
      4         # Compute prediction and loss
----> 5         pred = model(X)
      6         loss = loss_fn(pred, y)
      7

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

TypeError: forward() missing 1 required positional argument: 'lengths'


Python Notebook Viewer

In this article, we will train an RNN, or more precisely, an LSTM, to predict the sequence of tags associated with a given address, a task known as address parsing.

Also, the article is available in a Jupyter Notebook or in a Google Colab Jupyter notebook.

Before starting this article, we would like to mention that this tutorial is greatly inspired by an online tutorial David created for the Poutyne framework. Also, the content is based on a recent article we wrote about address tagging. However, there are differences between the present work and the two others, as this one is specifically designed for the less technical reader.

Sequential data, such as addresses, are pieces of information that are deliberately given in a specific order. In other words, they are sequences with a particular structure, and knowing this structure is crucial for predicting the missing entries of a given truncated sequence. For example, when writing an address, we know, in Canada, that after the civic number (e.g. 420) comes the street name (e.g. du Lac). Hence, if one is asked to complete an address containing only a number, one can reasonably assume that the next piece of information to add to the sequence is a street name. Various modelling approaches have been proposed to make predictions over sequential data, but more recently, deep learning models known as recurrent neural networks (RNNs) have been introduced for this type of data.

The main purpose of this article is to introduce the various tricks (e.g., padding and packing) that are required for training an RNN. Before we do that, let us define our “address” problem more formally and elaborate on what RNNs (and LSTMs) actually are.
Address Tagging

Address tagging is the task of detecting and tagging the different parts of an address, such as the civic number, the street name or the postal code (or zip code). The following figure shows an example of such tagging.

[Figure: example of a parsed Canadian address, with each word assigned its tag]

For our purpose, we define 8 pertinent tags that can be found in an address: [StreetNumber, StreetName, Orientation, Unit, Municipality, Province, PostalCode, GeneralDelivery].

Since addresses are sequences of arbitrary length, where a word’s index does not mean as much as its position relative to the others, one can hardly rely on a simple fully connected neural network for address tagging. A dedicated type of neural network was specifically designed for this kind of task involving sequential data: the RNN.
Recurrent Neural Network (RNN)

In brief, an RNN is a neural network in which connections between nodes form a temporal sequence, which means that this type of network allows previous outputs to be used as inputs for the next prediction. For more information regarding RNNs, have a look at Stanford’s freely available cheatsheet.

For our purpose, we do not use the vanilla RNN but a widely used variant of it known as the long short-term memory (LSTM) network. The latter, which involves components called gates, is often preferred over its competitors due to its better stability with respect to gradient updates (vanishing and exploding gradients). To learn more about LSTMs, see here for an in-depth explanation.

For now, let’s simply use a single-layer unidirectional LSTM. We will, later on, explore the use of more layers and a bidirectional approach.
Word Embeddings

Since our data is text, we will use a well-known text encoding technique: word embeddings. Word embeddings are vector representations of words. The main hypothesis underlying their use is that there exist linear relations between words. For example, the linear relation between the words king and queen is gender. So, logically, if we subtract the vector corresponding to male from the one for king, and then add the vector for female, we should obtain the vector corresponding to queen (i.e. king - male + female = queen). That being said, this kind of representation usually lives in high dimensions such as 300, which makes it impossible for humans to reason about it. Neural networks, on the other hand, can efficiently make use of these implicit relations despite their high dimensionality.

We therefore fix our LSTM’s input and hidden state dimensions to the same sizes as the vectors of embedded words. For the present purpose, we will use the French pre-trained fastText embeddings of dimension 300.
The PyTorch Model

Let us first import all the necessary packages.

%pip install --upgrade poutyne #install poutyne
%pip install --upgrade colorama #install colorama
%pip install --upgrade pymagnitude-light #install pymagnitude-light
%matplotlib inline

import gzip
import os
import pickle
import shutil
import warnings

import requests
import torch
import torch.nn as nn
import torch.optim as optim
from poutyne import set_seeds
from poutyne.framework import Experiment
from pymagnitudelight import Magnitude
from torch.nn.functional import cross_entropy
from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence, pad_sequence
from torch.utils.data import DataLoader

Collecting poutyne
  Downloading Poutyne-1.5-py3-none-any.whl (136 kB)
Requirement already satisfied: torch in /usr/local/lib/python3.7/dist-packages (from poutyne) (1.9.0+cu102)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from poutyne) (1.19.5)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from torch->poutyne) (3.7.4.3)
Installing collected packages: poutyne
Successfully installed poutyne-1.5
Collecting colorama
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Installing collected packages: colorama
Successfully installed colorama-0.4.4
Collecting pymagnitude-light
  Downloading pymagnitude_light-0.1.147-py3-none-any.whl (35 kB)
Collecting fasteners>=0.14.1
  Downloading fasteners-0.16.3-py2.py3-none-any.whl (28 kB)
Collecting xxhash>=1.0.1
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
Collecting lz4
Requirement already satisfied: numpy>=1.14.0 in /usr/local/lib/python3.7/dist-packages (from pymagnitude-light) (1.19.5)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from fasteners>=0.14.1->pymagnitude-light) (1.15.0)
Installing collected packages: xxhash, lz4, fasteners, pymagnitude-light
Successfully installed fasteners-0.16.3 lz4-3.1.3 pymagnitude-light-0.1.147 xxhash-2.0.2

Now, let’s create a single (i.e. one layer) unidirectional LSTM with input_size and hidden_size of 300. We will explore later on the effect of stacking more layers and using a bidirectional approach.

See here why we use the batch_first argument.

dimension = 300
num_layer = 1
bidirectional = False

lstm_network = nn.LSTM(input_size=dimension,
                       hidden_size=dimension,
                       num_layers=num_layer,
                       bidirectional=bidirectional,
                       batch_first=True)
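
As a quick, purely illustrative sanity check (not part of the original notebook), you can feed a random batch through lstm_network to see what batch_first=True implies for the shapes:

# A dummy batch of 2 "addresses" of 5 words each, with 300-d word vectors.
# With batch_first=True, the expected input shape is (batch, seq_len, features).
dummy_batch = torch.randn(2, 5, dimension)
dummy_out, (h_n, c_n) = lstm_network(dummy_batch)
print(dummy_out.shape)  # torch.Size([2, 5, 300]) -- one 300-d hidden state per word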

Fully-connected Layer

Since the output of the LSTM network is of dimension 300, we will use a fully-connected layer to map it into a space whose dimension equals the number of tags to predict, that is, 8. Finally, since we want to predict the most probable tag, we will apply the softmax function to this layer (see here if softmax does not ring a bell).

input_dim = dimension #the output of the LSTM
tag_dimension = 8

fully_connected_network = nn.Linear(input_dim, tag_dimension)
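
As another illustrative check (again, not in the original notebook), applying this layer to an LSTM-shaped output maps each 300-d vector to 8 tag scores:

# The linear layer acts on the last dimension, so each word gets its own 8 tag scores.
dummy_tag_scores = fully_connected_network(torch.randn(2, 5, dimension))
print(dummy_tag_scores.shape)  # torch.Size([2, 5, 8])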

Training Constants

Now, let’s set our training constants. We first specify a CUDA (GPU) device for training (training on a CPU takes way too long; if you don’t have a GPU, you can use the Google Colab notebook).

Second, we set the batch size (i.e. the number of elements to see before updating the model), the learning rate for the optimizer and the number of epochs.

device = torch.device("cuda:0")

batch_size = 128
lr = 0.1

epoch_number = 10

We also need to set Python’s, NumPy’s and PyTorch’s random seeds using the Poutyne function to make our training (almost) completely reproducible.

See here for an explanation on why setting seed does not guarantee complete reproducibility.

set_seeds(42)
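
If you train with plain PyTorch instead of Poutyne (as discussed earlier in this thread), a rough manual equivalent of set_seeds(42) would be the following sketch; Poutyne’s implementation may differ in its details:

import random

import numpy as np

random.seed(42)        # Python's built-in RNG
np.random.seed(42)     # NumPy's RNG
torch.manual_seed(42)  # PyTorch's CPU RNG
torch.cuda.manual_seed_all(42)  # PyTorch's CUDA RNGs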

The Dataset

The dataset consists of 1,010,987 complete French and English Canadian addresses and their associated tags. Here’s an example address

“420 rue des Lilas Ouest, Québec, G1V 2V3”

and its corresponding tags

[StreetNumber, StreetName, StreetName, StreetName, Orientation, Municipality, PostalCode, PostalCode].

Now let’s download our dataset. For simplicity, a 100,000-address test set is kept aside; 80% of the remaining addresses are used for training and 20% as a validation set. Also note that the dataset was pickled for simplicity (using a Python list). Here is the code to download it.

def download_data(saving_dir, data_type):
    """
    Function to download the dataset using data_type to specify if we want the train, valid or test.
    """

    # hardcoded url to download the pickled dataset
    root_url = "https://dot-layer.github.io/blog-external-assets/train_rnn/{}.p"

    url = root_url.format(data_type)
    r = requests.get(url)
    os.makedirs(saving_dir, exist_ok=True)

    open(os.path.join(saving_dir, f"{data_type}.p"), 'wb').write(r.content)

download_data('./data/', "train")
download_data('./data/', "valid")
download_data('./data/', "test")

Now let’s load the data in memory.

# load the data

train_data = pickle.load(open("./data/train.p", "rb"))  # 728,789 examples
valid_data = pickle.load(open("./data/valid.p", "rb"))  # 182,198 examples
test_data = pickle.load(open("./data/test.p", "rb"))  # 100,000 examples
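
These counts are consistent with the 80/20 split of the 910,987 non-test addresses described above; as a quick, purely arithmetical check (the exact splitting code is not shown in the article):

total_addresses = 1_010_987
test_size = 100_000
remaining = total_addresses - test_size  # 910,987
train_size = int(remaining * 0.8)        # 728,789
valid_size = remaining - train_size      # 182,198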

As explained before, the (train) dataset is a list of 728,789 tuples where the first element is the full address, and the second is a list of tags (the ground truth).

train_data[:2] # The first two train items

[('35 r de percé gatineau qc j8r 2e6',
  ['StreetNumber',
   'StreetName',
   'StreetName',
   'StreetName',
   'Municipality',
   'Province',
   'PostalCode',
   'PostalCode']),
 ('r bourque gatineau qc j8y 1x6',
  ['StreetName',
   'StreetName',
   'Municipality',
   'Province',
   'PostalCode',
   'PostalCode'])]

Vectorize the Dataset

Since we use word embeddings as the encoded representations of the words in the addresses, we need to convert the addresses into the corresponding word vectors. To do so, we will use a vectorizer (i.e. a component that converts words into vectors). This embedding vectorizer will extract, for each word, the embedding value based on the pre-trained French fastText model. We use French embeddings because French is the language in which most of the addresses in our dataset are written.

About the Magnitude fastText model: the original fastText model takes a lot of RAM (~9 GB). I came across Magnitude when I published a model that, with its embeddings, was too large to fit on a typical computer. The idea behind Magnitude is to convert the original vectors into a mapping between words (and subwords) and their vectors, stored in a local database. The conversion took about 8 hours, and the conversion script is broken for fastText embeddings. Magnitude is a little bit slower, but Google Colab doesn’t allow us to use more than 12 GB of RAM. One drawback of Magnitude is that, for a reason I don’t understand, I can’t make it work with multithreading on three different computers, although it works on Colab, even if the documentation says it should work easily.

def download_from_url(model: str, saving_dir: str, extension: str):
    """
    Simple function to download the content of a file from a distant repository.
    """
    print("Downloading the model.")
    model_url = "https://graal.ift.ulaval.ca/public/deepparse/{}." + extension
    url = model_url.format(model)
    r = requests.get(url)

    os.makedirs(saving_dir, exist_ok=True)
    open(os.path.join(saving_dir, f"{model}.{extension}"), "wb").write(r.content)

def download_fasttext_magnitude_embeddings(saving_dir):
    """
    Function to download the magnitude pre-trained fastText model.
    """
    model = "fasttext"
    extension = "magnitude"
    file_name = os.path.join(saving_dir, f"{model}.{extension}")
    if not os.path.isfile(file_name):
        warnings.warn("The fastText pre-trained word embeddings will be downloaded in magnitude format (2.3 GB), "
                      "this process will take several minutes.")
        extension = extension + ".gz"
        download_from_url(model=model, saving_dir=saving_dir, extension=extension)
        gz_file_name = file_name + ".gz"
        print("Unzip the model.")
        with gzip.open(os.path.join(saving_dir, gz_file_name), "rb") as f:
            with open(os.path.join(saving_dir, file_name), "wb") as f_out:
                shutil.copyfileobj(f, f_out)
        os.remove(os.path.join(saving_dir, gz_file_name))
    return file_name

class EmbeddingVectorizer:
    def __init__(self, path="./"):
        """
        Embedding vectorizer
        """
        file_name = download_fasttext_magnitude_embeddings(saving_dir=path)
        self.embedding_model = Magnitude(file_name)

    def __call__(self, address):
        """
        Convert address to embedding vectors
        :param address: The address to convert
        :return: The embeddings vectors
        """
        embeddings = []
        for word in address.split():
            embeddings.append(self.embedding_model.query(word))
        return embeddings

embedding_vectorizer = EmbeddingVectorizer()

/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:22: UserWarning: The fastText pre-trained word embeddings will be downloaded in magnitude format (2.3 GB), this process will take several minutes.
Downloading the model.
Unzip the model.
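
As a small illustrative check (not in the original notebook), each word of an address is mapped to a 300-d fastText vector:

# One vector per word; Magnitude's query() returns a NumPy array of shape (300,).
vectors = embedding_vectorizer("35 r de percé gatineau qc j8r 2e6")
print(len(vectors), vectors[0].shape)  # 8 (300,)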

We also need to apply a similar operation to the address tags (e.g. StreetNumber, StreetName). This time, however, the vectorizer needs to convert the tags into categorical values (e.g. StreetNumber → 0). For simplicity, we will use a DatasetBucket class that applies, at training time, both the embedding vectorization and the tag conversion we have just described.

class DatasetBucket:
    def __init__(self, data, embedding_vectorizer):
        self.data = data
        self.embedding_vectorizer = embedding_vectorizer
        self.tags_set = {
            "StreetNumber": 0,
            "StreetName": 1,
            "Unit": 2,
            "Municipality": 3,
            "Province": 4,
            "PostalCode": 5,
            "Orientation": 6,
            "GeneralDelivery": 7
        }

    def __len__(self):
        return len(self.data)

    def __getitem__(self, item):  # We vectorize when data is asked
        data = self.data[item]
        return self._item_vectorizing(data)

    def _item_vectorizing(self, item):
        address = item[0]
        address_vector = self.embedding_vectorizer(address)

        tags = item[1]
        idx_tags = self._convert_tags_to_idx(tags)

        return address_vector, idx_tags

    def _convert_tags_to_idx(self, tags):
        idx_tags = []
        for tag in tags:
            idx_tags.append(self.tags_set[tag])
        return idx_tags

train_dataset_vectorizer = DatasetBucket(train_data, embedding_vectorizer)
valid_dataset_vectorizer = DatasetBucket(valid_data, embedding_vectorizer)
test_dataset_vectorizer = DatasetBucket(test_data, embedding_vectorizer)

Here is an example of the vectorizing process.

address, tag = train_dataset_vectorizer[0] # Unpack the first tuple
print(f"The vectorized address is now a list of vectors {address}")

DataLoader

We use a first trick, padding.

Now, because the addresses are not all of the same length, it is impossible to batch them together directly; recall that all the elements of a tensor must have the same length. But there is a trick: padding!

The idea is simple; we add empty tokens at the end of each sequence until they reach the length of the longest one in the batch. For example, if we have three sequences of lengths $\{1, 3, 5\}$, padding will add 4 and 2 empty tokens to the first two, respectively.

For the word vectors, we add vectors of 0 as padding. For the tag indices, we pad with -100’s. We do so because the cross-entropy loss and the accuracy metric both ignore targets with values of -100.
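
As a tiny illustration of this padding (with made-up tag indices), the pad_sequence utility imported earlier does exactly that:

# Two tag sequences of lengths 3 and 1; the shorter one gets padded with -100.
tag_seqs = [torch.LongTensor([0, 1, 1]), torch.LongTensor([0])]
print(pad_sequence(tag_seqs, batch_first=True, padding_value=-100))
# tensor([[   0,    1,    1],
#         [   0, -100, -100]])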

To do the padding, we use the collate_fn argument of the PyTorch DataLoader, so the padding is done by the DataLoader at runtime. One thing to keep in mind when dealing with padded sequences is that their original lengths will be required to unpad them later on in the forward pass. That way, we can pad and pack the sequences to minimize the training time (read this good explanation of why we pack sequences).

def pad_collate_fn(batch):
    """
    The collate_fn that can add padding to the sequences so all can have
    the same length as the longest one.

    Args:
        batch (List[List, List]): The batch data, where the first element
        of the tuple is the word idx and the second element are the target
        label.

    Returns:
        A tuple (x, y). The element x is a tuple containing (1) a tensor of padded
        word vectors and (2) their respective original sequence lengths. The element
        y is a tensor of padded tag indices. The word vectors are padded with vectors
        of 0s and the tag indices are padded with -100s. Padding with -100 is done
        because the cross-entropy loss and the accuracy metric both ignore
        targets with value -100.
    """

    # This gets us two lists of tensors and a list of integers.
    # Each tensor in the first list is a sequence of word vectors.
    # Each tensor in the second list is a sequence of tag indices.
    # The list of integers consists of the lengths of the sequences in order.
    sequences_vectors, sequences_labels, lengths = zip(*[
        (torch.FloatTensor(seq_vectors), torch.LongTensor(labels), len(seq_vectors))
        for (seq_vectors, labels) in sorted(batch, key=lambda x: len(x[0]), reverse=True)
    ])

    lengths = torch.LongTensor(lengths)

    padded_sequences_vectors = pad_sequence(sequences_vectors, batch_first=True, padding_value=0)

    padded_sequences_labels = pad_sequence(sequences_labels, batch_first=True, padding_value=-100)

    return (padded_sequences_vectors, lengths), padded_sequences_labels

train_loader = DataLoader(train_dataset_vectorizer, batch_size=batch_size, shuffle=True, collate_fn=pad_collate_fn, num_workers=4)
valid_loader = DataLoader(valid_dataset_vectorizer, batch_size=batch_size, collate_fn=pad_collate_fn, num_workers=4)
test_loader = DataLoader(test_dataset_vectorizer, batch_size=batch_size, collate_fn=pad_collate_fn, num_workers=2)

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
cpuset_checked))
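
To see what the collate function produces, here is an illustrative peek at one batch (the padded sequence length depends on the longest address in that batch, so the shapes below are only an example):

(x_batch, lengths_batch), y_batch = next(iter(train_loader))
print(x_batch.shape, lengths_batch.shape, y_batch.shape)
# e.g. torch.Size([128, 9, 300]) torch.Size([128]) torch.Size([128, 9])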

Full Network

We use a second trick, packing.

Since our sequences are of variable lengths and we want to be as efficient as possible when packing them, we cannot use the PyTorch nn.Sequential class to define our model. Instead, we define the forward pass ourselves so that it uses packed sequences (again, you can read this good explanation of why we pack sequences).

class RecurrentNet(nn.Module):
    def __init__(self, lstm_network, fully_connected_network):
        super().__init__()
        self.hidden_state = None

        self.lstm_network = lstm_network
        self.fully_connected_network = fully_connected_network

    def forward(self, padded_sequences_vectors, lengths):
        """
        Defines the computation performed at every call.

        Shapes:
            padded_sequences_vectors: batch_size * longest_sequence_length (padding), 300
            lengths: batch_size
        """
        total_length = padded_sequences_vectors.shape[1]
        pack_padded_sequences_vectors = pack_padded_sequence(padded_sequences_vectors, lengths.cpu(), batch_first=True)

        lstm_out, self.hidden_state = self.lstm_network(pack_padded_sequences_vectors)
        lstm_out, _ = pad_packed_sequence(lstm_out, batch_first=True, total_length=total_length)

        tag_space = self.fully_connected_network(lstm_out)  # shape: batch_size * longest_sequence_length, 8 (tag space)
        return tag_space.transpose(-1, 1)  # we need to transpose since it's a sequence; shape: batch_size * 8, longest_sequence_length

full_network = RecurrentNet(lstm_network, fully_connected_network)
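
As an illustrative sanity check of the shapes (not part of the original notebook), a dummy forward pass looks like this; note that pack_padded_sequence expects the lengths in decreasing order, which pad_collate_fn already guarantees:

dummy_x = torch.randn(2, 5, dimension)    # 2 padded addresses of (at most) 5 words
dummy_lengths = torch.LongTensor([5, 3])  # original lengths, sorted in decreasing order
dummy_scores = full_network(dummy_x, dummy_lengths)
print(dummy_scores.shape)  # torch.Size([2, 8, 5]) -- (batch, tag space, sequence length)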

Summary

We have created an LSTM network (lstm_network) and a fully connected network (fully_connected_network), and we use both components in the full network. The full network makes use of padded-packed sequences, so we created the pad_collate_fn function to do the necessary work within the DataLoader. Finally, we load the data using the vectorizer (within the DataLoader, through pad_collate_fn). This means that the addresses are represented by word embeddings, and the address components are converted into categorical values (from 0 to 7).
The Training

Now that we have all the components for the network, let’s define our optimizer: stochastic gradient descent (SGD).

optimizer = optim.SGD(full_network.parameters(), lr)

def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        loss, current = loss.item(), batch * len(X)
        print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

def test_loop(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

loss_fn = nn.CrossEntropyLoss()
# optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

epochs = 2
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_loader, full_network, loss_fn, optimizer)
    test_loop(test_loader, full_network, loss_fn)
print("Done!")

Epoch 1

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
cpuset_checked))


TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>()
      5 for t in range(epochs):
      6     print(f"Epoch {t+1}\n-------------------------------")
----> 7     train_loop(train_loader, full_network, loss_fn, optimizer)
      8     test_loop(test_loader, full_network, loss_fn)
      9 print("Done!")

<ipython-input> in train_loop(dataloader, model, loss_fn, optimizer)
      3     for batch, (X, y) in enumerate(dataloader):
      4         # Compute prediction and loss
----> 5         pred = model(X)
      6         loss = loss_fn(pred, y)
      7

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

TypeError: forward() missing 1 required positional argument: 'lengths'

Your model expects two inputs if you consider this line

def forward(self, padded_sequences_vectors, lengths):

but in your train_loop function, you are only passing one argument

pred = model(X)

I would recommend the following changes:

In pad_collate_fn

return padded_sequences_vectors, lengths, padded_sequences_labels

in train_loop

for x, y, z in dataloader:
    pred = model(x, y)
    loss = loss_fn(pred, z)
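
Putting both changes together, a sketch of the corrected loop could look like this (untested; note that the tensors also have to be moved to the same device as the model, that nn.CrossEntropyLoss ignores the -100 padding targets by default, and that test_loop needs the same unpacking change, plus masking of the -100 positions in its accuracy computation):

def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    for batch, (x, lengths, y) in enumerate(dataloader):
        # Move the batch to the same device as the model
        x, lengths, y = x.to(device), lengths.to(device), y.to(device)

        # Compute prediction and loss; the model needs the padded vectors AND their lengths
        pred = model(x, lengths)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        loss_value, current = loss.item(), batch * x.size(0)
        print(f"loss: {loss_value:>7f}  [{current:>5d}/{size:>5d}]")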