Feeding data into LSTM with batches

I’m trying to build an NLP NER LSTM model, and my data looks like this:

input = [["i","am","here"], ["you","are","not","there"]]

and output:

output = [[1, 2, 1], [1, 2, 1, 2]]

As you can see, the lengths of the input tensors are not the same. This works for batch_size = 1, but increasing the batch size to anything bigger than 1 produces this error:

Traceback (most recent call last):
  File "stage_runner.py", line 28, in <module>
  File "stage_runner.py", line 24, in main
  File "/Users/arefghodamai/Desktop/Projects/key_value_extraction/src/dvc_dags/train_model.py", line 22, in train_model
    for sentence, tags in model_manager.data_loader:
  File "/usr/local/Cellar/python@3.7/3.7.10_3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/usr/local/Cellar/python@3.7/3.7.10_3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/Cellar/python@3.7/3.7.10_3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/usr/local/Cellar/python@3.7/3.7.10_3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 83, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/Cellar/python@3.7/3.7.10_3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 83, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/Cellar/python@3.7/3.7.10_3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)

Is there any way to make this right, besides padding?

Hello @Aref_Ghodamai1. You have to know that every sentence in a batch has to have the same length. So in your case ["i","am","here"] has a length of 3 while ["you","are","not","there"] has a length of 4, and this is the problem. What you can do is pad ["i","am","here"] to ["i","am","here","pad_tensor"] or clip ["you","are","not","there"] to ["you","are","not"].
The point is to make sure the sentences in a batch have the same length. Thank you
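If you don't want to pad the dataset up front, the padding can happen per batch instead: pass a custom collate_fn to the DataLoader and use torch.nn.utils.rnn.pad_sequence to pad each batch to its own longest sentence. A minimal sketch, assuming the words and tags have already been converted to index tensors (the index values below are invented for illustration, and -100 is just a common "ignore this position" label value):

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# toy data mirroring the question, already mapped to indices
# (the word-to-index mapping here is made up for illustration)
inputs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5, 6, 7])]
tags = [torch.tensor([1, 2, 1]), torch.tensor([1, 2, 1, 2])]

def collate(batch):
    sentences, labels = zip(*batch)
    # pad every sequence in the batch to the longest one in that batch
    sentences = pad_sequence(sentences, batch_first=True, padding_value=0)
    labels = pad_sequence(labels, batch_first=True, padding_value=-100)
    return sentences, labels

loader = DataLoader(list(zip(inputs, tags)), batch_size=2, collate_fn=collate)
for sentence, tag in loader:
    print(sentence.shape, tag.shape)  # both torch.Size([2, 4])
```

With this, batch_size > 1 no longer trips the default collate's torch.stack, because every tensor in the batch has the same shape by the time it is stacked.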

Pad, my brother. Or cut n-grams (look them up) sized to the minimum input length. Here’s what I did in my code; the batching function is separate.

# create sliding-window sequences of seq_len tokens
def create_seq(text, seq_len=5, pad_token="<pad>"):
    sequences = []
    words = text.split(' ')

    # if the number of tokens in 'text' is greater than seq_len,
    # slide a window of seq_len tokens over the sentence
    if len(words) > seq_len:
        for i in range(seq_len, len(words) + 1):
            # select a sequence of tokens
            seq = words[i - seq_len:i]
            # add to the list
            sequences.append(" ".join(seq))
    elif len(words) < seq_len:
        # pad short sentences up to seq_len
        seq = words + [pad_token] * (seq_len - len(words))
        sequences.append(" ".join(seq))
    else:
        # the number of tokens in 'text' is exactly seq_len
        sequences.append(" ".join(words))

    return sequences

seqs5 = sum([create_seq(i) for i in clean_lines], [])  # clean_lines is a big list of sentences (strings)
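Whichever way the batch gets padded, the pad positions don't have to pollute the LSTM's hidden states: pack_padded_sequence tells the LSTM the true length of each sentence so it skips the padding. A rough sketch, assuming a padded batch where 0 is the pad index (the layer sizes below are arbitrary for illustration):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# hypothetical sizes, just for illustration
vocab_size, emb_dim, hidden_dim, num_tags = 10, 8, 16, 3

embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
classifier = nn.Linear(hidden_dim, num_tags)

# a padded batch (0 is the pad index) and the real sentence lengths
batch = torch.tensor([[1, 2, 3, 0], [4, 5, 6, 7]])
lengths = torch.tensor([3, 4])

# pack so the LSTM only processes the real tokens of each sentence
packed = pack_padded_sequence(embedding(batch), lengths,
                              batch_first=True, enforce_sorted=False)
packed_out, _ = lstm(packed)
# unpack back to a padded tensor for the per-token tag classifier
out, _ = pad_packed_sequence(packed_out, batch_first=True)
logits = classifier(out)  # shape (batch, max_len, num_tags) = (2, 4, 3)
```

For the loss, masking the pad positions (e.g. via ignore_index in the loss function) keeps them from contributing to the gradient.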