BERT for multiple sentences

I know that [CLS] marks the start of a sequence and [SEP] tells BERT where the second sentence begins.

However, I have a question.

Suppose I have 3 sentences, s1, s2 and s3, and the fine-tuning task stays the same.

One way is to add the special tokens myself, so the input looks like [CLS] + s1 + [SEP] + s2 + [SEP] + s3 + [SEP].

from transformers import AutoTokenizer

model_name = 'allenai/scibert_scivocab_uncased'
model_name = 'bert-base-uncased'  # this second assignment overrides the first

tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenizer([["hello i am going to", "Rome"]])

This outputs correct 1-dimensional ids, i.e.:
{'input_ids': [102, 121, 1972, 705, 8303, 147, 758, 26241, 188, 9319, 24616, 422, 6474, 129, 9724, 650, 3978, 190, 9319, 24616, 13903, 372, 1942, 19226, 24648, 4931, 137, 3594, 17345, 190, 6666, 422, 334, 12605, 30113, 475, 506, 3437, 131, 633, 10945, 5458, 263, 17094, 205, 103]}

But when tokenizing three sentences like this:
tokenizer([["hello i am going", "to", "Rome"]])
it gives the following error:
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

How do we input such sequences to BERT, and is it justified to do so?

Thanks for any help!

Hi,

I tried to reproduce the error but could not. With your example, I get the following (no error):

# Input
tokenizer(["hello i am going","to","Rome"])

# Output
# {'input_ids': [[101, 7592, 1045, 2572, 2183, 102], [101, 2000, 102], [101, 4199, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0], [0, 0, 0], [0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1], [1, 1, 1]]}

Sorry, I missed one bracket. Edited the question.
What I want is a 1-dimensional list of ids for multiple sentences from the BERT tokenizer.

tl;dr:

Your approach is correct. I don't think you need to change it.

Long answer

If you wanted each sentence to be a batch, then you would do it as in the first post (before editing).

Here is more information about batches and preprocessing for the tokenizer.

tokenizer(["hello i am going","to","Rome"])

However, as you mentioned, you do NOT want this, but rather a 1-dimensional list of ids for multiple sentences.

If you go to the BertTokenizer and PreTrainedTokenizer documentation and look at the __call__ method, you will see which parameters it accepts.

The most relevant for us right now are text and text_pair. Both accept the same types described below.

text + text_pair

Both sentences are merged into a 1D vector, separated with the [SEP] = 102 token.

tokenizer(text="hello i am going, to", text_pair="Rome")
# {'input_ids': [101, 7592, 1045, 2572, 2183, 1010, 2000, 102, 4199, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Parameters accepted in text / text_pair

str

This is the simplest case: you input a single str, which can be a word or a sentence.

tokenizer("hello i am going to Rome")
#{'input_ids': [101, 7592, 1045, 2572, 2183, 2000, 4199, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

List[str]

Here we define a batch of sentences to be tokenized.

tokenizer(["hello i am going","to","Rome"])
# {'input_ids': [[101, 7592, 1045, 2572, 2183, 102], [101, 2000, 102], [101, 4199, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0], [0, 0, 0], [0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1], [1, 1, 1]]}

List[List[str]]

Now this is the case you are trying to use. However, even though it is not stated explicitly anywhere (at least I did not find it), this list of lists does not accept sequences of arbitrary length. As far as I understand, the outer list refers to the batch, while each inner list holds a pair of sentences that will be treated like text / text_pair. If you put more than two sentences in one inner list, you will get an error, as the sketch below shows.
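For instance, this minimal sketch (an assumption-laden example using a fast bert-base-uncased tokenizer, not code from the question) reproduces the TypeError reported above when an inner list has three elements:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# An inner list with three elements is not a valid (text, text_pair) pair,
# so the fast tokenizer rejects it.
tokenizer([["hello i am going", "to", "Rome"]])
# TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]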

Here is an example with a batch of two entries, adding the special tokens manually.

encoded_input = tokenizer(text=[["hello i am going [SEP] to [SEP] Rome", "other sentence"], ["next", "batch"]])

print("input:   ", encoded_input)
print("batch 1: ", tokenizer.decode(encoded_input["input_ids"][0]))
print("batch 2: ", tokenizer.decode(encoded_input["input_ids"][1]))
# Output:
# input:    {'input_ids': [[101, 7592, 1045, 2572, 2183, 102, 2000, 102, 4199, 102, 2060, 6251, 102], [101, 2279, 102, 14108, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}
# batch 1:  [CLS] hello i am going [SEP] to [SEP] rome [SEP] other sentence [SEP]
# batch 2:  [CLS] next [SEP] batch [SEP]

As you can see, the manually added special tokens are treated the same way when encoding and decoding the sentence. The problem is that token_type_ids can only be 0 or 1; there is no additional id for a third sentence.
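If the goal is a single 1-dimensional sequence of ids for three sentences, a minimal sketch (assuming bert-base-uncased) is to join them with [SEP] in the text yourself, keeping in mind that token_type_ids then stays 0 for every token:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Three sentences joined into one string with explicit [SEP] tokens.
encoded = tokenizer("hello i am going [SEP] to [SEP] Rome")
print(encoded["input_ids"])
# [101, 7592, 1045, 2572, 2183, 102, 2000, 102, 4199, 102]
print(encoded["token_type_ids"])
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  - no id is available for a third segment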

You can use a visualization tool like BertViz to check how your sentences behave and whether that matches the behavior you want.
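Here is a rough BertViz sketch for a notebook (this assumes the bertviz package is installed and uses its head_view function; the model and input text are just the examples from this thread):

from transformers import AutoModel, AutoTokenizer
from bertviz import head_view

model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

inputs = tokenizer("hello i am going [SEP] to [SEP] Rome", return_tensors="pt")
outputs = model(**inputs)

attention = outputs.attentions  # one attention tensor per layer
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

head_view(attention, tokens)  # renders the interactive attention view in the notebook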

  • Without the [SEP] token: going attends to to.
  • With the [SEP] token: going attends to [SEP] and almost does not notice to.

Hope this helps :smile: