How to make BertTokenizer return GPU tensors instead of CPU tensors?

Jack_N · July 13, 2022, 7:22pm

I am wondering how I can make the BERT tokenizer return tensors on the GPU rather than the CPU. I am following the sample code found here: BERT. The code is below.

My question is about the 5th line of code, specifically how I can make the tokenizer return a cuda tensor instead of having to add the line of code inputs = inputs.to("cuda").

from transformers import BertTokenizer, BertForPreTraining
import torch
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
prediction_logits = outputs.prediction_logits
seq_relationship_logits = outputs.seq_relationship_logits

thecho7 · July 14, 2022, 8:46am

I found the BatchEncoding class, which has a to method that moves the resulting tensors to a given device.
Calling the tokenizer returns a BatchEncoding, so I think it is much better to call inputs.to(device) on the result than to change how the BertTokenizer instance is created.
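
A minimal sketch of that approach, in case it helps. The helper names pick_device and encode_on_device are made up for illustration and are not part of transformers; the only library behavior relied on is that calling a tokenizer with return_tensors="pt" returns a BatchEncoding and that BatchEncoding.to(device) moves its tensors. As far as I can tell the tokenizer call itself takes no device argument, so the tensors are created on the CPU and moved afterwards.

```python
import torch
from transformers import BatchEncoding


def pick_device() -> torch.device:
    """Use the GPU when PyTorch can see one, otherwise fall back to the CPU."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")


def encode_on_device(tokenizer, text, device) -> BatchEncoding:
    """Tokenize `text` and move every tensor in the result to `device`.

    `tokenizer` is assumed to be a Hugging Face tokenizer (hypothetical helper,
    not a library function). Chaining .to(device) avoids a separate
    `inputs = inputs.to("cuda")` line.
    """
    return tokenizer(text, return_tensors="pt").to(device)
```

With the tokenizer and model from the question, this would be used as device = pick_device(), then model.to(device), then inputs = encode_on_device(tokenizer, "Hello, my dog is cute", device). The model has to be on the same device as the inputs, otherwise model(**inputs) raises a device mismatch error.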

These are the references:

github.com

huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L1452


      
            A special token representing a masked token (used by masked-language modeling pretraining objectives, like
            BERT). Will be associated to `self.mask_token` and `self.mask_token_id`.
        additional_special_tokens (tuple or list of `str` or `tokenizers.AddedToken`, *optional*):
            A tuple or a list of additional special tokens. Add them here to ensure they won't be split by the
            tokenization process. Will be associated to `self.additional_special_tokens` and
            `self.additional_special_tokens_ids`.
"""


@add_end_docstrings(INIT_TOKENIZER_DOCSTRING)
class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
    """
    Base class for [`PreTrainedTokenizer`] and [`PreTrainedTokenizerFast`].

    Handles shared (mostly boiler plate) methods for those two classes.
    """

    vocab_files_names: Dict[str, str] = {}
    pretrained_vocab_files_map: Dict[str, Dict[str, str]] = {}
    pretrained_init_configuration: Dict[str, Dict[str, Any]] = {}
    max_model_input_sizes: Dict[str, Optional[int]] = {}

github.com

huggingface/transformers/blob/main/src/transformers/models/bert/tokenization_bert.py#L137


      
          
          
def whitespace_tokenize(text):
    """Runs basic whitespace cleaning and splitting on a piece of text."""
    text = text.strip()
    if not text:
        return []
    tokens = text.split()
    return tokens


class BertTokenizer(PreTrainedTokenizer):
    r"""
    Construct a BERT tokenizer. Based on WordPiece.

    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
    this superclass for more information regarding those methods.

    Args:
        vocab_file (`str`):
            File containing the vocabulary.
        do_lower_case (`bool`, *optional*, defaults to `True`):