How to feed string data to a GPU to encode data? SentenceTransformers

rjurney · June 15, 2020, 11:49pm

I am using BERT sentence transformers to encode strings into sentence embeddings in a scalable way and I’ve run into a blocking problem: I can’t get the code to run on a GPU.

Without using a GPU I can pass individual strings to SentenceTransformer.encode() (a Model) and get vector representations. When I use a GPU I get an error that arguments are located on different GPUs. How do I get string data to the GPU? The docs I find indicate this can’t be done and that I need to find a way to batch a bunch of strings to be encoded at once on the GPU. When I pass a numpy array of strings to Model.encode(), I get the same exception: arguments are located on different GPUs. How do you get string data on a GPU?

Please help!

I looked up how to send a numpy array of strings to a GPU using PyTorch and there seems to be no way to do that. I looked at DataParallel and it doesn’t seem to encode data, and I can’t get it to work.

I start out creating a SentenceTransformer using a pooling model and I can run it on one string at a time. The encode() method takes a string or list of strings as an argument. Without sending it to the GPU it works fine. I can’t figure out how to get a list or numpy array of strings to the GPU! It is maddening, there is nothing out there on this. It must be possible because sentence transformers operate on strings. But how?

device = torch.device("cuda:0")
...

sentence_model = SentenceTransformer(
    modules=[bert_model, pooling_model]
)
sentence_model = sentence_model.to(device)

RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1579022034529/work/aten/src/THC/generic/THCTensorIndex.cu:400

When I look at the series, it is a numpy array with dtype object:

In [21]: series.dtype
Out[21]: dtype('O')

In [22]: series.values
Out[22]:
array(['I like cats but also dogs', 'I like cars full of cats',
       'I like apples but not oranges', ..., 'I like cars full of cats',
       'I like apples but not oranges', 'I like tutus on ballerinas.'],
      dtype=object)

When I try to create a tensor to get the data on a GPU, I run into this problem:

In [20]: torch.from_numpy(series.values)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-d0a262ca0b02> in <module>
----> 1 torch.from_numpy(series.values)

TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, int64, int32, int16, int8, uint8, and bool.

So I am very much stuck. How do you encode multiple strings on a GPU?

vdw · June 16, 2020, 12:44am

PyTorch tensors can only contain numerical values, and you are trying to use strings. For NLP tasks, the required steps are to map your vocabulary of words to indexes and then map your strings to lists of indexes.

For example, your mapping might look like {'I': 0, 'like': 1, 'cats': 2, 'but': 3, …}. Then you can convert your strings to something like

array([[0, 1, 2, 3, 4, 5], [0, 1, 32, 16, 8, 3], ...], dtype=int_)

Now you can call torch.from_numpy() on this. Once you have a proper tensor, moving it to the GPU should be no problem.
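The steps above can be sketched roughly as follows. This is only an illustration under some assumptions: it splits on whitespace (BERT models use their own WordPiece tokenizer instead), it reserves index 0 for a hypothetical `<pad>` token so that rows have equal length, and the sample sentences are taken from the question. The tensor conversion is skipped if PyTorch is not installed.

```python
sentences = [
    "I like cats but also dogs",
    "I like cars full of cats",
    "I like apples",
]

# Build the vocabulary: each new word gets the next free index.
# Index 0 is reserved for padding (an assumption of this sketch).
vocab = {"<pad>": 0}
for sentence in sentences:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))

# Map every sentence to a list of word indexes.
encoded = [[vocab[word] for word in sentence.split()] for sentence in sentences]

# Pad the rows to a common length so they form a rectangular array.
max_len = max(len(row) for row in encoded)
padded = [row + [vocab["<pad>"]] * (max_len - len(row)) for row in encoded]

# Only now is there numeric data that can become a tensor and move to a GPU.
try:
    import torch

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    batch = torch.tensor(padded, dtype=torch.long).to(device)
except ImportError:
    batch = None  # PyTorch not available; the index lists are still valid
```

The strings themselves never leave the CPU; only the integer indexes are turned into a tensor and moved to the device.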

You might look into torchtext. It comes with a lot of these basic functionalities to handle text (i.e., creating the vocabulary, creating the mappings, convert your strings to list of indexes, etc.)

rjurney · June 16, 2020, 12:59am

Thank you, I am aware of how to use encode from the tokenizer first. This is just very confusing because SentenceTransformer.encode() takes a list of strings. I don’t know why it does that. Thanks, I’ll look at torchtext.

rjurney · June 26, 2020, 5:22pm

What I’ve found is that SentenceTransformers is not scalable because it doesn’t use a GPU for encoding records: it takes strings, and you can’t create a tensor of strings to put on a GPU because PyTorch has no string tensor type. In my application I have to encode the data for BERT on many cores, as that step isn’t GPU accelerated, and then feed the integer-encoded data into SentenceTransformers. I’m having to alter the library to do this.

Chandrayee · October 21, 2020, 6:51pm

I moved the model to the device, and that automatically sped up the encoding with SentenceTransformer. I did not have to move the data to the device.
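A minimal sketch of that approach, assuming the sentence_transformers and torch packages are installed. The helper names `to_sentence_list` and `encode_sentences`, the default model name, and the batch size are illustrative choices, not taken from the thread. The point is that encode() receives plain Python strings; tokenization happens inside it, and only the resulting numeric tensors are moved to the model's device.

```python
def to_sentence_list(values):
    """Turn a list, NumPy object array, or pandas Series into a plain list of str."""
    return [str(value) for value in values]


def encode_sentences(sentences, model_name="all-MiniLM-L6-v2", batch_size=32):
    """Encode strings with a SentenceTransformer placed on the GPU if one exists."""
    # Imported here so that to_sentence_list() works without these packages.
    import torch
    from sentence_transformers import SentenceTransformer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = SentenceTransformer(model_name, device=device)

    # The strings stay on the CPU. encode() tokenizes them in batches and
    # moves the token tensors to the model's device before the forward pass.
    return model.encode(to_sentence_list(sentences), batch_size=batch_size)
```

Calling `encode_sentences(series.values)` with the object array from the question would pass a list of strings to the model, so no tensor of strings is ever needed.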