TypeError: can't convert np.ndarray of type numpy.object_

This is my data,

id label tweet
0 1 0 @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run

The data is text. I have pre-processed it and now want to fit a PyTorch LSTM model on it.
To fit the model I have to split the dataset into train, validation, and test sets, and since PyTorch has a DataLoader module for loading datasets, I want to use it. But as soon as I do this:

train_data = TensorDataset(torch.from_numpy(np.array(train_x)), torch.from_numpy(np.array(train_y)))
valid_data = TensorDataset(torch.from_numpy(np.array(valid_x)), torch.from_numpy(np.array(valid_y)))
test_data = TensorDataset(torch.from_numpy(np.array(test_x)), torch.from_numpy(np.array(test_y)))

It throws this error:
TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, int64, int32, int16, int8, and uint8.
on this line:

----> 4 train_data = TensorDataset(torch.from_numpy(np.array(train_x)),torch.from_numpy(np.array(train_y)))

I have also printed the shape and dtype of the split datasets:

print('shape of training set: {}' .format(train_x.shape))
print('shape of valid set: {}' .format(valid_x.shape))
print('shape of test set: {}' .format(test_x.shape))

shape of training set: (32979,)
shape of valid set: (2910,)
shape of test set: (2910,)
print(train_x.dtype)

object

How can I solve this error? I have tried converting the object type to string or float, but neither worked, and I have not found a solution. Any help is appreciated.


I would suggest encoding the words as indices, similar to the Lang class in the Seq2Seq tutorial.

So, here I am encoding my sorted words using this -

reviews_int = []
for review in combi['tidy_tweet']:
  r = [vocab_to_int[w] for w in review.split()]
  reviews_int.append(r)
print(reviews_int[0:3])
OUTPUT: [[38, 4, 86, 10, 14869, 6, 10, 22, 4161, 71, 9356, 97, 332, 245, 97, 14870, 1301], [170, 8, 11409, 2355, 3, 33, 18, 424, 625, 59, 70, 18, 1651, 14871, 14872, 7, 14873, 22887, 14874], [76, 25, 4162]]

Then I convert it to an np.array like this: reviews_int = np.array(reviews_int)
Still, when I print the dtype of reviews_int, it shows:
print(reviews_int.dtype)
object
How can I fix this? Is Seq2Seq the only way? Is my approach wrong?
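A likely reason the dtype is still object: the encoded reviews have different lengths (17, 19, and 3 ids in the output above), so NumPy cannot build a rectangular integer array from them. A common fix is to pad or truncate every review to one fixed length first. The following is only a minimal sketch: the helper name pad_features, the value of seq_length, and the use of 0 as the padding id are assumptions, and it presumes vocab_to_int never assigns index 0 to a real word.

```python
import numpy as np

def pad_features(reviews_int, seq_length):
    """Left-pad with 0 (or truncate) so every review has exactly seq_length ids.

    Hypothetical helper; assumes index 0 is reserved for padding.
    """
    features = np.zeros((len(reviews_int), seq_length), dtype=np.int64)
    for i, review in enumerate(reviews_int):
        if len(review) == 0:
            continue  # an empty review stays an all-zero row
        trimmed = review[:seq_length]  # keep at most seq_length ids
        features[i, -len(trimmed):] = trimmed  # right-align the ids
    return features

# Small ragged example in the same format as reviews_int above
reviews_int = [[38, 4, 86, 10], [170, 8], [76, 25, 4162, 7, 9, 11]]
features = pad_features(reviews_int, seq_length=5)
print(features.dtype, features.shape)
```

The result is a 2-D int64 array, which is one of the dtypes listed in the error message, so torch.from_numpy(features) should accept it. The train/valid/test split would then be done on the padded array rather than on the ragged lists.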

I’m not sure how you’ve processed the text, but the methods from the seq2seq tutorial seem to work for your example tweet:

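import torch

# normalizeString and Lang are defined in the seq2seq tutorial, not in torch itself
text = '@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run'
text = normalizeString(text)  # clean up the raw tweet
input_lang = Lang('tweet')  # 'tweet' is just a name for this vocabulary
input_lang.addSentence(text)  # register every word in word2index

# map each word to its integer index
encoded = [input_lang.word2index[w] for w in text.split(' ')]
input_tensor = torch.tensor(encoded)
print(input_tensor.type())
> torch.LongTensor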

So, do I need to apply those two methods, normalizeString and Lang?
Also, what does input_lang = Lang('tweet') mean? Why aren’t we passing the text file there?

You don’t need the exact same Lang class, but you could have a look at the underlying operations and how they transform each word to an index.
Also, you don’t necessarily need normalizeString, but it might help with cleaning the tweets.

The passed argument to Lang is just the name of the language, which I called “tweet”.

The most important part is that the input_lang.word2index dict gives you valid word indices, which can be used to train a model.
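To illustrate what a word-to-index mapping boils down to without the Lang class, here is a rough sketch in plain Python. The two sample tweets are made up for the example, the name vocab_to_int simply mirrors the one used earlier in the thread, and starting the indices at 1 (so that 0 stays free for padding) is an assumption rather than something the tutorial prescribes.

```python
from collections import Counter

# Made-up, already cleaned tweets standing in for the real column
tweets = ["when a father is dysfunctional", "a father is so selfish"]

# Count every word across all tweets
counts = Counter(word for tweet in tweets for word in tweet.split())

# More frequent words get smaller indices; start at 1 so 0 can mean "padding"
vocab_to_int = {
    word: index
    for index, (word, _) in enumerate(counts.most_common(), start=1)
}

# Encode each tweet as a list of integer ids
encoded = [[vocab_to_int[w] for w in tweet.split()] for tweet in tweets]
print(encoded)
```

Each tweet becomes a list of plain Python ints. The lists still differ in length, so they need padding or truncation to a common length before they can be stacked into a single tensor.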
