Invalid batch type

Hi guys, I am new to NumPy and PyTorch and have the problem shown below. My training set is a .tsv file with 3 columns: Quality (1 indicates the 2 sentences are similar, 0 that they are not), #1 String (1st string), #2 String (2nd string).
I have tried converting everything to type list, but somehow when it goes through DataLoader() it changes to type object (just my assumption, I am quite new to this).
Can you guys fix this for me, or do you have any other suggestions? Thank you!

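import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, random_split

def get_dataloaders(ds, lengths=(0.6, 0.2, 0.2), batch_size=32, seed=42, num_workers=2):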
    train_set, val_set, test_set = random_split(ds, lengths=lengths, generator=torch.Generator().manual_seed(seed))

    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=num_workers)
    val_loader = DataLoader(val_set, batch_size=batch_size, shuffle=False, num_workers=num_workers)
    test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False, num_workers=num_workers)
    return train_loader, val_loader, test_loader

data_dir = "/data.tsv"
data = pd.read_csv(data_dir, sep='\t')
y = data['Quality'].values                    # dtype: int64
X = data[['#1 String', '#2 String']].values   # dtype: O
data_input = np.column_stack((X, y))          # dtype: O

train_loader, val_loader, test_loader = get_dataloaders(data_input)

for batch in train_loader: #TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found object
   print("1")
----------------------------------------------------------------------------------

TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py", line 265, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py", line 119, in collate
    return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py", line 169, in collate_numpy_array_fn
    raise TypeError(default_collate_err_msg_format.format(elem.dtype))
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found object

Edit: I see where I went wrong: I kept looking at the types of X and y instead of their dtypes. But I assume that when combining 2 columns into 1 (X), the result will always have dtype object?

What does print(data_input.dtype) return?
I guess you are mixing different types in the numpy.array, so it contains objects instead of a common numerical dtype.

Thank you for pointing that out, I have updated the post!

No, as it depends on the inputs as seen in this example:

import numpy as np

data_input = np.column_stack((np.random.randn(10, 10), np.random.randn(10, 10)))
print(data_input.dtype)
# float64

data_input = np.column_stack((np.random.randn(1), np.array([dict()])))
print(data_input.dtype)
# object

If you are using “mixed” types and NumPy cannot convert them to a common dtype, it will store them as objects.
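
For the sentence-pair data in the original post, one possible way around the object array is to skip np.column_stack and hand the DataLoader plain Python strings and integers, which default_collate can batch. This is only a minimal sketch, assuming the column names from the post; the class name SentencePairDataset and the sample rows are hypothetical, and num_workers=0 is used just to keep the example simple:

```python
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset


class SentencePairDataset(Dataset):  # hypothetical name, not from the post
    """Returns (first_sentence, second_sentence, label) per sample."""

    def __init__(self, frame):
        # Plain Python lists of str / int instead of one object-dtype array.
        self.first = frame["#1 String"].astype(str).tolist()
        self.second = frame["#2 String"].astype(str).tolist()
        self.labels = frame["Quality"].astype(int).tolist()

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.first[idx], self.second[idx], self.labels[idx]


# Hypothetical rows standing in for pd.read_csv(data_dir, sep='\t').
frame = pd.DataFrame(
    {
        "Quality": [1, 0, 1, 0],
        "#1 String": ["a cat sits", "it rains", "he runs", "red car"],
        "#2 String": ["a cat is sitting", "the sun shines", "he is running", "blue sky"],
    }
)

loader = DataLoader(SentencePairDataset(frame), batch_size=2, shuffle=False, num_workers=0)
first_batch = next(iter(loader))

# Strings stay as sequences of str; labels are collated into a tensor.
sentences_1, sentences_2, labels = first_batch
print(sentences_1, sentences_2, labels)
```

A dataset like this can be passed to random_split in get_dataloaders in place of data_input; tokenizing the strings into tensors would still be a separate step.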


Hmm, this is quite strange:

  • My column data['Quality'] contains only integers. I printed data['Quality'].dtype and it was what I expected: int64.
  • My column data['#1 String'] contains only strings (I checked by looping over all elements in the column and printing the type of any that are not strings; nothing was printed, so all of them are strings). Because of that, I expected data['#1 String'].dtype to be 'str', but it showed object. I tried to cast it with data['#1 String'] = data['#1 String'].astype(str), but it still showed object.
    I can’t figure out what happened.

Which output dtype would you expect if you concatenate integers with strings?

I think it is Object?

Yes, it is because there is no other common dtype, but I understand you are expecting something else?
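
The promotion rule can be checked directly with np.result_type. The sketch below uses small made-up arrays as stand-ins for X and y from the original post (X comes out of pandas as an object array there); the variable names are only for illustration:

```python
import numpy as np

# Two numeric dtypes have a common numeric dtype.
print(np.result_type(np.int64, np.float64))  # float64

# An object dtype combined with anything else stays object.
print(np.result_type(np.int64, object))  # object

# Hypothetical stand-ins for X (strings held in an object array) and y (integers).
strings = np.array([["a", "b"], ["c", "d"]], dtype=object)
labels = np.array([1, 0])

stacked = np.column_stack((strings, labels))
print(stacked.dtype)  # object
print(stacked.shape)  # (2, 3)
```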

I understand what you said in this reply and I see my mistake, thank you! But my question below is not related to the previous one (sorry, I should have mentioned this earlier).
What I want to ask in my newest question is: I assumed a column containing only 'str' elements would have dtype 'str', but when I printed it, the dtype was object. Why is that?

Probably because this column itself does not contain strings only but mixed types.
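
One way to check is to collect the Python types of the elements. Keep in mind that pandas has traditionally stored pure string columns with dtype object as well, and .astype(str) keeps that dtype in those versions, so object on its own does not prove the column is mixed. The sketch below uses made-up Series as stand-ins for data['#1 String'] (dtype=object is set explicitly so the example does not depend on the pandas version's default); the dedicated "string" dtype is available as an opt-in:

```python
import pandas as pd

# Hypothetical stand-in for data['#1 String']: only strings, stored as object.
col = pd.Series(["first sentence", "second sentence"], dtype=object)
print(col.dtype)  # object

# Every element is a str even though the dtype is object.
element_types = {type(value) for value in col}
print(element_types)  # {<class 'str'>}

# A genuinely mixed column reports the same dtype, but more element types.
mixed = pd.Series(["text", 1], dtype=object)
mixed_types = {type(value) for value in mixed}
print(mixed.dtype, mixed_types)

# Opting in to the dedicated string extension dtype.
as_string = col.astype("string")
print(as_string.dtype)
```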