DataLoader behaves differently for numpy array and list types

The only difference is that one of the parameters passed to DataLoader is of type numpy.ndarray and the other is of type list, yet the DataLoader gives totally different results.

You can use the following code to reproduce it:

from torch.utils.data import DataLoader, Dataset
import numpy as np

class my_dataset(Dataset):
    def __init__(self, data, label):
        self.data = data
        self.label = label

    def __getitem__(self, index):
        return self.data[index], self.label[index]

    def __len__(self):
        return len(self.data)

train_data = [[1, 2, 3], [5, 6, 7], [11, 12, 13], [15, 16, 17]]
train_label = [-1, -2, -11, -12]

########################### Look at here:

test = DataLoader(dataset=my_dataset(np.array(train_data), train_label), batch_size=2)
for i in test:
    print("numpy data:")
    print(i)
    break

test = DataLoader(dataset=my_dataset(train_data, train_label), batch_size=2)
for i in test:
    print("list data:")
    print(i)
    break

The result is:

numpy data:
[tensor([[1, 2, 3],
        [5, 6, 7]]), tensor([-1, -2])]
list data:
[[tensor([1, 5]), tensor([2, 6]), tensor([3, 7])], tensor([-1, -2])]  
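For comparison, converting the data and labels to tensors up front restores the batched layout seen in the numpy case. A minimal sketch, assuming PyTorch is installed (TensorDataset is just one convenient option here):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Same data as above, but stored as tensors from the start:
data = torch.tensor([[1, 2, 3], [5, 6, 7], [11, 12, 13], [15, 16, 17]])
labels = torch.tensor([-1, -2, -11, -12])

loader = DataLoader(TensorDataset(data, labels), batch_size=2)
first = next(iter(loader))
print(first[0])  # tensor([[1, 2, 3], [5, 6, 7]])
print(first[1])  # tensor([-1, -2])
```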

The original question is Here


Yes, this is the behavior of the default collate_fn. Elements of a Python list are treated as separate fields to be batched individually, while a numpy array is handled like a torch tensor and stacked along a new batch dimension.
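To make that concrete, for list-valued samples the default collate recurses element-wise, which effectively transposes the batch. A simplified pure-Python sketch of what happens (not the actual PyTorch implementation):

```python
# Two (data, label) samples where data is a plain Python list:
samples = [([1, 2, 3], -1), ([5, 6, 7], -2)]

# The default collate batches each field of the sample tuple; for
# list-valued fields it batches element by element, which transposes
# the lists instead of stacking them as rows:
data_fields = list(zip(*(data for data, _ in samples)))
labels = [label for _, label in samples]
print(data_fields)  # [(1, 5), (2, 6), (3, 7)]
print(labels)       # [-1, -2]
```

Each inner tuple here corresponds to one of the three `tensor([1, 5])`-style tensors in the list-data output above.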

So the input dataset should be explicitly converted to an array type?

Well, often the data are already tensors. You can also use your own collate function.
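A custom collate_fn that stacks the list data into one tensor per batch might look like this. A sketch, assuming PyTorch is installed; `ListDataset` and `collate_as_tensors` are illustrative names, not library APIs:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ListDataset(Dataset):  # same shape as my_dataset above
    def __init__(self, data, label):
        self.data = data
        self.label = label

    def __getitem__(self, index):
        return self.data[index], self.label[index]

    def __len__(self):
        return len(self.data)

def collate_as_tensors(batch):
    # batch is a list of (data, label) tuples; stack the list-of-lists
    # data into a single tensor instead of letting the default collate
    # transpose it:
    data = torch.tensor([d for d, _ in batch])
    labels = torch.tensor([l for _, l in batch])
    return [data, labels]

dataset = ListDataset([[1, 2, 3], [5, 6, 7], [11, 12, 13], [15, 16, 17]],
                      [-1, -2, -11, -12])
loader = DataLoader(dataset, batch_size=2, collate_fn=collate_as_tensors)
print(next(iter(loader)))
```

With this collate_fn the list data produces the same batch layout as the numpy version, without converting the dataset itself.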
