DataLoader behaves differently for numpy array and list types

The only difference is that one of the parameters passed to DataLoader is of type numpy.ndarray and the other is of type list, yet the DataLoader gives totally different results.

You can use the following code to reproduce it:

from torch.utils.data import DataLoader, Dataset
import numpy as np

class my_dataset(Dataset):
    def __init__(self, data, label):
        self.data = data
        self.label = label

    def __getitem__(self, index):
        return self.data[index], self.label[index]

    def __len__(self):
        return len(self.data)

train_data = [[1, 2, 3], [5, 6, 7], [11, 12, 13], [15, 16, 17]]
train_label = [-1, -2, -11, -12]

########################### Look at here:

test = DataLoader(dataset=my_dataset(np.array(train_data), train_label), batch_size=2)
for i in test:
    print("numpy data:")
    print(i)
    break

test = DataLoader(dataset=my_dataset(train_data, train_label), batch_size=2)
for i in test:
    print("list data:")
    print(i)
    break

The result is:

numpy data:
[tensor([[1, 2, 3],
        [5, 6, 7]]), tensor([-1, -2])]
list data:
[[tensor([1, 5]), tensor([2, 6]), tensor([3, 7])], tensor([-1, -2])]  
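For comparison, converting the data and labels to tensors up front restores the batched layout seen in the numpy case. A minimal sketch, assuming PyTorch is installed (TensorDataset is just one convenient option here):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Same data as above, but stored as tensors from the start:
data = torch.tensor([[1, 2, 3], [5, 6, 7], [11, 12, 13], [15, 16, 17]])
labels = torch.tensor([-1, -2, -11, -12])

loader = DataLoader(TensorDataset(data, labels), batch_size=2)
first = next(iter(loader))
print(first[0])  # tensor([[1, 2, 3], [5, 6, 7]])
print(first[1])  # tensor([-1, -2])
```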

The original question is Here


Yes, this is the behavior of the default collate_fn. Elements of a Python list are treated as separate fields to be batched individually, while a numpy array is handled like a torch tensor and stacked along a new batch dimension.
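To make that concrete, for list-valued samples the default collate recurses element-wise, which effectively transposes the batch. A simplified pure-Python sketch of what happens (not the actual PyTorch implementation):

```python
# Two (data, label) samples where data is a plain Python list:
samples = [([1, 2, 3], -1), ([5, 6, 7], -2)]

# The default collate batches each field of the sample tuple; for
# list-valued fields it batches element by element, which transposes
# the lists instead of stacking them as rows:
data_fields = list(zip(*(data for data, _ in samples)))
labels = [label for _, label in samples]
print(data_fields)  # [(1, 5), (2, 6), (3, 7)]
print(labels)       # [-1, -2]
```

Each inner tuple here corresponds to one of the three `tensor([1, 5])`-style tensors in the list-data output above.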

So the input dataset should be explicitly converted to an array type?

Well, often the data are already tensors. You can also use your own collate function.
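A custom collate_fn that stacks the list data into one tensor per batch might look like this. A sketch, assuming PyTorch is installed; `ListDataset` and `collate_as_tensors` are illustrative names, not library APIs:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ListDataset(Dataset):  # same shape as my_dataset above
    def __init__(self, data, label):
        self.data = data
        self.label = label

    def __getitem__(self, index):
        return self.data[index], self.label[index]

    def __len__(self):
        return len(self.data)

def collate_as_tensors(batch):
    # batch is a list of (data, label) tuples; stack the list-of-lists
    # data into a single tensor instead of letting the default collate
    # transpose it:
    data = torch.tensor([d for d, _ in batch])
    labels = torch.tensor([l for _, l in batch])
    return [data, labels]

dataset = ListDataset([[1, 2, 3], [5, 6, 7], [11, 12, 13], [15, 16, 17]],
                      [-1, -2, -11, -12])
loader = DataLoader(dataset, batch_size=2, collate_fn=collate_as_tensors)
print(next(iter(loader)))
```

With this collate_fn the list data produces the same batch layout as the numpy version, without converting the dataset itself.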
