Trouble loading tensors of different sizes

Hi guys,

I’m trying to load point cloud data with a DataLoader. The samples have different sizes, e.g. a tensor of shape (100, 3) and a tensor of shape (200, 3). I’m able to load them with my custom collate function, but there is no batch dimension in this case. Could someone please help me with it? Thanks.

Could you share the code you are using to load the data?
Your collate function would be of interest.
I assume the first dimension corresponds to the number of samples in the current file?
Is it not possible to load the samples in a Dataset without a custom collate_fn?

In short, the code looks like this

import numpy as np
import torch

# Collect all samples and concatenate them along the point dimension
result = []
for i in range(len(data)):
    result.append(data[i])
result = torch.FloatTensor(np.concatenate(result, axis=0))

So, the two tensors of shape (100, 3) and (200, 3) are combined into one tensor of shape (300, 3). I use another variable to count the number of rows of each tensor. As you can see, I lose the batch dimension by doing this. That is fine for a single GPU, but the batch dimension is necessary for the multi-GPU case.
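For reference, a rough sketch of what my collate_fn does (the names here are just illustrative, not my exact code):

def concat_collate(batch):
    # Concatenate all point clouds in the batch along the point dimension
    # and keep the number of rows per cloud so they can be split apart later.
    counts = [sample.shape[0] for sample in batch]
    points = torch.FloatTensor(np.concatenate(batch, axis=0))
    return points, counts  # (100, 3) + (200, 3) -> (300, 3), counts = [100, 200]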

Thanks for the info!
Do you want to use all 300 samples at once?
I see the disadvantage of this approach, since your batch size is not flexible anymore.
I’ve created a small example to use a list of different sized point clouds and load each sample in the Dataset:

import torch
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    def __init__(self, data):
        self.data = data
        self.lengths = [len(d) for d in self.data]
        self.len = sum(self.lengths)
        
    def __getdata__(self, index):
        # Map the flat sample index to the corresponding point cloud and row
        for idx, length in enumerate(self.lengths):
            if (index - length) < 0:
                print('Using data[{}][{}]'.format(idx, index))
                x = self.data[idx][index]
                break
            index -= length
        return x
    
    def __getitem__(self, index):
        x = self.__getdata__(index)
        return x
    
    def __len__(self):
        return self.len


data = []
data.append(torch.randn(100, 3))
data.append(torch.randn(200, 3))
data.append(torch.randn(150, 3))

dataset = MyDataset(data)
print(len(dataset))
# Check a few indices
dataset[0]
dataset[50]
dataset[99]
dataset[100]
dataset[299]
dataset[300]
dataset[449]

loader = DataLoader(
    dataset,
    batch_size=10,
    shuffle=True,
    num_workers=2
)

for data in loader:
    pass

This will make sure only a single point is loaded per sample, so that you can use the DataLoader as usual.
Let me know if that works for you!
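If you need to keep one complete point cloud per batch element instead (e.g. so each GPU gets whole clouds), one alternative would be to pad each batch to the longest cloud and return the original lengths, so the padded rows can be masked out later. This is just a sketch and not tied to your exact setup; the class and function names are made up for illustration:

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class CloudDataset(Dataset):
    # Hypothetical dataset: one complete point cloud per index
    def __init__(self, clouds):
        self.clouds = clouds

    def __getitem__(self, index):
        return self.clouds[index]

    def __len__(self):
        return len(self.clouds)


def pad_collate(batch):
    # Pad all clouds in the batch to the longest one and return the
    # true lengths so the padded rows can be masked out in the model.
    lengths = torch.tensor([cloud.size(0) for cloud in batch])
    padded = pad_sequence(batch, batch_first=True)  # [batch_size, max_len, 3]
    return padded, lengths


clouds = [torch.randn(100, 3), torch.randn(200, 3), torch.randn(150, 3)]
padded_loader = DataLoader(CloudDataset(clouds), batch_size=2, collate_fn=pad_collate)

for padded, lengths in padded_loader:
    print(padded.shape, lengths)
    # e.g. torch.Size([2, 200, 3]) tensor([100, 200])

This keeps the batch dimension intact, at the cost of some wasted memory for the padding.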


Thanks for the detailed solution. This is really helpful.
