Concatenating datasets and preserving their index

Hi,

I’m trying to concatenate more than one dataset, but after the concatenation the indices of the datasets are no longer in order.

I have 6 image datasets and 6 corresponding label datasets, and I want to pass them through in the same sequence.

This is the length of each dataset:
666
80
59
39
22
72

and this is how I concatenate them:
train_all = torch.utils.data.ConcatDataset({train_good,train_blow_hole,train_break,train_crack,train_fray,train_uneven})

train_all_label = torch.utils.data.ConcatDataset({train_label_good,train_label_blow_hole,train_label_break,train_label_crack,train_label_fray,train_label_uneven})

After the concatenation:

print(train_all.cumulative_sizes)
[72, 738, 797, 877, 899, 938]

print(train_all_label.cumulative_sizes)
[72, 111, 777, 836, 858, 938]

The indices are not the same anymore, and this will cause issues in the training stage later. Is there an efficient way to concatenate the datasets while preserving their index in the concatenated dataset?
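For reference, here is a small pure-Python sketch (using itertools.accumulate, outside of PyTorch) of the cumulative sizes I expected, given the dataset lengths listed above in that order:

```python
from itertools import accumulate

# Lengths of the 6 datasets, in the order I listed them above.
lengths = [666, 80, 59, 39, 22, 72]

# Running totals, i.e. what cumulative_sizes should look like
# if the concatenation preserved my ordering.
expected = list(accumulate(lengths))
print(expected)  # [666, 746, 805, 844, 866, 938]
```

Instead, both concatenated datasets report a different (and mutually inconsistent) ordering, as shown above.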

Thanks

Is there anyone who has an idea how to solve this, or could possibly provide some reference?

Hello,

I am not super familiar with ConcatDataset, but after looking at the docs, here is how I would proceed in your situation. First, I would create a CustomDataset class that inherits from PyTorch’s torch.utils.data.Dataset, so that CustomDataset holds both the data and the target of each individual dataset. Second, once you have your 6 CustomDataset objects, you can use torch.utils.data.ConcatDataset to form your new dataset. I believe your mistake is in separating the data from the labels, which should live in the same class. Also note that you are passing the datasets inside a set literal `{...}`; Python sets do not preserve order, which is why your two cumulative_sizes come out shuffled differently. ConcatDataset should be given an ordered iterable such as a list. Here is a small example:

import torch

class CustomDataset(torch.utils.data.Dataset):
    """Pairs each sample with its target so they are always indexed together."""

    def __init__(self, data, target):
        self.data = data
        self.target = target

    def __getitem__(self, index):
        # Returning both at once guarantees data and target stay aligned.
        return self.data[index], self.target[index]

    def __len__(self):
        return len(self.data)

x = torch.rand(size=(666, 10), dtype=torch.float32)
x_labels = torch.zeros(size=(666,), dtype=torch.int32)
y = torch.rand(size=(80, 10), dtype=torch.float32)
y_labels = torch.ones(size=(80,), dtype=torch.int32)
z = torch.rand(size=(59, 10), dtype=torch.float32)
z_labels = torch.zeros(size=(59,), dtype=torch.int32)

x_dataset = CustomDataset(x, x_labels)
y_dataset = CustomDataset(y, y_labels)
z_dataset = CustomDataset(z, z_labels)

train_dataset = torch.utils.data.ConcatDataset([x_dataset, y_dataset, z_dataset])
print(train_dataset.cumulative_sizes)  # [666, 746, 805]
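If you are curious how indexing works afterwards: ConcatDataset resolves a global index by bisecting into cumulative_sizes to find which underlying dataset it falls in. Roughly like this pure-Python sketch (it mirrors the idea, not the actual implementation):

```python
import bisect

def locate(cumulative_sizes, index):
    """Map a global index to a (dataset_idx, local_idx) pair."""
    # bisect_right finds the first dataset whose cumulative size exceeds index.
    dataset_idx = bisect.bisect_right(cumulative_sizes, index)
    if dataset_idx == 0:
        local_idx = index
    else:
        # Subtract everything that came before this dataset.
        local_idx = index - cumulative_sizes[dataset_idx - 1]
    return dataset_idx, local_idx

print(locate([666, 746, 805], 700))  # (1, 34): sample 700 is item 34 of dataset 1
```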

Hope it helps!

Hi beaupreda,

Thanks and appreciate the information.

I wasn’t aware of inheritance until you mentioned it; this sure looks useful.

Actually, what I’m trying to do is semantic segmentation with a U-Net, so in each epoch I need to feed in an image and its mask. Are you aware of a better way to do this?

Anyway, I will proceed with your suggestion and see how it works. Thanks again!
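In case it helps anyone else, here is roughly how I plan to adapt the suggestion to the segmentation case: a dataset that returns (image, mask) pairs. A plain class with __getitem__ and __len__ is shown for brevity (in practice it would inherit torch.utils.data.Dataset), and the names below are just my placeholders, not anything from the PyTorch API:

```python
class SegmentationDataset:
    """Map-style dataset yielding (image, mask) pairs for U-Net training."""

    def __init__(self, images, masks, transform=None):
        assert len(images) == len(masks), "every image needs a mask"
        self.images = images
        self.masks = masks
        # Any transform must be applied jointly so image and mask stay aligned.
        self.transform = transform

    def __getitem__(self, index):
        image, mask = self.images[index], self.masks[index]
        if self.transform is not None:
            image, mask = self.transform(image, mask)
        return image, mask

    def __len__(self):
        return len(self.images)
```

Six of these (one per defect class) could then go into ConcatDataset as a single list, so each sample always carries its own mask.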