[Solved] Will change in dataset be reflected on dataloader automatically?

Hi, I constructed my own dataset following the data loading tutorial, and I'm using the standard DataLoader provided by PyTorch. The code looks like this:

train_set = MyDataset(some_parameter...)
train_loader = DataLoader(dataset=train_set, other_setting...)

for batch_idx, (data, target) in enumerate(train_loader):
    # training step

After several training epochs, I would like to regenerate my training dataset. I have written a custom member function regenerate_sample() for MyDataset, so I can just call train_set.regenerate_sample() to change the samples in train_set.

But what I am not sure about is whether this change will be reflected in train_loader, i.e., will train_loader now generate batches from the new samples instead of the old ones? Or do I have to manually construct a new DataLoader object in order to use the updated dataset, like the following?

train_set.regenerate_sample()
train_loader = DataLoader(dataset=train_set, other_setting...)

Which is the case? I am a little confused.


The DataLoader seems to keep a reference to the Dataset object rather than a copy, so your regenerate_sample() approach should work.
Here is a small example, which works fine for me:

import torch
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)

    def regenerate(self):
        # replace the underlying storage in place
        self.data = torch.ones(3, 1).float()


data = torch.zeros(3, 1).float()
dataset = MyDataset(data)
print(dataset[0])  # should output 0

data_loader = DataLoader(dataset)

for d in data_loader:
    print(d)  # batches of zeros

dataset.regenerate()
print(dataset[0])  # should output 1

for d in data_loader:
    print(d)  # batches of ones - the loader sees the regenerated data

Yes, it seems that a reference to the Dataset is used. Thanks for your example!

In case this helps anyone, I can confirm the switch also occurs successfully for num_workers > 0. Any modifications made to the Dataset object propagate to the workers as well, since the worker processes are re-created each time a new iterator is constructed over the DataLoader.
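For anyone who wants to verify this themselves, here is a minimal sketch, reusing the MyDataset class from the example above (the num_workers value of 2 is arbitrary):

import torch
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)

    def regenerate(self):
        self.data = torch.ones(3, 1).float()


if __name__ == '__main__':  # guard needed on platforms that spawn workers
    dataset = MyDataset(torch.zeros(3, 1).float())
    loader = DataLoader(dataset, num_workers=2)

    for d in loader:
        print(d)  # zeros - the workers created for this loop see the old data

    dataset.regenerate()  # modify the dataset in the main process

    for d in loader:
        print(d)  # ones - fresh workers pick up the regenerated dataset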


What if we are using multiple machines, each equipped with multiple GPUs? Should we update the dataset only on rank 0?