Quick question about the torch.utils.data.Sampler class and the DataLoader.
What is the behaviour of self.data_source = data_source in the definition of all Samplers?
I have a custom data.Dataset class with a very expensive initialisation: does calling self.data_source = data_source in the __init__ of the sampler initialise a copy of the dataset?
My custom data.Dataset also has an attribute self.dataset_index (an instance of pandas.DataFrame) that is pretty much all I need to access from the sampler. Does the data_source parameter strictly need to be an instance of data.Dataset, or can it be anything with a __len__?
As always thanks in advance!
Also: len(df) on a pandas.DataFrame is a single number (the row count), and df.__len__() returns the same thing; it's df.shape that gives both dimensions.
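A quick check (with a small throwaway DataFrame) shows the two agree, and that df.shape is what holds both dimensions:

```python
import pandas as pd

# len(df) and df.__len__() are the same call: both return the row count.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
print(len(df))        # 3
print(df.__len__())   # 3
print(df.shape)       # (3, 2) -- rows and columns
```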
The data_source refers to the Dataset object. The dataset won't be re-initialised in the sampler; the sampler just stores the reference and calls len(dataset) (i.e. its __len__) so it knows how many indices it is supposed to produce.
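A minimal plain-Python sketch of what a sequential sampler does (mirroring torch.utils.data.SequentialSampler, with hypothetical class names) makes the no-copy behaviour easy to see:

```python
class ExpensiveDataset:
    """Stand-in for a Dataset whose __init__ is very costly."""
    def __init__(self):
        self.init_count = 1   # pretend this line took minutes to run

    def __len__(self):
        return 10

class SequentialSamplerSketch:
    """Sketch of a sampler: stores a *reference*, never copies the dataset."""
    def __init__(self, data_source):
        self.data_source = data_source   # assignment binds a reference only

    def __iter__(self):
        return iter(range(len(self.data_source)))

    def __len__(self):
        return len(self.data_source)

ds = ExpensiveDataset()
sampler = SequentialSamplerSketch(ds)
print(sampler.data_source is ds)   # True: same object, __init__ ran once
print(list(sampler))               # [0, 1, ..., 9]
```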
I'm not sure, but I'm guessing it would work with anything that has a __len__. Report back if you find out!
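To illustrate the duck-typing idea: since the built-in samplers only ever call len(data_source), any object exposing __len__ should be enough in practice (though that's relying on an implementation detail, not a documented guarantee). A sketch with a hypothetical __len__-only class:

```python
class IndexOnly:
    """Hypothetical data_source that defines nothing but __len__."""
    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

def sequential_indices(data_source):
    # This is roughly what SequentialSampler.__iter__ boils down to:
    # it only needs len(data_source), never the items themselves.
    return list(range(len(data_source)))

print(sequential_indices(IndexOnly(4)))   # [0, 1, 2, 3]
```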
I find it difficult to relate the __len__ method of the Dataset class to the __len__ method of the Sampler class. What happens when they differ?
From phone. So if your sampler is longer than the dataset, it would eventually produce an index that is out of range, i.e. dataset[high_ind] ~> out-of-range error.
If it's shorter, some data points would simply never be used. You could handle these cases in your Dataset class I guess, but I'm not sure there's a situation where that makes sense.
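The too-long case can be sketched in plain Python (a list stands in for the Dataset, a range for the sampler's index stream):

```python
dataset = ["a", "b", "c"]       # stands in for a Dataset of length 3
too_long_sampler = range(5)     # a "sampler" claiming length 5

fetched, error = [], None
for idx in too_long_sampler:
    try:
        fetched.append(dataset[idx])   # what the DataLoader loop does
    except IndexError as e:
        error = e                      # index 3 is out of range
        break

print(fetched)   # ['a', 'b', 'c']
print(error)     # list index out of range
```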