Quick question about the torch.utils.data.Sampler class and the DataLoader.
What is the behaviour of self.data_source = data_source in the definition of all Samplers?
I have a custom data.Dataset class with a very expensive initialisation: does calling self.data_source = data_source in the __init__ of the sampler initialise a copy of the dataset?
My custom data.Dataset also has an attribute self.dataset_index (an instance of pandas.DataFrame) that is pretty much all I need to access from the sampler. Does the data_source parameter strictly need to be an instance of data.Dataset, or can it be anything with a __len__?
As always thanks in advance!
Also: len(df) on a pandas.DataFrame is a single number (the row count), and df.__len__() returns the same thing; it's df.shape that gives both dimensions.
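A quick check (with a small throwaway DataFrame) shows the two agree, and that df.shape is what holds both dimensions:

```python
import pandas as pd

# len(df) and df.__len__() are the same call: both return the row count.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
print(len(df))        # 3
print(df.__len__())   # 3
print(df.shape)       # (3, 2) -- rows and columns
```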
The data_source refers to the Dataset object. The dataset won't be re-initialised in the sampler; the sampler just stores the reference and calls len(dataset) (i.e. its __len__) so it knows how many indices it is supposed to produce.
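A minimal plain-Python sketch of what a sequential sampler does (mirroring torch.utils.data.SequentialSampler, with hypothetical class names) makes the no-copy behaviour easy to see:

```python
class ExpensiveDataset:
    """Stand-in for a Dataset whose __init__ is very costly."""
    def __init__(self):
        self.init_count = 1   # pretend this line took minutes to run

    def __len__(self):
        return 10

class SequentialSamplerSketch:
    """Sketch of a sampler: stores a *reference*, never copies the dataset."""
    def __init__(self, data_source):
        self.data_source = data_source   # assignment binds a reference only

    def __iter__(self):
        return iter(range(len(self.data_source)))

    def __len__(self):
        return len(self.data_source)

ds = ExpensiveDataset()
sampler = SequentialSamplerSketch(ds)
print(sampler.data_source is ds)   # True: same object, __init__ ran once
print(list(sampler))               # [0, 1, ..., 9]
```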
I'm not sure, but I'm guessing it would work with anything that has a __len__. Report back if you find out!
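To illustrate the duck-typing idea: since the built-in samplers only ever call len(data_source), any object exposing __len__ should be enough in practice (though that's relying on an implementation detail, not a documented guarantee). A sketch with a hypothetical __len__-only class:

```python
class IndexOnly:
    """Hypothetical data_source that defines nothing but __len__."""
    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

def sequential_indices(data_source):
    # This is roughly what SequentialSampler.__iter__ boils down to:
    # it only needs len(data_source), never the items themselves.
    return list(range(len(data_source)))

print(sequential_indices(IndexOnly(4)))   # [0, 1, 2, 3]
```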
I find it difficult to relate the __len__ method of the Dataset class to the __len__ method of the Sampler class. What happens when they differ?
From phone. So if your sampler is longer than the dataset, it would eventually produce an index that is out of range, i.e. dataset[high_ind] ~> out-of-range error.
If it's shorter, some data points would simply never be used. You could handle these cases in your Dataset class I guess, but I'm not sure there's a situation where that makes sense.
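The too-long case can be sketched in plain Python (a list stands in for the Dataset, a range for the sampler's index stream):

```python
dataset = ["a", "b", "c"]       # stands in for a Dataset of length 3
too_long_sampler = range(5)     # a "sampler" claiming length 5

fetched, error = [], None
for idx in too_long_sampler:
    try:
        fetched.append(dataset[idx])   # what the DataLoader loop does
    except IndexError as e:
        error = e                      # index 3 is out of range
        break

print(fetched)   # ['a', 'b', 'c']
print(error)     # list index out of range
```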