Data_source parameter in Sampler.__init__

(Davide) #1

Hi everyone :slight_smile:
Quick question about the torch.utils.data.Sampler class and the DataLoader.

What is the behaviour of
self.data_source = data_source in the definition of all Samplers?
I have a custom data.Dataset class with a very expensive initialisation: does calling
self.data_source = data_source in the __init__ of the sampler create a copy of the dataset?
My custom data.Dataset also has an attribute self.dataset_index (an instance of pandas.DataFrame) that is pretty much all I need to access from the sampler. Does the data_source parameter strictly need to be an instance of data.Dataset, or can it be anything with a __len__ method?

As always thanks in advance!

(Davide) #2

Also found that len() on a pandas.DataFrame returns a single number (the number of rows); __len__() returns the same thing, while df.shape gives the two dimensions …
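To double-check what these calls actually return, here is a quick sketch (plain pandas, nothing PyTorch-specific):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# len() simply delegates to __len__(), so both return the row count.
print(len(df))        # 3
print(df.__len__())   # 3

# The two dimensions live in .shape instead.
print(df.shape)       # (3, 2)
```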

(Olof Harrysson) #3

Hi,

data_source is just a reference to the Dataset object; the dataset won't be reinitialised (or copied) in the sampler. The sampler simply calls len(dataset) so that it knows how many indices it is supposed to produce.

I’m not sure, but I’m guessing that it would work with anything that implements __len__. Report back if you find out :slight_smile:
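To illustrate why any object with __len__ should do, here is a minimal stand-in that mimics what torch.utils.data.SequentialSampler does (a hypothetical sketch, not the real torch code):

```python
class SequentialSamplerSketch:
    """Sketch of a sequential sampler: only ever asks data_source
    for its length -- it never copies or reads the data itself."""

    def __init__(self, data_source):
        # Just stores a reference; no copy of the dataset is made.
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source)))

    def __len__(self):
        return len(self.data_source)


class LengthOnly:
    """Not a Dataset at all -- only defines __len__."""
    def __len__(self):
        return 4


ds = LengthOnly()
sampler = SequentialSamplerSketch(ds)
print(list(sampler))           # [0, 1, 2, 3]
print(sampler.data_source is ds)  # True -- same object, no copy
```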

(Davide) #4

I find it difficult to relate the __len__ method of the Dataset class to the __len__ method of the Sampler class.
What happens when they differ?

(Olof Harrysson) #5

From phone, so briefly. If your sampler is longer than the dataset, it will eventually produce an index that is out of range, i.e. dataset[high_ind] raises an out-of-range error.

If it is shorter, some data points will never be used. You could handle these cases in your Dataset class I guess, but I’m not sure there is a situation where this makes sense.
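Both cases can be seen with a plain list standing in for the dataset (a toy sketch, not real torch code):

```python
dataset = ["a", "b", "c"]  # len(dataset) == 3

# Sampler longer than the dataset: the extra index is out of range.
too_long = range(5)
try:
    batch = [dataset[i] for i in too_long]
except IndexError as e:
    print("IndexError:", e)  # fails at dataset[3]

# Sampler shorter than the dataset: the item at index 2 is never used.
too_short = range(2)
print([dataset[i] for i in too_short])  # ['a', 'b']
```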