Dataloader iterable


#1

Dear PyTorch community,

I am working on an optimization algorithm. This algorithm needs to draw a random sample from the dataloader at each iteration, so I do not have many epochs, but I do have a maximum iteration count (30000, for example). To implement it in the easiest way, I would like to access the dataset the way I access a list:

for i_data in range(max_iter):
	data = trainloader[i_data % len(trainloader)]

but the DataLoader object does not support indexing. Do you have a solution? I am working with the CIFAR10 dataset.


(Sabari Manohar) #2

One solution I can think of is to use a DataLoader with batch_size set to 1 and shuffle set to True. This will most probably work.
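As a rough sketch of that suggestion: the loader below uses batch_size=1 and shuffle=True, and a small generator restarts it whenever it is exhausted so the loop can run for a fixed number of iterations. The synthetic TensorDataset stands in for CIFAR10, and `iterate_forever` and the value of `max_iter` are hypothetical names chosen for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for CIFAR10: 10 samples with 3 features each.
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
labels = torch.arange(10)
dataset = TensorDataset(features, labels)

loader = DataLoader(dataset, batch_size=1, shuffle=True)


def iterate_forever(data_loader):
    """Yield batches indefinitely; each new pass over the loader is reshuffled."""
    while True:
        for batch in data_loader:
            yield batch


max_iter = 25  # hypothetical iteration budget
seen_labels = []
for step, (x, y) in zip(range(max_iter), iterate_forever(loader)):
    # x has shape (1, 3) and y has shape (1,) because batch_size is 1.
    seen_labels.append(int(y.item()))
```

Note that this gives one random pass after another rather than independent draws, so every sample is seen once before any sample repeats.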


#3

Thank you for your answer.

Actually, because I need to store some information per sample (so I need to keep a fixed dataset), I would prefer to have a list and pick a random element from it. This suits my problem better :-).
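For illustration, a minimal sketch of the list-like access described here: unlike the DataLoader, a map-style Dataset can be indexed directly with an integer. A synthetic TensorDataset is assumed in place of CIFAR10 (torchvision's CIFAR10 dataset also returns an (image, label) pair when indexed), and `per_sample_info` is a hypothetical container for whatever needs to be stored per sample.

```python
import random

import torch
from torch.utils.data import TensorDataset

# Synthetic stand-in for CIFAR10: 10 samples with 3 features each.
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
labels = torch.arange(10)
dataset = TensorDataset(features, labels)

max_iter = 25           # hypothetical iteration budget
rng = random.Random(0)  # seeded so the run is reproducible
per_sample_info = {}    # hypothetical per-sample state, keyed by dataset index

for i_data in range(max_iter):
    idx = rng.randrange(len(dataset))  # random index into the fixed dataset
    x, y = dataset[idx]                # index the Dataset, not the DataLoader
    per_sample_info[idx] = per_sample_info.get(idx, 0) + 1
```

Because the index is known at every step, any bookkeeping can be tied to it. The trade-off is that batching and worker processes from the DataLoader are bypassed.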


(Issam H Laradji) #4

There has been a long discussion on this, have a look at this (it might help):


(Royi) #5

Did you find a solution?

I also defined a DataLoader iterator on the built-in MNIST dataset.
I’m looking for a way to access batches randomly (for example, by index).

Any idea?


#6

I did not find a solution by indexing.

However, you can customize the RandomSampler class to fit your needs. When you create your data loader, you simply pass an instance of it as the sampler argument: sampler = CustomRandomSampler.
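A minimal sketch of what such a custom sampler could look like, assuming the goal is a fixed number of random draws with replacement. `FixedLengthRandomSampler` and its `num_draws` parameter are hypothetical names, not part of PyTorch, and the synthetic dataset stands in for the real one.

```python
import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset


class FixedLengthRandomSampler(Sampler):
    """Hypothetical sampler: yields `num_draws` random indices, with replacement."""

    def __init__(self, data_source, num_draws):
        self.data_source = data_source
        self.num_draws = num_draws

    def __iter__(self):
        n = len(self.data_source)
        # Draw indices uniformly from [0, n) and hand them out one by one.
        return iter(torch.randint(0, n, (self.num_draws,)).tolist())

    def __len__(self):
        return self.num_draws


# Synthetic stand-in dataset: 10 samples with 3 features each.
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
labels = torch.arange(10)
dataset = TensorDataset(features, labels)

sampler = FixedLengthRandomSampler(dataset, num_draws=30)
loader = DataLoader(dataset, batch_size=4, sampler=sampler)

total = 0
for x, y in loader:
    total += y.shape[0]  # the last batch may be smaller than batch_size
```

Recent PyTorch versions also accept replacement=True and num_samples on the built-in RandomSampler, which may cover this case without a custom class.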


(Royi) #7

Hi,

Is there an example for that?

Thank You.


#8

Everything is here : http://pytorch.org/docs/master/data.html

You can check the source code. It is not so hard. See DataLoader and RandomSampler.


(Royi) #9

As you can imagine, I went through the documentation and couldn’t find how to do this.
I need a way to sample a subset of a data loader object.

If there is an example of how to do so, it would be great.


(Royi) #10

Could anyone please assist me with that?
See Dataloader iterable.


#11

If you have a look at the RandomSampler class and you understand it, you are almost done.

class RandomSampler(Sampler):
    """Samples elements randomly, without replacement.

    Arguments:
        data_source (Dataset): dataset to sample from
    """

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(torch.randperm(len(self.data_source)).long())

    def __len__(self):
        return len(self.data_source)

How does it work? It does not load any data itself: its __iter__ method yields the indices of your dataset in the order of a random permutation. So how would you modify it to suit your needs?
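To make this concrete, here is a small sketch showing that a sampler only produces indices, and that the samples themselves come from indexing the dataset with those indices. The tiny synthetic dataset is an assumption for illustration.

```python
import torch
from torch.utils.data import RandomSampler, SequentialSampler, TensorDataset

# Five samples; the label of each sample equals its index.
dataset = TensorDataset(
    torch.arange(5, dtype=torch.float32).unsqueeze(1),
    torch.arange(5),
)

# Iterating a sampler gives dataset indices, not data.
sequential_indices = [int(i) for i in SequentialSampler(dataset)]
random_indices = [int(i) for i in RandomSampler(dataset)]

# The data is only fetched when the dataset is indexed with those indices.
samples = [dataset[i] for i in random_indices]
fetched_labels = [int(y) for _, y in samples]
```

Here `sequential_indices` is always 0 to 4 in order, while `random_indices` is a permutation of the same values that changes from run to run.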

P.S.: Please proofread your post before posting… There were some typos in your previous post.


(Royi) #12

@tux,
I’m new to Python so I might miss the point.

I’d like to have something like:

trainData, trainLabels = trainLoader.SampleByIdx(batchIdx)

Assuming SampleByIdx is my own defined sampler.
It seems the magic in RandomSampler happens at return iter(torch.randperm(len(self.data_source)).long()).
I just don’t understand where the samples are loaded in this line.

Thank You.

P. S.
I think I fixed typos, thank you for letting me know.


#13

The samples are not loaded in this line. The RandomSampler class is just a tool for the DataLoader class. As I said before, if you have a look at the DataLoader class, you will find this:

    if batch_sampler is None:
        if sampler is None:
            if shuffle:
                sampler = RandomSampler(dataset)
            else:
                sampler = SequentialSampler(dataset)
        batch_sampler = BatchSampler(sampler, batch_size, drop_last)

So your dataloader will yield the data in this order (assuming you are in the sampler = RandomSampler(dataset) branch):

torch.randperm(len(self.data_source)).long()

This is how RandomSampler works. So you want to customize the RandomSampler class in order to control the order in which your data are loaded by your dataloader. Once that is done, you will know which sample you are currently working on when you enumerate through your dataloader.
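One possible way to get the by-index access asked about earlier, sketched under some assumptions: the batch index lists are drawn once from a BatchSampler and kept, and a helper collates the matching samples on demand. `sample_by_idx` is a hypothetical helper (it mirrors the SampleByIdx call above and is not a DataLoader method), and the synthetic dataset stands in for MNIST or CIFAR10.

```python
import torch
from torch.utils.data import BatchSampler, RandomSampler, TensorDataset
from torch.utils.data.dataloader import default_collate

# Synthetic stand-in dataset: 10 samples, label i for sample i.
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
labels = torch.arange(10)
dataset = TensorDataset(features, labels)

# Draw the random batches once and keep them, so batch k is always the same.
batch_sampler = BatchSampler(RandomSampler(dataset), batch_size=4, drop_last=False)
batch_indices = [[int(i) for i in batch] for batch in batch_sampler]


def sample_by_idx(batch_idx):
    """Hypothetical helper: return the batch stored at position batch_idx."""
    indices = batch_indices[batch_idx % len(batch_indices)]
    # default_collate stacks the individual samples into batched tensors.
    return default_collate([dataset[i] for i in indices])


train_data, train_labels = sample_by_idx(1)
```

Since `batch_indices` is an ordinary list, it is always known which dataset samples make up the batch being processed.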