About the details of shuffle in DataLoader

I am trying to shuffle my dataset. Instead of using epochs, I have written a while loop over iterations.

Roughly, my code looks like this:

    trainloader = data.DataLoader(
        t_loader,
        batch_size=cfg["training"]["batch_size"],
        num_workers=cfg["training"]["n_workers"],
    )
......

    while i <= cfg["training"]["train_iters"] and flag:
        for (images, labels) in trainloader:
          .......

If I set shuffle to True, will it still shuffle the data in this case? I raised this doubt because I am not using the concept of epochs at all here, just iterations instead.

Hi and welcome to the forum :slight_smile:

It should work, as the train loader is agnostic to the way you loop (while or for). The essential part is that you iterate with for (images, labels) in trainloader; whether that happens inside a while loop or a for loop is not important.

You can verify that it does shuffle by printing, for example, the labels of the first batch on each pass:

while i <= cfg["training"]["train_iters"] and flag:
    for j, (images, labels) in enumerate(trainloader):
        if j == 0:
            print(labels)
            break

(Note that this will be an infinite loop by itself, since i is not modified).

Thanks for the reply

Solved the issue.
But the point I was getting at was that instead of using epochs, I am using iterations.

According to deep learning nomenclature:

number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).
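
As a concrete example of that arithmetic (hypothetical numbers, assuming the dataloader does not drop the last partial batch):

    import math

    # Hypothetical numbers, just to illustrate the conversion
    num_examples = 1000   # size of the training set
    batch_size = 100      # examples per mini-batch
    train_iters = 50      # iteration budget, as in cfg["training"]["train_iters"]

    iters_per_epoch = math.ceil(num_examples / batch_size)  # == len(trainloader) when drop_last=False
    num_epochs = math.ceil(train_iters / iters_per_epoch)   # 50 iterations = 5 epochs here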

I was asking whether using iterations instead of epochs would cause any issue when shuffle=True.

Actually, my bad, I just re-read your code and realized it depends on the way you update the i variable, and in your case, it probably wouldn’t work.

Iterations are just a different way of counting than epochs, and the data loader doesn’t care which one you use. However, you need to be careful when counting.

What I recommend is converting your number of iterations into epochs, so that the traditional PyTorch training structure is respected (in particular, the shuffling happens normally in the data loader), while keeping track of the iterations inside your training loop (e.g. using the j variable from my short example above as the iteration index within the current epoch).

To conclude: it all depends on your use case, but if you want more iterations than there are mini-batches in the data loader (i.e. more than one epoch’s worth), you need to use the epochs counting system, so that the data loader is shuffled at each epoch.
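
For instance, here is a minimal sketch of that conversion, reusing the cfg and trainloader names from your snippet (the actual forward/backward/optimizer code is elided as a comment): the outer for loop lets the data loader reshuffle at every epoch, while a global counter i stops training once the iteration budget is reached.

    import math

    train_iters = cfg["training"]["train_iters"]
    iters_per_epoch = len(trainloader)                    # number of mini-batches per epoch
    num_epochs = math.ceil(train_iters / iters_per_epoch)

    i = 0
    for epoch in range(num_epochs):
        # each new `for ... in trainloader` reshuffles when shuffle=True
        for j, (images, labels) in enumerate(trainloader):
            i += 1
            # ... forward pass, loss, backward pass, optimizer step ...
            if i >= train_iters:
                break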


Every time you iterate over a DataLoader, Python triggers its __iter__() method, which will, under the hood, call your sampler’s __iter__() method.

If you set DataLoader(..., shuffle=True), you are in practice instantiating a RandomSampler inside the DataLoader. Thus, it does not matter whether you iterate for a few batches, a full epoch, or however you want to count it; every time you do:

for data, label in dataloader:  # iter -> triggers __iter__() -> shuffles data
    pass

Your data gets “shuffled” (internally, the indices pointing to the samples are returned in a new random order).
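
As a quick self-contained check of that behaviour, here is a toy example I am assuming for illustration, where the labels are just the sample indices so the printed order makes the reshuffling visible:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Toy dataset of 10 samples whose labels are their indices,
    # so printing the labels reveals the sampling order
    dataset = TensorDataset(torch.arange(10).float().unsqueeze(1), torch.arange(10))
    loader = DataLoader(dataset, batch_size=5, shuffle=True)

    for p in range(2):
        # each pass calls loader.__iter__(), which draws a fresh permutation
        # from the RandomSampler created by shuffle=True
        order = [int(l) for _, labels in loader for l in labels]
        print(f"pass {p}: {order}")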

The following Python code triggers the same reshuffling, but grabbing the iterator explicitly:

iterator = iter(dataloader) # iter -> triggers __iter__() -> shuffles data
for i in range(100):
    images, labels = next(iterator)

Be careful with the last example: if 100 > len(dataloader), it will raise a StopIteration exception, which is handled under the hood when you iterate over the DataLoader in a for loop, but not in this case.
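
If you do want more steps than len(dataloader) with this pattern, one way to handle it (a sketch, where num_iters is a hypothetical iteration budget) is to catch StopIteration and re-create the iterator, which also reshuffles the data:

    iterator = iter(dataloader)
    for i in range(num_iters):  # num_iters may be larger than len(dataloader)
        try:
            images, labels = next(iterator)
        except StopIteration:
            # the previous pass over the data is exhausted:
            # re-creating the iterator reshuffles the batches
            iterator = iter(dataloader)
            images, labels = next(iterator)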


I get it, but my aim is to run through the entire dataset even when using a RandomSampler, and I feel like, if iterations are used, there is a possibility that an image could be missed even after running through all the iterations.

That should not be an issue, because every call to for (data, label) in dataloader: shuffles the data (if you set shuffle=True) but still traverses every image in the dataset, at least in the single-GPU case (I’m not sure about the multi-GPU case, because of the randomness and asynchronous behaviour).

As long as there are more iterations than there are mini-batches in the dataloader, you should be fine!


That is why I added the second example, @bhanu. If you are iterating over the DataLoader:

for image, labels in dataloader:
    pass

You are guaranteed that the dataloader will visit EVERY single image (a full epoch). If you want to make your training iterations shorter than epochs, you could do something like this:

N = len(dataloader)  # number of mini-batches in the dataloader (one epoch)
M = 100              # iteration length (batches per "iteration")
T = 10               # total number of iterations
# Get iterator
iterator = iter(dataloader)  # iter -> triggers __iter__() -> shuffles data
current = 0
for _ in range(T):  # loop over the number of iterations
    for _ in range(M):  # your iteration length
        images, labels = next(iterator)  # grab the next batch

        # do your batch thingy

        current += 1
        if current >= N:
            # we arrived at the end of the iterator -- reshuffle
            current = 0
            iterator = iter(dataloader)  # iter -> triggers __iter__() -> shuffles data

    # End of iteration, save your "iteration stats"
In the above example, since you only grab the iterator of the dataloader once outside the loop (and re-create it only when it is exhausted), your inner-loop iterations are guaranteed to go through all of your data in random order.
