Documentation/tutorial data length mistake

Hey guys,
I just started using PyTorch and have been going through a lot of tutorials/documentation recently.
I think I’ve found an error in a few places where you calculate stats like accuracy and loss.
To calculate most stats you divide a running metric by the dataset's size. I've found roughly three ways online to accomplish this (in my experimentation so far, all of them have yielded incorrect results):

  • len(data_loader) - when using a batch size greater than 1, this returns the number of batches, not the number of samples (not what we’re looking for)
  • len(data_loader.dataset) - this works in some cases, but once you use a data sampler to create a dataloader over a smaller subset of your data it breaks down, as it still returns the full dataset’s size rather than the size of the subset actually being used
  • len(data_loader) * batch_size - this results in a slightly-off number, as PyTorch shrinks the final batch when the dataset’s length doesn’t divide evenly by your batch size (i.e. the mod isn’t 0), so this overcounts

What I’ve found, though, is that using len(data_loader.sampler) WILL return precisely what you’re looking for (the length of the training/validation data by itself). Just wanted to point this out, as I’ve spent a huge amount of time looking for why my loss/accuracy were always incredibly small.
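To make the difference concrete, here is a minimal sketch with a hypothetical toy dataset (100 samples) and a SubsetRandomSampler restricted to 60 of them; it compares the four lengths discussed above:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, SubsetRandomSampler

# Toy dataset of 100 samples, but we only iterate over a 60-sample subset.
dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))
sampler = SubsetRandomSampler(range(60))
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

print(len(loader))          # 8   -> number of batches, ceil(60 / 8), not samples
print(len(loader.dataset))  # 100 -> full dataset size; ignores the sampler
print(len(loader) * 8)      # 64  -> overcounts, since the last batch holds only 4 samples
print(len(loader.sampler))  # 60  -> the number of samples actually iterated
```

Dividing a running loss/accuracy by anything other than 60 here would silently skew the reported stats.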
I’m willing to show my code if needed.

Official tutorials/documentation:

  • first case
  • second case

Thanks for pointing this out.
If I’m not mistaken, both linked tutorials use the second approach (which would work, since no custom sampler is used).

I’m not sure if your suggestion would work in all use cases, e.g. when using a batch_sampler.
However, I haven’t thought about all edge cases. :wink:
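For the batch_sampler case, one version-agnostic way to count the samples actually being used is to sum the batch lengths yielded by the batch sampler itself, rather than relying on what loader.sampler points to in that configuration (a sketch with a hypothetical toy dataset; the exact sampler attributes may differ across PyTorch versions):

```python
import torch
from torch.utils.data import (DataLoader, TensorDataset,
                              SubsetRandomSampler, BatchSampler)

dataset = TensorDataset(torch.randn(100, 3))
# Batch over a 60-sample subset, 8 indices per batch.
batch_sampler = BatchSampler(SubsetRandomSampler(range(60)),
                             batch_size=8, drop_last=False)
loader = DataLoader(dataset, batch_sampler=batch_sampler)

# Summing the yielded index lists counts exactly the samples iterated,
# without assuming anything about loader.sampler in this configuration.
n_samples = sum(len(batch) for batch in loader.batch_sampler)
print(n_samples)  # 60
print(len(loader))  # 8 -> number of batches, ceil(60 / 8)
```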


Not entirely sure about all cases, but I would have thought using a custom dataset/sampler would be quite common.
By the way, the first link uses the first approach (just written in a slightly more generic form); I’ve copy-pasted the relevant line here: dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'val']}

Is it possible to add a note or something to the tutorials/documentation to point out the exceptions where the code won’t work? I feel like it would be extremely useful, seeing that this is an elusive bug to find, which can easily lead to misinterpreted stats.

Personally, I wouldn’t want to add the note to the mentioned tutorials, as they work fine in their current state. In my opinion there are a lot of edge cases where just copy-pasting a tutorial might not work out of the box, and I don’t see a custom sampler as the usual use case.

That being said, I think your suggestion would fit into the FAQ section.
Feel free to create an issue on GitHub and explain your proposal there, please. :slight_smile:

Ok, thanks. Will do soon :wink: