Corrupted images in ImageNet

papyan · May 23, 2018, 3:02am

Dear PyTorch users,

I wrote a script that loops over the ImageNet dataset and computes the top-1 and top-5 accuracies on both the train and validation sets. For validation, the loop works perfectly. For train, however, I get the following error:

OSError: cannot identify image file <_io.BufferedReader name='/path_to_imagenet/train/n04266014/n04266014_10835.JPEG'>

After some investigation, I found that the image that cannot be loaded is in fact 0 bytes in size, meaning it's corrupted. Not only that, but all the images in that class are 0 bytes as well. At that point I thought the issue might be with the downloaded ImageNet dataset, so I checked the md5sum. However, it matched the one posted on the ImageNet website.
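
For anyone who wants to run the same kind of check, here is a minimal sketch of a scan for 0-byte files. It assumes the train directory has one subfolder per class (the layout ImageFolder expects). The names find_empty_files and empty_count_per_class are hypothetical helpers written for illustration, not part of PyTorch or torchvision.

```python
import os
from collections import Counter


def find_empty_files(root):
    """Return a sorted list of paths to all zero-byte files under root."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # A file of size 0 cannot contain a valid image.
            if os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)


def empty_count_per_class(root):
    """Count zero-byte files per class folder (first path component under root)."""
    counts = Counter()
    for path in find_empty_files(root):
        rel = os.path.relpath(path, root)
        counts[rel.split(os.sep)[0]] += 1
    return counts
```

Calling empty_count_per_class('/path_to_imagenet/train') should show whether the empty files are confined to a single class such as n04266014 or spread across several.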

At this point I am wondering:

Does the ImageNet dataset contain images that are 0 bytes in size (to the point where a whole class consists of corrupted images)?
Is PyTorch's ImageFolder class supposed to be able to handle those 0-byte images?
Other than the downloaded tar file being broken, what else could explain why I have 0-byte images?
Could this be related to confusion between PIL and Pillow in my conda setup?

Thanks in advance,
V