DataLoader loading hidden image files

I’m trying to load images using a DataLoader with a specific batch size to feed into a CNN.

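import torch
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import ImageFolder

mean = torch.tensor([0.5], dtype=torch.float32)
std = torch.tensor([0.5], dtype=torch.float32)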

train_transforms = transforms.Compose([transforms.Resize(size=(700, 700)),
                                       transforms.RandomHorizontalFlip(p=0.5),
                                       transforms.RandomPerspective(distortion_scale=0.1, p=1),
                                       transforms.Grayscale(1),
                                       transforms.ToTensor(),
                                       transforms.Normalize(mean, std)]) 

test_transforms = transforms.Compose([transforms.Resize((700, 700)),
                                      transforms.ToTensor(),
                                      transforms.Normalize(mean, std)]) 

train_data = ImageFolder(train_dir, transform=train_transforms)
test_data = ImageFolder(test_dir, transform=test_transforms)

train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=32, shuffle=True)

When loading an (images, labels) pair via next:

data_iter = iter(train_loader)
images, labels = next(data_iter)

I get this error:

UnidentifiedImageError: cannot identify image file <_io.BufferedReader name='./data/train/1. angle-high/._Screen Shot 2020-11-29 at 8.23.59 AM.png'>

Checking for hidden files using Command+Shift+. on macOS, I can’t find any hidden files starting with “._” in the directory.

Hi. Yep, macOS creates this kind of file in its file system.
You have a couple of options here:

  1. Make sure there are no such files in your folder. You can check in the Mac terminal with ls -la, which also lists files starting with a dot, and then remove the files starting with “._” manually. I don’t like this option, though.
  2. Make your ImageFolder dataset robust to erroneous data. According to the documentation, you can pass a callable as the is_valid_file argument to decide whether each file is valid. The dataset will then ignore the invalid files.
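If you do go with option 1, a small script can do the search instead of checking every class folder by hand. This is only a sketch: the helper name find_dot_underscore_files and its delete flag are made up for illustration, and it simply walks the folder tree looking for names that start with “._”.

```python
from pathlib import Path


def find_dot_underscore_files(root, delete=False):
    """Return every file under `root` whose name starts with '._'.

    Hypothetical helper, not part of torchvision. With delete=True the
    matching files are also removed, so run it once without that first.
    """
    # rglob('._*') matches names starting with '._' at any depth.
    found = sorted(p for p in Path(root).rglob('._*') if p.is_file())
    if delete:
        for p in found:
            p.unlink()
    return found


# Example (dry run), using the folder from the error message:
#     for p in find_dot_underscore_files('./data/train'):
#         print(p)
```

Running it without delete=True first shows what would be removed before anything is touched.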

Minimal example for option 2:

import torchvision
from pathlib import Path

def check_valid(path):
    # Skip the "._*" files that macOS creates next to the real images.
    path = Path(path)
    return not path.name.startswith('._')

ds = torchvision.datasets.ImageFolder('your_root_folder_here', is_valid_file=check_valid)
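One caveat with the example above, as far as I know: when is_valid_file is given, ImageFolder no longer applies its built-in extension filter, so any other non-image file that passes the check (for example .DS_Store) would still be picked up. A slightly stricter sketch follows. The function name is_real_image and the extension tuple are my own choices, not anything from torchvision.

```python
from pathlib import Path

# Assumed set of extensions; adjust to whatever your dataset contains.
IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg', '.bmp', '.tif', '.tiff')


def is_real_image(path):
    """True for files that look like images and are not hidden dotfiles."""
    path = Path(path)
    if path.name.startswith('.'):
        # Covers the '._*' files as well as other dotfiles such as .DS_Store.
        return False
    # Compare the extension case-insensitively.
    return path.suffix.lower() in IMAGE_EXTENSIONS
```

It is passed the same way as before: ImageFolder('your_root_folder_here', is_valid_file=is_real_image).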

Hope it helps!

@Alexey_Demyanchuk Thank you so much! This is a really clean solution, and it works. Also, I didn’t know about ls -la, thanks for the tip!
