OSError: image file is truncated (28 bytes not processed) during training

Thank you @ptrblck for your reply,

I wrote a custom Dataset. Here is the code:

import glob

from PIL import Image
from torch.utils.data import Dataset
from tqdm import tqdm

# logger is assumed to be set up elsewhere, e.g. logger = logging.getLogger(__name__)

class DatasetLoader(Dataset):

    def __init__(self, X, y, input_transform=None, label_transform=None):

        self.data = X
        self.labels = y
        self.input_transform = input_transform
        self.label_transform = label_transform

    @staticmethod
    def load_dataset(data_dir: str):
        logger.debug(f"load_dataset: Loading dataset from {data_dir}")

        inputs_dir = f'{data_dir}/inputs'
        labels_dir = f'{data_dir}/labels'

        inputs = []
        for image_path in tqdm(glob.glob(inputs_dir + '/*')):
            image = Image.open(image_path)
            inputs.append(image)

        labels = []
        for image_path in tqdm(glob.glob(labels_dir + '/*')):
            label = Image.open(image_path).convert('L')
            labels.append(label)

        return inputs, labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        data = self.data[idx]
        if self.input_transform is not None:
            data = self.input_transform(data)

        label = self.labels[idx]
        if self.label_transform is not None:
            label = self.label_transform(label)
        return data, label

Also, I zipped the dataset folder, moved it to the dedicated server via scp, and checked the sha256sum. Everything matched.

Thanks for the code!
You could add a try ... except block around the loop where Image.open is called and see which image causes problems.
Maybe you are using different PIL versions, and this issue was already fixed on your local machine?
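
Something like this should narrow it down (a quick sketch based on your load_dataset loop; the .load() call forces the actual decoding, since Image.open is lazy):

import glob
from PIL import Image

for image_path in glob.glob(inputs_dir + '/*'):
    try:
        image = Image.open(image_path)
        image.load()  # force the decode here instead of lazily later
    except OSError as e:
        print(f"Failed to load {image_path}: {e}")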

Actually, image loading works fine. I checked it in Jupyter. The code that uses it is shown below:

def load_datasets(data_dir, input_size, test_pct=0.2, eval_size=10):
    train_transform, test_transform, label_transform = create_transforms(input_size)

    X, y = DatasetLoader.load_dataset(data_dir)
    train_slice = round((1 - test_pct) * len(X))

    train_data = DatasetLoader(X[:train_slice], y[:train_slice],
                               input_transform=train_transform, label_transform=label_transform)
    test_data = DatasetLoader(X[train_slice:], y[train_slice:],
                              input_transform=test_transform, label_transform=label_transform)
    eval_data = DatasetLoader(X[-eval_size:], y[-eval_size:],
                              input_transform=test_transform, label_transform=label_transform)

    logger.debug(f"load_datasets: (train_data, test_data, eval_data) sizes = "
                 f"{len(train_data), len(test_data), len(eval_data)}")
    return train_data, test_data, eval_data

So the images are loaded before the iteration over the trainloader.

The error message points to PIL.ImageFile.load, which is weird if image loading is not the issue.
Could you create a (small) code snippet to reproduce this error?

Wait a minute, I’ll provide sample training code and the exact line the exception is thrown from.

Here is my sample code for training:

with tensorboardX.SummaryWriter(log_dir=log_dir) as summary_writer:
    for epoch in range(epochs):
        epoch_train_loss = 0

        model.train()
        logger.debug(f"train: Running epoch {epoch + 1} out of {epochs}")
        for inputs, labels in tqdm(trainloader):
            inputs, labels = inputs.cuda(non_blocking=True), labels.cuda(non_blocking=True)
            outputs = model.forward(inputs)

            loss = criterion(outputs, labels)
            epoch_train_loss += loss.item()

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

Also some utils:

def create_transforms(input_size):
    channel_means = [0.485, 0.456, 0.406]
    channel_stds = [0.229, 0.224, 0.225]

    train_tfms = transforms.Compose([transforms.Resize(input_size),
                                     transforms.ToTensor(),
                                     transforms.Normalize(channel_means, channel_stds)])
    test_tfms = transforms.Compose([transforms.Resize(input_size),
                                    transforms.ToTensor(),
                                    transforms.Normalize(channel_means, channel_stds)])
    mask_tfms = transforms.Compose([transforms.Resize(input_size),
                                    transforms.ToTensor()])
    return train_tfms, test_tfms, mask_tfms

def create_dataloaders(data_dir, input_size=256, test_pct=0.2, batch_size=64) -> (DataLoader, DataLoader, DataLoader):
    train_data, test_data, eval_data = load_datasets(data_dir, input_size, test_pct)

    trainloader = DataLoader(train_data, batch_size=batch_size, shuffle=True, num_workers=6, pin_memory=True)
    testloader = DataLoader(test_data, batch_size=batch_size, shuffle=False, num_workers=6, pin_memory=True)
    evalloader = DataLoader(eval_data, batch_size=1, shuffle=False, num_workers=6, pin_memory=True)
    return trainloader, testloader, evalloader

trainloader, testloader, evalloader = create_dataloaders(data_dir, test_pct=test_pct, batch_size=batch_size)


As you can see, the exception is thrown at idx=0 within the __iter__() and __next__() functions.

@ptrblck

I had an idea that JPG compression might differ between the Win10 and Linux machines depending on the library, so I converted all JPG images to PNG on Win10 and sent the zipped folder to the Linux machine via scp. But I got the same error again. The first iteration over the trainloader is okay, but it looks like the images get corrupted after being read, because when we try to iterate over the same trainloader a second time, we get the truncated-image error.
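
To make that concrete, the failure pattern looks roughly like this (a sketch using the trainloader from above):

# the first pass over the DataLoader is fine, the second pass raises the OSError
for epoch in range(2):
    for inputs, labels in trainloader:
        pass  # epoch 0 works, epoch 1 fails with "image file is truncated"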

Any ideas?

Hmm, I tried some random things and found that on my Ubuntu server there were 6 instances of the Python process, because num_workers = 6. So I removed num_workers from the DataLoader creation and it worked.
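
(i.e. the loaders are now created without worker processes, roughly like this:)

trainloader = DataLoader(train_data, batch_size=batch_size, shuffle=True, pin_memory=True)  # num_workers defaults to 0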

Could you give some insight into why this happened? @ptrblck
Best regards,
Alex.

@smth Any ideas? :slight_smile:

I’m really not sure why PIL throws an error if you use multiple workers.
Using

from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True

did not solve the error, right?
Could you upload the image somewhere? I could try to reproduce this issue on my Ubuntu machine.

I get truncated and corrupted images anyway, which doesn’t allow training to proceed.

I use the LFW dataset, downloaded from here: http://vis-www.cs.umass.edu/lfw/part_labels/

Thanks for the link.
I’ve downloaded the lfw_funneled dataset and it’s running fine:

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

path = './lfw_funneled/'

dataset = datasets.ImageFolder(
    path,
    transform=transforms.ToTensor()
)

# Check Dataset
for image, target in dataset:
    print(target)

# Check DataLoader
loader = DataLoader(
    dataset,
    num_workers=6,
    shuffle=False)

for data, target in loader:
    print(target)

No PIL errors on my machine with 6 workers.
I’m using Ubuntu 18.04.1 LTS, and PIL 5.4.1.

Try to run more than one epoch. The first epoch, as I wrote before, was fine.

It’s a clear case of a corrupted image (I’ve seen this error before).
Print the image filename as well, and inspect / redownload / delete the bad image.

I don’t get any errors for 5 epochs and still think it’s related to your image.
Did you try to redownload or delete the image as suggested?

I came here stuck on exactly this same issue.

Initially, setting ImageFile.LOAD_TRUNCATED_IMAGES = True solved the problem, although in that initial case I was using num_workers=0.

In my case, it was reproducible that defining the loaders with num_workers > 0 would end up throwing the OSError exception some time during training.

As I understand it, num_workers=0 implies that processing is done in the same execution context as the training, whereas > 0 spawns other processes.

So my guess is that the spawned processes do not have ImageFile.LOAD_TRUNCATED_IMAGES = True set in them, so they fail when trying to load a corrupted image.

If that suspicion is correct, is there any way to perpetuate that setting to the spawned workers?
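
One idea I have (an untested sketch using the DataLoader's worker_init_fn hook; the dataset and batch size are just placeholders):

from PIL import ImageFile
from torch.utils.data import DataLoader

def set_pil_flags(worker_id):
    # runs inside each spawned worker process, so the flag is set there as well
    ImageFile.LOAD_TRUNCATED_IMAGES = True

loader = DataLoader(train_dataset, batch_size=64, num_workers=6,
                    worker_init_fn=set_pil_flags)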

Possible confounding factors for my case:

  • this is on Windows, as my only machine with a GPU is Windows (VR rig in the office :sweat_smile:)
  • I am running a pre-release build of Pillow (6.1.0.dev0), due to encountering this issue with my dataset:
    https://github.com/python-pillow/Pillow/issues/3769

Having multiple workers was important for my application because it seems that ~75% of the total training time is spent doing something other than just calculation, even with num_workers=10.
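
A rough way to check that split, as a sketch (train_step stands in for the forward/backward/optimizer code; CUDA's asynchronous execution makes the numbers only approximate):

import time

data_time, step_time = 0.0, 0.0
t0 = time.perf_counter()
for batch in loader:
    t1 = time.perf_counter()
    data_time += t1 - t0   # time spent waiting on the DataLoader
    train_step(batch)      # placeholder: forward, backward, optimizer step
    t0 = time.perf_counter()
    step_time += t0 - t1
print(f"loading: {data_time:.1f}s, training step: {step_time:.1f}s")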

My manual fix was to use this code to go through my datasets to find the image that was causing problems:

import tqdm
from PIL import ImageFile

for DUT in [train_dataset, valid_dataset, test_dataset]:
    for fn, label in tqdm.tqdm(DUT.imgs):
        try:
            im = ImageFile.Image.open(fn)
            im2 = im.convert('RGB')
        except OSError:
            print("Cannot load : {}".format(fn))

That did find one image that was unloadable, for my case.
(for any of the other Udacity Deep Learning Nanodegree folks who might find this via search, the file dogImages/train\098.Leonberger\Leonberger_06571.jpg was the unloadable file)

I trivially re-saved the file, which appears to have filled in any corrupted data, and the many-workers loader approach now works.
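
(For reference, the re-save can be done with Pillow itself, roughly like this sketch; with the truncated-image flag enabled, Pillow pads the missing data when the file is decoded:)

from PIL import Image, ImageFile

ImageFile.LOAD_TRUNCATED_IMAGES = True   # allow Pillow to decode the truncated file

path = 'dogImages/train/098.Leonberger/Leonberger_06571.jpg'  # the file mentioned above
img = Image.open(path)
img.load()                               # force a full decode; missing data is padded
img.convert('RGB').save(path)            # write a complete file back out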


Thanks a lot for this post, I had exactly the same issue on my Windows 10 machine. I ended up simply removing the mentioned file from the dataset while keeping num_workers > 0, which resolved the issue!
