Hello,
I'm working on a semantic segmentation project with the PyTorch framework. I wrote example code for a UNet model with n_classes=1 and ran it on a Windows 10 PC. Everything worked, but training took a long time because of a weak GPU. The env was created via conda:
python = 3.6.6
PIL=5.4.1
pytorch=1.0.1
So I moved to a dedicated server with a 1080 Ti running Ubuntu 18.04 LTS, created the same env, and checked the package versions - everything matched. After that I moved the source code and dataset to the dedicated server and ran it. But after the first epoch I got the following exception:
https://hastebin.com/medolataxa.sql
OSError: image file is truncated (28 bytes not processed)
I'm not using truncated files. Everything was fine on Windows 10, but after the first epoch everything fell apart. (P.S. setting ImageFile.LOAD_TRUNCATED_IMAGES = True doesn't solve the issue.)
It is also strange that recursively running PIL.Image.open(<image_path>) over the dataset files on the dedicated server didn't throw any exception.
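Roughly the check I ran (just a sketch, '<data_dir>' is a placeholder for the dataset root):

import glob
import os
from PIL import Image

for path in glob.glob('<data_dir>/**/*', recursive=True):
    if os.path.isfile(path):
        Image.open(path)  # never raised an exception on the server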
Does anyone know how to fix it?
Maybe there were some issues moving these files to the server?
Could you try to load all images in a loop, store the index which gives you the error, and have a look at this particular file?
Something like this should give you the index:
for idx, (data, target) in enumerate(dataset):
    print(idx)
The error should then be related to idx+1. Depending on the Dataset you are using, you could try to get the corresponding image path and check the file manually.
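For example, something along these lines (just a sketch; dataset.image_paths is a hypothetical attribute, adapt it to however your Dataset stores the file locations):

# Sketch: catch the failing sample and report its index and file.
# `dataset.image_paths` is hypothetical -- replace it with whatever
# your Dataset actually uses to store the paths.
for idx in range(len(dataset)):
    try:
        data, target = dataset[idx]
    except OSError as e:
        print(f"Sample {idx} failed: {e}")
        print(f"File: {dataset.image_paths[idx]}")
        break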
Thank you @ptrblck for your reply,
I wrote a custom Dataset. Here is the code:
import glob
import logging

from PIL import Image
from torch.utils.data import Dataset
from tqdm import tqdm

logger = logging.getLogger(__name__)


class DatasetLoader(Dataset):
    def __init__(self, X, y, input_transform=None, label_transform=None):
        self.data = X
        self.labels = y
        self.input_transform = input_transform
        self.label_transform = label_transform

    @staticmethod
    def load_dataset(data_dir: str):
        logger.debug(f"load_dataset: Loading dataset from {data_dir}")
        inputs_dir = f'{data_dir}/inputs'
        labels_dir = f'{data_dir}/labels'
        inputs = []
        for image_path in tqdm(glob.glob(inputs_dir + '/*')):
            image = Image.open(image_path)
            inputs.append(image)
        labels = []
        for image_path in tqdm(glob.glob(labels_dir + '/*')):
            label = Image.open(image_path).convert('L')
            labels.append(label)
        return inputs, labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        data = self.data[idx]
        if self.input_transform is not None:
            data = self.input_transform(data)
        label = self.labels[idx]
        if self.label_transform is not None:
            label = self.label_transform(label)
        return data, label
Also, I zipped the dataset folder, moved it to the dedicated server via scp, and checked the sha256sum. Everything matched.
Thanks for the code!
You could add a try ... except block around the loop where Image.open is called and see which image causes problems.
Maybe you are using different PIL versions, such that this issue was fixed on your local machine?
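Something along these lines, adapted from your load_dataset loop (just a sketch; the .load() call forces PIL to read the whole file, so a truncated image should fail right there):

import glob
from PIL import Image
from tqdm import tqdm

inputs_dir = '<data_dir>/inputs'  # placeholder for your inputs folder

inputs = []
for image_path in tqdm(glob.glob(inputs_dir + '/*')):
    try:
        image = Image.open(image_path)
        image.load()  # force a full read; a truncated file raises OSError here
        inputs.append(image)
    except OSError as e:
        print(f"Failed to load {image_path}: {e}")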
Actually, image loading works fine. I checked it in Jupyter. The code that uses it is shown below:
def load_datasets(data_dir, input_size, test_pct=0.2, eval_size=10):
    train_transform, test_transform, label_transform = create_transforms(input_size)
    X, y = DatasetLoader.load_dataset(data_dir)
    train_slice = round((1 - test_pct) * len(X))
    train_data = DatasetLoader(X[:train_slice], y[:train_slice],
                               input_transform=train_transform, label_transform=label_transform)
    test_data = DatasetLoader(X[train_slice:], y[train_slice:],
                              input_transform=test_transform, label_transform=label_transform)
    eval_data = DatasetLoader(X[-eval_size:], y[-eval_size:],
                              input_transform=test_transform, label_transform=label_transform)
    logger.debug(f"load_datasets: (train_data, test_data, eval_data) sizes = "
                 f"{len(train_data), len(test_data), len(eval_data)}")
    return train_data, test_data, eval_data
So the images are loaded before iterating over the trainloader.
The error message points to PIL.ImageFile.load, which is weird if image loading is not an issue.
Could you create a (small) code snippet to reproduce this error?
Wait a minute, I'll provide sample training code and the exact line the exception is thrown from.
Here is my sample code for training:
with tensorboardX.SummaryWriter(log_dir=log_dir) as summary_writer:
    for epoch in range(epochs):
        epoch_train_loss = 0
        model.train()
        logger.debug(f"train: Running epoch {epoch + 1} out of {epochs}")
        for inputs, labels in tqdm(trainloader):
            inputs, labels = inputs.cuda(non_blocking=True), labels.cuda(non_blocking=True)
            outputs = model.forward(inputs)
            loss = criterion(outputs, labels)
            epoch_train_loss += loss.item()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
Also some utils:
def create_transforms(input_size):
    channel_means = [0.485, 0.456, 0.406]
    channel_stds = [0.229, 0.224, 0.225]
    train_tfms = transforms.Compose([transforms.Resize(input_size),
                                     transforms.ToTensor(),
                                     transforms.Normalize(channel_means, channel_stds)])
    test_tfms = transforms.Compose([transforms.Resize(input_size),
                                    transforms.ToTensor(),
                                    transforms.Normalize(channel_means, channel_stds)])
    mask_tfms = transforms.Compose([transforms.Resize(input_size),
                                    transforms.ToTensor()])
    return train_tfms, test_tfms, mask_tfms
def create_dataloaders(data_dir, input_size=256, test_pct=0.2, batch_size=64) -> (DataLoader, DataLoader, DataLoader):
    train_data, test_data, eval_data = load_datasets(data_dir, input_size, test_pct)
    trainloader = DataLoader(train_data, batch_size=batch_size, shuffle=True, num_workers=6, pin_memory=True)
    testloader = DataLoader(test_data, batch_size=batch_size, shuffle=False, num_workers=6, pin_memory=True)
    evalloader = DataLoader(eval_data, batch_size=1, shuffle=False, num_workers=6, pin_memory=True)
    return trainloader, testloader, evalloader

trainloader, testloader, evalloader = create_dataloaders(data_dir, test_pct=test_pct, batch_size=batch_size)
I had an idea that JPEG compression might differ between the Win10 and Linux machines depending on the library, so I converted all jpg images to png on Win10 and sent the zipped folder to the Linux machine via scp. But I get the same error again. The first iteration over the trainloader is okay, but it looks like the images get corrupted after being read, because when we try to read from the same trainloader a second time, we get a truncated image.
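The conversion was roughly this (a sketch; '<data_dir>' is a placeholder):

import glob
import os
from PIL import Image

for jpg_path in glob.glob('<data_dir>/**/*.jpg', recursive=True):
    png_path = os.path.splitext(jpg_path)[0] + '.png'
    Image.open(jpg_path).save(png_path)  # re-encode as PNG
    os.remove(jpg_path)  # keep only the PNG copy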
Any ideas?
Hmm, I tried a few random things and noticed that on my Ubuntu server there were 6 Python processes running the script, because num_workers=6. So I removed num_workers from the DataLoader creation and it worked.
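i.e. the loaders are now created roughly like this (sketch):

from torch.utils.data import DataLoader

# Same loaders, just without num_workers (defaults to 0, i.e. loading in the main process).
trainloader = DataLoader(train_data, batch_size=batch_size, shuffle=True, pin_memory=True)
testloader = DataLoader(test_data, batch_size=batch_size, shuffle=False, pin_memory=True)
evalloader = DataLoader(eval_data, batch_size=1, shuffle=False, pin_memory=True)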
@ptrblck could you offer some insight into why this happened?
Best regards,
Alex.
@smth Any ideas?
I'm really not sure why PIL throws an error if you use multiple workers.
Using
from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True
did not solve the error, right?
Could you upload the image somewhere? I could try to reproduce this issue on my Ubuntu machine.
I get truncated and corrupted images either way, which makes training impossible.
I use the LFW dataset, downloaded from here: http://vis-www.cs.umass.edu/lfw/part_labels/
Thanks for the link.
I've downloaded the lfw_funneled dataset and it's running fine:
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

path = './lfw_funneled/'
dataset = datasets.ImageFolder(
    path,
    transform=transforms.ToTensor()
)

# Check Dataset
for image, target in dataset:
    print(target)

# Check DataLoader
loader = DataLoader(
    dataset,
    num_workers=6,
    shuffle=False)

for data, target in loader:
    print(target)
No PIL errors on my machine with 6 workers. I'm using Ubuntu 18.04.1 LTS and PIL 5.4.1.
Try running more than one epoch. The first epoch, as I wrote before, was fine.
It's a clear case of a corrupted image (I've seen this error before).
Print the image filename as well, and inspect / redownload / delete the bad image.
I don’t get any errors for 5 epochs and still think it’s related to your image.
Did you try to redownload or delete the image as suggested?