Images not read properly anymore after an epoch of successful training

Hi,

I’m working on video classification: I have extracted the frames from each video in my dataset, preprocessed them (i.e., cropped the faces), and saved them as .png images. I was able to run one epoch of training and validation, but then I noticed that running any more epochs always resulted in an error, because some images could never be read properly anymore, neither with cv2 nor with PIL.Image (I tried both). So, if I rerun the training process, it throws an error in the first epoch of that run. When I locate one of these “invalid” images and run the following code:

    from PIL import Image
    import numpy as np

    path = ''  # path to image
    img = Image.open(path)
    im = np.array(img)

I get the following error:

    OSError: unrecognized data stream contents when reading image file

If I try:

    from PIL import ImageFile
    ImageFile.LOAD_TRUNCATED_IMAGES = True

then I do not get this error, but the image then contains many zeros at the end (these zeros are not part of the original image).

Strangely, if I run the same image-opening code (without LOAD_TRUNCATED_IMAGES) in the same conda environment but on a different machine, the image opens fine, which suggests that the file itself is not actually corrupted.
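For reference, one way to check whether the bytes on disk really are identical on both machines is to compare checksums; this is just a sketch using only the standard library, and the path is a placeholder:

    import hashlib

    def file_md5(path):
        # Hash the raw bytes on disk, independent of any image decoder
        with open(path, 'rb') as f:
            return hashlib.md5(f.read()).hexdigest()

    print(file_md5('frame_0001.png'))  # placeholder path; run on both machines and compare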

I also tried creating a new conda environment on the problematic machine and installing only the newest version of PIL, but the same error occurs.

Further, I deleted all the images and then pre-processed all the frames again, but the problem repeats: the first epoch of training is fine, and then certain images (many, but not all) cannot be read anymore.
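In case it helps with debugging, a sweep like the following (run once right after preprocessing and once after the first epoch) should show exactly which files become unreadable. This is just a sketch, assuming all frames live as .png files under one root directory:

    import glob
    import os
    from PIL import Image

    def find_unreadable(root):
        # Try to verify every .png under root; collect the ones PIL rejects
        bad = []
        for p in glob.glob(os.path.join(root, '**', '*.png'), recursive=True):
            try:
                with Image.open(p) as img:
                    img.verify()  # integrity check without fully decoding the image
            except Exception as e:
                bad.append((p, e))
        return bad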

Here’s how I read the clips in my custom Dataset. I have only included the relevant methods from the class:

    def get_clip(self, idx):
        # Map the flat clip index to (video, clip-within-video) via the
        # cumulative clip counts
        video_idx = bisect.bisect_right(self.cumulative_sizes, idx)
        if video_idx == 0:
            clip_idx = idx
        else:
            clip_idx = idx - self.cumulative_sizes[video_idx - 1]

        path = self.paths[video_idx]
        frames = sorted(os.listdir(os.path.join(self.root, path)))
        start_idx = clip_idx * (self.frames_per_clip * self.frame_dilation + self.step_between_clips - 1)
        end_idx = start_idx + self.frames_per_clip * self.frame_dilation
        video = []
        # renamed loop variable from idx to frame_idx to avoid shadowing the argument
        for frame_idx in range(start_idx, end_idx, self.frame_dilation):
            # img = cv2.imread(os.path.join(self.root, path, frames[frame_idx]))
            # img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # cv2.imread returns BGR
            img = Image.open(os.path.join(self.root, path, frames[frame_idx]))
            img = np.array(img)
            video.append(torch.from_numpy(img))

        video = torch.stack(video)

        return video, video_idx

    def __getitem__(self, idx):
        video, video_idx = self.get_clip(idx)
        if video_idx < self.videos_per_type['youtube'] + self.videos_per_type['real']:
            label = 0
        else:
            label = 1
        label = torch.tensor(label, dtype=torch.float32).unsqueeze(-1)

        if self.transform is not None:
            video = self.transform(video)

        return video, label, video_idx

I would massively appreciate any help regarding this strange issue, as I am currently out of ideas. Please let me know if you need more information. Thanks!

Could you try to open the images with PIL, convert them to NumPy arrays (so that you copy them), and then close the PIL files, so you can be sure you are not manipulating the images somewhere by mistake? They are read properly the first time but not the second, i.e., the next epoch.

Thank you for your suggestion. I should clarify that it only trains for one epoch once the images have been deleted and re-processed. Once this epoch has run, any subsequent reading of the files from that machine is not possible, i.e., it gives the error above. So, if I rerun the training process, it will throw an error in the first epoch of that run. I will edit my post to clarify.

Actually, it is clear. What I meant is that you could open, copy, and close, so you can be sure you are not manipulating or writing to the image file itself. I remember having similar issues (not this bad) when I did not close image files properly.
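For example, a minimal sketch of what I mean:

    from PIL import Image
    import numpy as np

    # Open, copy, close: the context manager closes the file handle,
    # and np.array() copies the pixel data out of PIL's buffer
    with Image.open(path) as img:
        im = np.array(img)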

OK, I see, I will delete and re-process the images and then follow your suggestion. Thanks!

I wonder, though: if the images were accidentally written to, isn’t it strange that I can read them fine from another machine (with the same conda environment)? Doesn’t this suggest that they are not actually corrupted?

Yes, you are right, I missed that detail. If the image file really is intact, that would imply that in the second epoch the dataloader isn’t reading from the actual file. That might not make sense, but that is how the problem looks at the moment, and I am curious at this point:
is pin_memory set to True? If so, try False.
I can’t think of any other reason you would not be accessing ‘fresh’ data.
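i.e., something like this (just a sketch; the batch size and worker count are placeholders):

    from torch.utils.data import DataLoader

    loader = DataLoader(
        dataset,           # the custom video Dataset from above
        batch_size=8,      # placeholder value
        num_workers=4,     # placeholder value
        pin_memory=False,  # the setting to try toggling
    )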

pin_memory is set to False

Any luck solving this issue?