I came here stuck on exactly this same issue.
Initially, setting `ImageFile.LOAD_TRUNCATED_IMAGES = True` solved the problem, although in that initial case I was using `num_workers=0`.
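For reference, that flag lives in Pillow's `ImageFile` module, so it is set like this:

```python
from PIL import ImageFile

# Tell Pillow to pad out truncated image data instead of raising OSError.
ImageFile.LOAD_TRUNCATED_IMAGES = True
```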
In my case it was reproducible that defining the loaders with `num_workers > 0` would end up throwing the `OSError` exception at some point during training.
As I understand it, `num_workers=0` means the data processing runs in the same execution context as the training, whereas `num_workers > 0` spawns other processes. So my guess is that the spawned processes do not have `ImageFile.LOAD_TRUNCATED_IMAGES = True` set in them, so they fail when trying to load a corrupted image.
If that suspicion is correct, is there any way to propagate that setting to the spawned workers?
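For what it is worth, on Windows the workers are started with "spawn" rather than "fork", so module-level state set in the main process after import is not inherited. One approach that might work (a sketch only, not something I have verified on this setup; the function name and loader arguments below are placeholders) is the `worker_init_fn` hook on `DataLoader`, which runs inside each worker process as it starts:

```python
from PIL import ImageFile
from torch.utils.data import DataLoader

def allow_truncated(worker_id):
    # Runs inside every freshly started worker process, so the
    # module-level Pillow flag gets set there as well.
    ImageFile.LOAD_TRUNCATED_IMAGES = True

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True,
                          num_workers=10, worker_init_fn=allow_truncated)
```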
Possible confounding factors for my case:
- this is on Windows, as my only machine with a GPU is Windows (VR rig in the office)
- I am running a pre-release build of Pillow (6.1.0.dev0), due to encountering this issue with my dataset:
https://github.com/python-pillow/Pillow/issues/3769
Having multiple workers was important for my application because ~75% of the total training time seems to be spent doing something other than the actual computation, even with `num_workers=10`.
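For anyone wanting to check where that time goes, a rough way to split an epoch into "waiting on the loader" versus "compute" (assuming the usual `model`, `criterion`, `optimizer` objects, the `train_loader` above, and a CUDA device) is something like:

```python
import time
import torch

data_time, compute_time = 0.0, 0.0
end = time.perf_counter()
for images, labels in train_loader:
    start = time.perf_counter()
    data_time += start - end            # time spent blocked on the DataLoader

    images, labels = images.cuda(), labels.cuda()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()

    torch.cuda.synchronize()            # count queued GPU work as compute time
    end = time.perf_counter()
    compute_time += end - start

print("data: {:.1f}s, compute: {:.1f}s".format(data_time, compute_time))
```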
My manual fix was to use this code to go through my datasets to find the image that was causing problems:
```python
import tqdm
from PIL import Image

# Leave ImageFile.LOAD_TRUNCATED_IMAGES at its default (False) here,
# otherwise truncated files are silently padded instead of raising.
for DUT in [train_dataset, valid_dataset, test_dataset]:
    for fn, label in tqdm.tqdm(DUT.imgs):
        try:
            im = Image.open(fn)          # lazy; only reads the header
            im2 = im.convert('RGB')      # forces the full decode
        except OSError:
            print("Cannot load : {}".format(fn))
```
That did find one unloadable image in my case.
(for any of the other Udacity Deep Learning Nanodegree folks who might find this via search, the unloadable file was `dogImages/train\098.Leonberger\Leonberger_06571.jpg`)
I trivially re-saved the file, which appears to have filled in any corrupted data, and the many-workers loader approach now works.
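For anyone who wants to do that re-save programmatically rather than in an image editor, something along these lines should work (a sketch, using the path from my case), with the truncated-load flag enabled so Pillow pads out the missing bytes:

```python
from PIL import Image, ImageFile

ImageFile.LOAD_TRUNCATED_IMAGES = True   # pad out the missing bytes on load

path = r"dogImages/train\098.Leonberger\Leonberger_06571.jpg"
with Image.open(path) as im:
    fixed = im.convert("RGB")            # forces the full (padded) decode
fixed.save(path, "JPEG")                 # write a clean, complete file back out
```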