Caffe-Generated LMDB Data Load Crashes in PyTorch

Hi

I am trying to load an LMDB dataset that was prepared in Caffe. I use LSUNClass as a reference to load the data.

dataloader = Data.DataLoader(LSUNClass(db_path=path), batch_size=50, shuffle=True,
                             num_workers=6, pin_memory=False)

The DataLoader object is created without any issue. But when I iterate through it

for step, (x, y) in enumerate(dataloader):

The program crashes in __getitem__:

def __getitem__(self, index):
    img, target = None, None
    env = self.env
    with env.begin(write=False) as txn:
        imgbuf = txn.get(self.keys[index])
        print('The buffer information :',len(imgbuf),index,self.keys[index])
    buf = six.BytesIO()
    buf.write(imgbuf)
    buf.seek(0)
    img = Image.open(buf).convert('RGB')

File "", line 30, in __getitem__
    img = Image.open(buf).convert('RGB')
File "/anaconda/envs/py35/lib/python3.5/site-packages/PIL/Image.py", line 2319, in open
    % (filename if filename else fp))
OSError: cannot identify image file <_io.BytesIO object at 0x7f677db0b570>

Please note, however, that if I load the default LSUN LMDB dataset from PyTorch, it works fine. The same code is used, but with the Caffe LMDB it crashes.

Any inputs will be appreciated.

Do you have Pillow installed, and not PIL? That’s the one thing I can think of.

Other than that, have a look at https://github.com/pytorch/vision/blob/master/torchvision/datasets/lsun.py#L32-L49 which might help.

Hi smth

Thanks for your reply. I have PIL installed, and I used the same piece of code from LSUNClass in the link you shared. I have updated the question with the code snippet.

The strange thing is that I can read the LSUN dataset, which is also in LMDB format, but I cannot read my Caffe LMDB dataset.

Regards

You need to have Pillow installed, not PIL (both provide import PIL).

Sorry for the confusion. I have Pillow in my package list. It is imported as

from PIL import Image

I also tried loading the LMDB with Python 2.7, since Caffe uses Python 2.7, but it had no effect.
I tried to load the data without LMDB, using just a list of image names and labels and overriding the __getitem__ function, but it is horribly slow. That's why I tried to load the LMDB. I will raise the issue of loading a dataset as a list in a separate thread.

But I am still unable to figure out why

Image.open(buf) fails.

I printed the

print('The buffer information :',len(imgbuf),index,self.keys[index])

It prints the key correctly. The problem seems to be with the IO buffer. I tried OpenCV instead, and the error goes away, but it causes a problem further down the chain.

    with env.begin(write=False) as txn:
        imgbuf = txn.get(self.keys[index])
        print('The buffer information :', len(imgbuf), index, self.keys[index])
    import cv2
    import numpy
    img = cv2.imdecode(
        numpy.fromstring(imgbuf, dtype=numpy.uint8), 1)

Hmm, I don't have any more pointers, but it does look like some bug with how the buffer is formed. Maybe there’s a newline at the end that PIL doesn’t expect, but that OpenCV is okay with?
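
One way to test the buffer hypothesis is to look at the first bytes of the value returned by txn.get. The sketch below is only a diagnostic, and sniff_image_format is a hypothetical helper name, not part of PIL or torchvision. It checks whether the buffer starts with the signature of a common encoded image format. If it returns None, the value is probably not an encoded image at all. That would be consistent with a database written by Caffe's convert_imageset tool, which as far as I know stores serialized Datum protobuf messages rather than raw image files. Note also that cv2.imdecode returns None instead of raising when it cannot decode, which may be why the error only shows up further down the chain.

```python
def sniff_image_format(buf):
    """Return a format name if buf starts with a known image signature, else None.

    Hypothetical diagnostic helper: it only inspects leading bytes and does
    not validate that the rest of the buffer is a well-formed image.
    """
    signatures = [
        (b"\xff\xd8\xff", "jpeg"),
        (b"\x89PNG\r\n\x1a\n", "png"),
        (b"GIF87a", "gif"),
        (b"GIF89a", "gif"),
        (b"BM", "bmp"),
    ]
    for magic, name in signatures:
        if buf.startswith(magic):
            return name
    # WebP files are RIFF containers: "RIFF", a 4-byte size, then "WEBP".
    if buf[:4] == b"RIFF" and buf[8:12] == b"WEBP":
        return "webp"
    return None
```

Inside __getitem__, something like print(sniff_image_format(imgbuf), imgbuf[:16]) right after txn.get would show whether the Caffe LMDB values look like image files or like something else.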

OK, thanks… I don't know, because I am using the LMDB generated by Caffe and am not sure of its internals. For now I have to stick with Caffe for the use case I’m addressing.

One request: it would be great to have generic utilities to prepare a dataset in LMDB, HDF5, or list form that is compatible with PyTorch (that goes through the iterator and batching cleanly), for seamless data preparation and loading. A lot of time is wasted on this.

Regards

One request is it would be great if we get generic utilities to prepare the dataset in LMDB, HDF5,LIST that is compatible with Pytorch

You can use regular python for this. PyTorch does not need anything special.
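
To illustrate the point, here is a minimal sketch of a map-style dataset written in plain Python. ListDataset and its loader argument are hypothetical names chosen for this example. The assumption is that DataLoader only needs an object that implements __len__ and __getitem__ for this style of dataset; in real code you would usually subclass torch.utils.data.Dataset and pass an image-decoding function as the loader.

```python
class ListDataset:
    """Map-style dataset over (path, label) pairs (hypothetical example class)."""

    def __init__(self, samples, loader):
        # samples: iterable of (path, label) pairs
        # loader: callable that turns a path into a sample, e.g. an image decoder
        self.samples = list(samples)
        self.loader = loader

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        path, label = self.samples[index]
        # Decoding happens lazily, one item at a time, when the item is requested.
        return self.loader(path), label
```

An instance such as ListDataset(pairs, loader=my_decode_fn), where my_decode_fn is whatever function you already use to read an image, could then be handed to DataLoader in place of LSUNClass.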

Thanks Soumith

I already have Python scripts to do the same. It seems I will have to study the PyTorch framework in more detail, especially DataLoader, to make sure the integration is seamless. I will post queries for more pointers if I hit a problem while doing that.

Have you solved the problem? I ran into the same problem.