Caffe-Generated LMDB Data Load Crashes in PyTorch

huzaifa_kapasi · July 1, 2017, 11:26am

Hi

I am trying to load an LMDB dataset that was prepared in Caffe. I use LSUNClass as a reference for loading the data.

dataloader = Data.DataLoader(LSUNClass(db_path=path), batch_size=50, shuffle=True,
                             num_workers=6, pin_memory=False)

The DataLoader object is created without any issue. But when I iterate through it

for step, (x, y) in enumerate(dataloader):

the program crashes in __getitem__:

def __getitem__(self, index):
    img, target = None, None
    env = self.env
    with env.begin(write=False) as txn:
        imgbuf = txn.get(self.keys[index])
        print('The buffer information :',len(imgbuf),index,self.keys[index])
    buf = six.BytesIO()
    buf.write(imgbuf)
    buf.seek(0)
    img = Image.open(buf).convert('RGB')

  File "", line 30, in __getitem__
    img = Image.open(buf).convert('RGB')
  File "/anaconda/envs/py35/lib/python3.5/site-packages/PIL/Image.py", line 2319, in open
    % (filename if filename else fp))
OSError: cannot identify image file <_io.BytesIO object at 0x7f677db0b570>

Please note, however, that if I load the default LSUN LMDB dataset from PyTorch, it works fine. The same code is used, but with the Caffe LMDB it crashes.

Any input would be appreciated.

smth · July 2, 2017, 1:48pm

Do you have Pillow installed, and not PIL? That’s the one thing I can think of.

Other than that, have a look at https://github.com/pytorch/vision/blob/master/torchvision/datasets/lsun.py#L32-L49 which might help.

huzaifa_kapasi · July 3, 2017, 9:08am

Hi smth

Thanks for your reply. I have PIL installed, and I used the same piece of code as LSUNClass from the link you shared. I have updated the question with the code snippet.

The strange thing is that I can read the LSUN dataset, which is also in LMDB format, but I cannot read my Caffe LMDB dataset.

Regards

smth · July 3, 2017, 1:53pm

you need to have Pillow installed, not PIL (both provide import PIL)
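
Since both packages are imported as PIL, one way to tell them apart is to ask the package metadata which distribution is installed. A minimal sketch, assuming Python 3.8 or newer (for importlib.metadata) and that the distributions are registered under the names "Pillow" and "PIL"; the helper name installed_version is made up for illustration:

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Both distributions expose "import PIL", so check the distribution names instead.
for name in ("Pillow", "PIL"):
    print(name, "->", installed_version(name))
```

If the first line prints a version and the second prints None, the import is coming from Pillow.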

huzaifa_kapasi · July 3, 2017, 3:36pm

Sorry for the confusion. I have Pillow in my package list. It is imported as

from PIL import Image

I also tried loading the LMDB with Python 2.7, since Caffe uses Python 2.7, but it had no effect.
I tried loading the data without LMDB, using just a list of image names and labels and overriding the __getitem__ function, but that is horribly slow. That's why I tried LMDB. I will raise the slow list-based loading as an issue in a separate thread.

But I am still unable to figure out why

Image.open(buf)

fails. I printed the buffer information with

print('The buffer information :',len(imgbuf),index,self.keys[index])

and it prints the key correctly. The problem seems to be with the IO buffering. I tried using OpenCV instead and the error goes away, but there are problems further down the chain.

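import cv2
import numpy

with env.begin(write=False) as txn:
    imgbuf = txn.get(self.keys[index])
    print('The buffer information :', len(imgbuf), index, self.keys[index])
# numpy.frombuffer replaces the deprecated numpy.fromstring for binary data.
# Note: cv2.imdecode returns None instead of raising if it cannot decode the buffer.
img = cv2.imdecode(numpy.frombuffer(imgbuf, dtype=numpy.uint8), 1)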

smth · July 3, 2017, 5:30pm

Hmm, I don’t have any more pointers, but it does look like some bug in how the buffer is handled. Maybe there’s a newline at the end that PIL doesn’t expect, but OpenCV is okay with it?
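
A quick way to narrow this down is to look at the first bytes of the value returned by txn.get, because encoded image files start with a fixed signature and a stray trailing newline would not change that. The sketch below is only a diagnostic under stated assumptions: the helper names sniff_image_format and describe are made up, and the signature table covers just a few common formats.

```python
# Leading bytes ("magic numbers") of a few common encoded image formats.
SIGNATURES = {
    b"\xff\xd8\xff": "JPEG",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
    b"BM": "BMP",
}


def sniff_image_format(buf):
    """Return a format name if buf starts with a known signature, else None."""
    head = bytes(buf[:16])  # also works when buf is a memoryview
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return None


def describe(buf, n=16):
    """Summarise a raw value: length, guessed format, first n bytes as hex."""
    return "len={} format={} head={}".format(
        len(buf), sniff_image_format(buf), bytes(buf[:n]).hex()
    )


# Example with an in-memory PNG signature instead of a real LMDB value.
print(describe(b"\x89PNG\r\n\x1a\n"))
```

If describe(imgbuf) reports format=None, the value is probably not a bare image file. Caffe's conversion tools typically store each value as a serialized Datum protobuf message rather than raw JPEG or PNG bytes, which would be consistent with the Image.open error, but whether that applies to this particular dataset is an assumption and is not confirmed in this thread.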

huzaifa_kapasi · July 4, 2017, 3:21am

OK, thanks. I don't know, because I am using the LMDB generated by Caffe and I am not sure of its internals. For now I have to stick with Caffe for the use case I’m addressing.

One request: it would be great to have generic utilities for preparing datasets in LMDB, HDF5, or list formats that are compatible with PyTorch (that go through the iterator and batching cleanly), for seamless data preparation and loading. A lot of time is wasted on this.

Regards

smth · July 5, 2017, 4:35am

One request: it would be great to have generic utilities for preparing datasets in LMDB, HDF5, or list formats that are compatible with PyTorch

You can use regular Python for this. PyTorch does not need anything special.
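
As a rough illustration of what "regular Python" means here: a map-style dataset is just an object with __len__ and __getitem__, so a plain class like the one below can be handed to torch.utils.data.DataLoader. This is a sketch, not code from this thread; the names ListDataset and read_bytes and the (path, label) layout are made up for illustration.

```python
def read_bytes(path):
    """Default loader: return the raw bytes of the file at path."""
    with open(path, "rb") as f:
        return f.read()


class ListDataset:
    """Map-style dataset backed by a list of (path, label) pairs."""

    def __init__(self, samples, loader=None):
        self.samples = list(samples)
        # loader turns a path into a sample; swap in an image decoder as needed.
        self.loader = loader or read_bytes

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        path, label = self.samples[index]
        return self.loader(path), label


# Usage with a stand-in loader so the example does not touch the filesystem.
ds = ListDataset([("a.jpg", 0), ("b.jpg", 1)], loader=lambda p: p.encode())
print(len(ds), ds[1])
```

In a real pipeline the loader would decode the image and convert it to a tensor before the DataLoader batches it.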

huzaifa_kapasi · July 5, 2017, 5:02am

Thanks Soumith

I already have Python scripts that do the same. It seems I will have to study the PyTorch framework in more detail, especially DataLoader, to make sure the integration is seamless. I will post queries for more pointers if I hit a problem while doing that.

Li_Rong · June 2, 2018, 10:42am

Have you solved the problem? I have run into the same problem.