Loading large image data

Hello, I’m trying to load a large image dataset that won’t fit into RAM. I’ve looked up a similar question here on the forums, but can’t seem to get the answer working. The variable data_loc holds the directory with the images and targets.

class MyDataset(Data.Dataset):
    def __init__(self):
        self.data_files = os.listdir(data_loc)
        #sort(self.data_files)

    def __getindex__(self, idx):
        return load_file(self.data_files[idx])

    def __len__(self):
        return len(self.data_files)

set_test = MyDataset()
loader = Data.DataLoader(set_test,batch_size = BATCH_SIZE, num_workers=8)


for step, (x,y) in enumerate(set_test):
    *do stuff*
    set_test = MyDataset()
    loader = Data.DataLoader(set_test,batch_size = BATCH_SIZE, num_workers=8)



But I get a NotImplementedError for set_test. Any thoughts on how to fix this?

You should change __getindex__ to __getitem__.

Also, the usual approach is to iterate your DataLoader, not the Dataset.
Try:

for batch_idx, (x, y) in enumerate(loader):
    #do stuff

Is there a reason you are re-initializing the Dataset and DataLoader in the for-loop?

Thank you!

I think I’m on the right track now. I was confused about what to iterate over and if it needed to be re-initialized.

I’m just stuck on the data loading portion now. I figured it would be pretty easy, but I’m not sure how to write the load_file function.

My files are stored in a folder data, laid out as follows:

/data/images/ images to use
/data/targets.txt

Is this the correct format for a loader to work, or do I need to have each batch be a new set of folders?

data_loc = '.../data/'

def load_file(file):
    pass  # not sure what goes in here yet

class MyDataset(Data.Dataset):
    def __init__(self, data_files):
        self.data_files = sorted(data_files)

    def __getitem__(self, index):
        return load_file(self.data_files[index])

    def __len__(self):
        return len(self.data_files)
set_test = MyDataset(data_loc)
loader = Data.DataLoader(set_test,batch_size = BATCH_SIZE, num_workers=8)

The folder looks good. However, you will need the target so that the Dataset will return the data sample and its target.
If you have images in the folder, you can simply use PIL.Image.open() to load the image.
After loading, you could apply transformations to this image and finally cast it to a Tensor.
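
For example, a minimal load_file sketch (torchvision’s transforms are an assumption here; pick whatever preprocessing your model needs):

from PIL import Image
import torchvision.transforms as transforms

transform = transforms.ToTensor()  # extend with more transforms as needed

def load_file(file):
    # open the image and close the underlying file handle right away
    with Image.open(file) as img:
        return transform(img.convert('RGB'))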

Let me know if you need any help.

Hi, sorry to keep asking for help. I’ve done a bit more and I’m getting the error “OSError: [Errno 24] Too many open files”. I’ve tried adding some lines that seemed to fix this error for other people, but it’s still not working.

I couldn’t figure out how to attach the targets to each individual image, so I created an array of the targets and append one to each image file as it comes in. (Not sure if that will work.)

import torch
import torch.utils.data as Data
from PIL import Image

data_loc = '.../sample_data/'
torch.multiprocessing.set_sharing_strategy('file_system')

target_counter = 0

"""
Getting the targets
"""
with open('.../annotations.txt') as f:
    content = f.readlines()
#makes each one a float
targets = [x.split(',') for x in content]
for a in targets:
    for ind,val in enumerate(a):
        #a[ind] = int(float(val))
        a[ind] = float(val)
targets = torch.FloatTensor(targets)


def load_file(file):
    global target_counter  # without this, the increment below raises UnboundLocalError
    temp = Image.open(file)
    keep = temp.copy()
    temp.close()
    data = (keep, targets[target_counter])  # tuple literal; tuple(a, b) is an invalid call
    target_counter = target_counter + 1
    return data
    
class MyDataset(Data.Dataset):
    def __init__(self, data_files):
        self.data_files = sorted(data_files)

    def __getitem__(self, index):
        return load_file(self.data_files[index])

    def __len__(self):
        return len(self.data_files)
    
set_test = MyDataset(data_loc)
loader = Data.DataLoader(set_test,batch_size = BATCH_SIZE, num_workers=8)

Don’t be sorry for asking for help :wink:

Unfortunately I’m not familiar with the sharing strategies, so I don’t know if setting it to file_system helps.
Did you read about it somewhere?

I cannot see where you are opening a lot of files without closing them. Could you post the whole code, please?

Also, besides the error you are seeing, your code is a bit dangerous, since you have a loose mapping between the input and target. While the file is loaded using index in __getitem__, you are using a global target_counter to load the target. If you set shuffle=True in the DataLoader, your data will be assigned to essentially random targets.

To fix this, you could pass index to load_file and use targets[index].
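
Something like this, just as a sketch of that change:

def load_file(file, index):
    temp = Image.open(file)
    keep = temp.copy()
    temp.close()
    # the target is now looked up with the same index as the image
    return keep, targets[index]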

Well thank you :smile: !

I just pulled the file_system line from another PyTorch forum question that seemed to have a similar issue.

This is all the code that’s running and causing the errors at this point. I’m not opening files in any other part of the code right now.

Annotations.txt has data that looks like:

1, 205, 5.976959, 9.223372E+18, 13.00167, 9.223372E+18, 9.223372E+18, 2.116816, 3.283184, 9.223372E+18
1, 210, 2.403473, 9.223372E+18, 13.00638, 9.223372E+18, 9.223372E+18, 2.744155, 2.655845, 9.223372E+18

with each newline being a new input vector.

And just to be safe, I’ve moved the targets outside of the folder I’m loading from, so I only load images.

so the folders are now
…/sample_data2/images/image files.bmp
…/sample_data/annotations.txt

data_loc = '/Users/markmartinez/Downloads/sample_data2/'
torch.multiprocessing.set_sharing_strategy('file_system')

"""
Getting the targets
"""
with open('/Users/markmartinez/Downloads/sample_data/annotations.txt') as f:
    content = f.readlines()
#makes each one a float
targets = [x.split(',') for x in content]
for a in targets:
    for ind,val in enumerate(a):
        #a[ind] = int(float(val))
        a[ind] = float(val)
targets = torch.FloatTensor(targets)


def load_file(file, index):
    temp = Image.open(file)
    keep = temp.copy()
    temp.close()
    data = (keep, targets[index])  # tuple literal instead of the invalid tuple(a, b) call
    return data
    
class MyDataset(Data.Dataset):
    def __init__(self, data_files):
        self.data_files = sorted(data_files)

    def __getitem__(self, index):
        return load_file(self.data_files[index],index)

    def __len__(self):
        return len(self.data_files)
    
set_test = MyDataset(data_loc)
loader = Data.DataLoader(set_test,batch_size = BATCH_SIZE, num_workers=8)

and this is the error I’m getting


Traceback (most recent call last):
  File "/anaconda/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2847, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-53-eb2d61b6aaa1>", line 7, in <module>
    with open('/Users/markmartinez/Downloads/sample_data/annotations.txt') as f:
OSError: [Errno 24] Too many open files: '/Users/markmartinez/Downloads/sample_data/annotations.txt'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/anaconda/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 1795, in showtraceback
    stb = value._render_traceback_()
AttributeError: 'OSError' object has no attribute '_render_traceback_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/anaconda/lib/python3.5/site-packages/IPython/core/ultratb.py", line 1092, in get_records
    return _fixed_getinnerframes(etb, number_of_lines_of_context, tb_offset)
  File "/anaconda/lib/python3.5/site-packages/IPython/core/ultratb.py", line 312, in wrapped
    return f(*args, **kwargs)
  File "/anaconda/lib/python3.5/site-packages/IPython/core/ultratb.py", line 347, in _fixed_getinnerframes
    records = fix_frame_records_filenames(inspect.getinnerframes(etb, context))
  File "/anaconda/lib/python3.5/inspect.py", line 1454, in getinnerframes
    frameinfo = (tb.tb_frame,) + getframeinfo(tb, context)
  File "/anaconda/lib/python3.5/inspect.py", line 1411, in getframeinfo
    filename = getsourcefile(frame) or getfile(frame)
  File "/anaconda/lib/python3.5/inspect.py", line 671, in getsourcefile
    if getattr(getmodule(object, filename), '__loader__', None) is not None:
  File "/anaconda/lib/python3.5/inspect.py", line 700, in getmodule
    file = getabsfile(object, _filename)
  File "/anaconda/lib/python3.5/inspect.py", line 684, in getabsfile
    return os.path.normcase(os.path.abspath(_filename))
  File "/anaconda/lib/python3.5/posixpath.py", line 362, in abspath
    cwd = os.getcwd()
OSError: [Errno 24] Too many open files

Ok, could you check ulimit -n in a terminal and, if possible, increase the limit?
Are you working on a remote server or a local machine?
Could it be that the machine just has a lot of open file handles?
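
If it’s easier, you can also check (and raise) the limit from Python itself with the standard resource module. This is just a sketch; the 4096 below is an example value and must not exceed the hard limit:

import resource

# current soft and hard limits on open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)

# raise the soft limit for this process only (example value)
resource.setrlimit(resource.RLIMIT_NOFILE, (4096, hard))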

Hi! The file error was happening because I needed to restart the kernel.

I realized that there is a great tutorial on data loading that addresses a lot of my issues from the very start, so I think I’m good on this particular issue now.

http://pytorch.org/tutorials/beginner/data_loading_tutorial.html
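
For reference, the tutorial’s pattern applied to this thread looks roughly like this (the paths and the annotations layout are taken from my posts above; treat it as a sketch, not the exact tutorial code):

import os
import torch
import torch.utils.data as Data
import torchvision.transforms as transforms
from PIL import Image

class MyDataset(Data.Dataset):
    def __init__(self, image_dir, annotation_file, transform=None):
        # collect the image paths once; the images themselves are loaded lazily
        self.image_paths = sorted(
            os.path.join(image_dir, f) for f in os.listdir(image_dir))
        # the targets are small, so they can live in RAM
        with open(annotation_file) as f:
            rows = [[float(v) for v in line.split(',')] for line in f]
        self.targets = torch.FloatTensor(rows)
        self.transform = transform

    def __getitem__(self, index):
        # one image is opened per call, so the dataset never has to fit into RAM
        with Image.open(self.image_paths[index]) as img:
            img = img.convert('RGB')
        if self.transform is not None:
            img = self.transform(img)
        return img, self.targets[index]

    def __len__(self):
        return len(self.image_paths)

set_test = MyDataset('/Users/markmartinez/Downloads/sample_data2/images/',
                     '/Users/markmartinez/Downloads/sample_data/annotations.txt',
                     transform=transforms.ToTensor())
# BATCH_SIZE as defined earlier
loader = Data.DataLoader(set_test, batch_size=BATCH_SIZE, num_workers=8)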

Thank you so much for your help!

I wonder if it will be slow when using this method.