Custom Dataset w/ Specific Requirements

For a research project, I need to load a dataset of images along with numerical labels for drive and motor data. Each image is also labeled with a flag, and at runtime I need to be able to choose which image flags to ignore and which to keep. The current data loading system was written and optimized for Caffe and stores the files in hdf5 format. During training, extracting and converting the hdf5 data is the main bottleneck. Because of the ignore-flag feature, I was not able to use the default PyTorch Dataset to dynamically load the data, since each call to `__getitem__` must return an item. I am looking to convert our current dataset into a format that is optimized for loading into PyTorch and that also supports loading a frame and skipping it if its ignore flag is set. My current code for importing the dataset and using it for training is available here:

### Loading hdf5 Dataset w/ Ignore List
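For reference, a minimal sketch of what a loader like this might look like, assuming `h5py` and hypothetical dataset keys `images`, `drive`, `motor`, and `flags` (the real layout may differ):

```python
import h5py
import numpy as np

# Hypothetical sketch: the keys "images", "drive", "motor", and "flags"
# are assumptions for illustration, not taken from the original code.
def load_hdf5(path, ignore_flags):
    with h5py.File(path, "r") as f:
        images = f["images"][:]   # N x H x W x C image array
        drive = f["drive"][:]     # numerical drive labels
        motor = f["motor"][:]     # numerical motor labels
        flags = f["flags"][:]     # per-frame flag values
    # Drop every frame whose flag is in the runtime ignore list
    keep = ~np.isin(flags, list(ignore_flags))
    return images[keep], drive[keep], motor[keep], flags[keep]
```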

### Use of Dataset in Training Code
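And a rough sketch of the training-side usage, with dummy tensors and a placeholder model standing in for the real setup:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-ins: dummy tensors replace the real hdf5-backed data,
# and the model/optimizer are placeholders.
images = torch.randn(100, 3, 32, 32)
targets = torch.randn(100, 2)  # e.g. drive and motor values
loader = DataLoader(TensorDataset(images, targets), batch_size=16, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for epoch in range(10):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```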

What would be the best format to use for my particular use case? Are there any examples of setting up a similar type of dataset?

In general, I write my own Dataset class that inherits from the PyTorch Dataset, and it handles all the logic of what data and labels to feed to the network, and when. Then the PyTorch DataLoader doesn't have to know about any of that; it just loads pairs.
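As a generic sketch of that pattern (a hypothetical `FilteredDataset`; it assumes the flags are available up front so the valid indices can be computed once, which may or may not match your hdf5 setup):

```python
from torch.utils.data import Dataset

class FilteredDataset(Dataset):
    # Hypothetical sketch: flags are assumed to be known up front, so the
    # valid indices can be computed once and indexed into directly.
    def __init__(self, images, labels, flags, ignore_flags):
        self.images = images
        self.labels = labels
        # Keep only the indices whose flag is not in the ignore set
        self.keep = [i for i, f in enumerate(flags) if f not in set(ignore_flags)]

    def __len__(self):
        return len(self.keep)

    def __getitem__(self, idx):
        i = self.keep[idx]
        return self.images[i], self.labels[i]
```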

Disclaimer: I didn’t read the code, so I’m not sure precisely what the problem is.

Sorry if I wasn't being clear. My specific problem is that the PyTorch Dataset class has a `__getitem__` function that requires an index. I don't know whether the data at a certain index will be used until I load it and check its flag to see if I should ignore it. Each time `__getitem__` is called I need to return something, so this hasn't been working for me. Is there a way I can tell the PyTorch Dataset class to skip a particular index after a call to `__getitem__`?

It only takes an index so that the DataLoader can load a certain number of images during training (e.g. one epoch's worth of images). `__getitem__` could essentially ignore the index and iteratively load data, check the ignore flag, and only return the data if ignore is False. That's probably better than skipping the index, because you'll actually go through the same number of data points each time you call the DataLoader.
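A minimal sketch of that idea (the `frames` sequence, its `(image, label, flag)` layout, and the `epoch_size` parameter are assumptions for illustration, not part of the original code):

```python
from torch.utils.data import Dataset

class SkippingDataset(Dataset):
    # Hypothetical sketch: `frames` is assumed to be an indexable sequence of
    # (image, label, flag) triples; the flag is only inspected after loading.
    def __init__(self, frames, ignore_flags, epoch_size):
        self.frames = frames
        self.ignore_flags = set(ignore_flags)
        self.epoch_size = epoch_size
        self.cursor = 0  # position of the next frame to try

    def __len__(self):
        # Fixed length: the DataLoader will draw exactly this many samples
        return self.epoch_size

    def __getitem__(self, index):
        # Ignore `index`; scan forward (wrapping around) until a frame passes
        # the filter. Assumes at least one frame is not ignored.
        while True:
            image, label, flag = self.frames[self.cursor]
            self.cursor = (self.cursor + 1) % len(self.frames)
            if flag not in self.ignore_flags:
                return image, label
```

One caveat: with `num_workers > 0`, each worker process gets its own copy of the dataset, and therefore its own cursor, so the scan order will differ from the single-process case.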

How would the DataLoader know when I'm out of data? If the index is ignored and I just iteratively load data until it runs out, how can I signal to the DataLoader that I am done going through the dataset? Is it possible for `__getitem__` to return None and for the DataLoader to ignore that index? I have my dataset indexed; it's just the ignore list that needs to be applied dynamically.

The DataLoader samples data points until it has drawn `len(dataset)` samples. So you could just set the length of your dataset to a fixed number (by overriding the `__len__` method).
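For example, continuing the `SkippingDataset` sketch above with dummy frames (any frame whose flag equals 1 is skipped, and `epoch_size` fixes what one "epoch" means):

```python
import torch
from torch.utils.data import DataLoader

# Dummy frames for illustration: (image, label, flag) triples, flags cycle 0, 1, 2
frames = [(torch.randn(3, 32, 32), torch.tensor([0.1, 0.2]), i % 3)
          for i in range(100)]

# epoch_size fixes len(dataset), so every "epoch" yields exactly 200 samples
dataset = SkippingDataset(frames, ignore_flags={1}, epoch_size=200)
loader = DataLoader(dataset, batch_size=32)

for epoch in range(3):
    for images, labels in loader:
        pass  # training step goes here
    # per-epoch saving/validation code runs here, exactly as before
```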

With this approach, could I continue to have epoch training behavior? Currently I have some saving and validation code that should run after each epoch of data is shown.