Slow torch.load()

ChrisLiu2 · June 1, 2020, 3:00pm

In short, torch.load() takes about 30 s to 1 min to load a 300 MB file. I can’t say for certain that this is too slow, but older datasets of mine load almost instantly. Here’s what I’m doing:
I have a large dataset stored as a csv array of about 500 MB. Loading it with np.genfromtxt is very slow, so I load the file once, convert it to torch tensors, split it into 150 subsets, and save each subset with torch.save(). What might have gone wrong in this process?

ChrisLiu2 · June 1, 2020, 3:02pm

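import os

import numpy as np
import torch

# N, T and num_graphs are module-level constants defined elsewhere (not shown here).

def get_datalist(path_series):
    """ Parses the CSD recording csv file at path_series.
    Args:
        path_series: The path to the file with the recording. (string)
    Returns:
        data_list: A list of tuples of data. Each tuple contains
        (x, y). (list)
    """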
    data_list = []

    series = get_series(path_series)

    origins = range(num_graphs)
    for origin in origins:
        data_list.append(
            (series[:,:,0,origin].unsqueeze(2), series[:,:,1,origin].long())
        )
    return data_list

def process_csv_to_pt():
    """
    Takes the time series stored in a csv file and converts each
    session of the time series into its own .pt file.
    """
    path = 'Data/ready_data/series.txt'
    to_path = 'Data/ready_data/'
    series = get_datalist(path)
    for i, series_single in enumerate(series):
        # Zero-pad the index to two digits: ep00.pt, ep01.pt, ..., ep10.pt, ...
        filename = f'ep{i:02d}.pt'
        torch.save(series_single, os.path.join(to_path, filename))
        # Report progress only after the file has actually been written.
        print(i, 'th file finished')

def get_series(path_series):
    """
    Reads the csv file at path_series and returns the parsed data as a
    torch tensor of shape [N, T, 2, num_graphs].
    """
    matrix = np.genfromtxt( path_series, delimiter=',' )
    matrix = torch.from_numpy(matrix)
    Series = torch.zeros( [N,T,2,num_graphs] ) #2 is for x, y

    assert matrix.shape[0] == N * 2
    assert matrix.shape[1] == T * num_graphs

    for graph_id in range(num_graphs):
        Series[ :,:,0,graph_id ] = matrix[ :N,      graph_id*T : (graph_id+1)*T ]
        Series[ :,:,1,graph_id ] = matrix[ N:2*N ,  graph_id*T : (graph_id+1)*T ]

    return Series

This is the code that converts the csv file to .pt files. I’ll also paste how I load the .pt files.
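One possible cause, offered as an assumption rather than something confirmed in this thread: the x tensor in each tuple is a slice of the full series tensor, and a slice is a view that shares the original storage. torch.save() writes the whole storage behind a view, so each .pt file could end up holding the entire series. The sketch below uses a small stand-in tensor (the shapes and the helper name `saved_size_bytes` are made up for illustration) to compare the serialized size of a view with that of a clone.

```python
import io

import torch


def saved_size_bytes(obj):
    """Hypothetical helper: serialize obj in memory and return the byte count."""
    buf = io.BytesIO()
    torch.save(obj, buf)
    return buf.getbuffer().nbytes


# Small stand-in for the [N, T, 2, num_graphs] series tensor (shapes are made up).
big = torch.zeros(100, 50, 2, 20)

# Same slicing pattern as in get_datalist: this is a view into big's storage.
view = big[:, :, 0, 3].unsqueeze(2)

# clone() copies only the selected elements into a fresh, compact storage.
copy = view.clone()

size_view = saved_size_bytes(view)
size_copy = saved_size_bytes(copy)
print(f'view: {size_view} bytes, clone: {size_copy} bytes')
```

If the same effect is at work here, appending .clone() to the sliced x tensor before calling torch.save() would be one thing to try. The y tensor goes through .long(), which already produces a new tensor when the dtype changes.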

ChrisLiu2 · June 1, 2020, 3:05pm

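def read(self):
    """Loads every episode file and replaces its path in file_dict with the data."""
    self.read_file = True
    for ep in self.episode_lst:
        print(ep)
        filename = self.file_dict[ep]
        tup = torch.load(filename)
        print('Finished loading')
        if self.downsample == 0:
            # No downsampling: keep the loaded (x, y) tuple as it is.
            self.file_dict[ep] = tup
        else:
            # Keep every self.downsample-th entry along dimension 1.
            tup_new = (
                tup[0][:, ::self.downsample],
                tup[1][:, ::self.downsample]
            )
            self.file_dict[ep] = tup_new
    return self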

This is the code snippet that loads the files in my dataset object. self.episode_lst is a list of IDs for the individual files, and self.file_dict is a dict that maps each ID to the path of its file. When I use a subset of the dataset for preliminary results, it fits entirely in memory, so I load the files directly into the dataset object and then create the DataLoader objects, instead of loading files inside the DataLoader.
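To narrow down where the time goes, it may help to time torch.load() on its own and compare the file size on disk with the bytes the loaded tensors actually need. This is a minimal sketch: `timed_load` and `tensor_bytes` are hypothetical helpers, and the temporary file is a small stand-in for one episode file.

```python
import os
import tempfile
import time

import torch


def timed_load(path):
    """Hypothetical helper: return the loaded object and the seconds torch.load took."""
    start = time.perf_counter()
    obj = torch.load(path)
    return obj, time.perf_counter() - start


def tensor_bytes(t):
    """Bytes needed for the tensor's own elements, ignoring any larger backing storage."""
    return t.numel() * t.element_size()


# Stand-in for one episode file: a small (x, y) tuple (shapes are made up).
x = torch.zeros(100, 50, 1)
y = torch.zeros(100, 50, dtype=torch.long)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'ep00.pt')
    torch.save((x, y), path)
    file_bytes = os.path.getsize(path)
    (x_loaded, y_loaded), seconds = timed_load(path)

needed_bytes = tensor_bytes(x_loaded) + tensor_bytes(y_loaded)
print(f'load: {seconds:.4f} s, file: {file_bytes} bytes, tensors need: {needed_bytes} bytes')
```

A file that is much larger than `needed_bytes` would suggest that the saved tensors are views into a bigger storage, since torch.save() writes the whole storage behind a view.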