Dataset size greater than CPU memory: how to create a dataset?

I have 400GB of data, but my CPU memory is only 256GB. The first parameter of torch.utils.data.DataLoader is a dataset, and I found I still need to load all the data into memory when I create the dataset. The following is my code:
import numpy as np
import torch.utils.data as Data

import Constants  # project-specific module defining SIG_PAD, BOS, EOS, PAD

class SignalDataset(Data.Dataset):
    def __init__(self, signal_path, label_path, signal_max_len, label_max_len):
        with open(signal_path, 'r') as tr:
            self.signals = tr.readlines()
        with open(label_path, 'r') as te:
            self.labels = te.readlines()
        self.signal_max_len = signal_max_len
        self.label_max_len = label_max_len

    @staticmethod
    def transform(signal):
        '''
        :param signal: shape (length,)
        :return: shape (length, 1)
        '''
        signal = np.array([float(v) for v in signal]).reshape(-1,1)
        return signal

    def __getitem__(self, index):
        signal_max_len = self.signal_max_len
        label_max_len = self.label_max_len
        signal = self.signals[index].strip().split(',')
        # str 2 int
        label = np.array([int(d) for d in self.labels[index].strip().split(',')])
        # str 2 float, reshape
        signal = np.array([float(v) for v in signal]).reshape(-1,1)
        assert len(signal.shape) == 2
        assert signal_max_len - signal.shape[0] >= 0
        signal = np.pad(signal, ((0, signal_max_len - signal.shape[0]), (0, 0)),
                        mode='constant',
                        constant_values=Constants.SIG_PAD)

        assert len(label.shape) == 1
        label = np.pad(np.array(label), (1, 1), mode='constant',
                       constant_values=(Constants.BOS, Constants.EOS))
        # label PAD padding
        assert label_max_len - label.shape[0] >= 0
        label = np.pad(label, (0, label_max_len - label.shape[0]), mode='constant',
                       constant_values=Constants.PAD)
        return signal, label

    def __len__(self):
        return len(self.labels)

How is your data stored? Usually you would just pass the file path to your __init__ method and load the data lazily in __getitem__ to avoid loading the whole dataset into memory.
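
For a line-per-sample text file like the one your code reads, something along these lines might work (a rough sketch, not tested against your data): build an index of byte offsets for each line in __init__, then seek to and read a single line in __getitem__. The LazySignalDataset name and the _index_lines/_read_line helpers are just illustrative, and your padding logic is left out for brevity.

import numpy as np
import torch.utils.data as Data

class LazySignalDataset(Data.Dataset):
    def __init__(self, signal_path, label_path):
        self.signal_path = signal_path
        self.label_path = label_path
        # Only the byte offset of each line is kept in memory, not the data itself.
        self.signal_offsets = self._index_lines(signal_path)
        self.label_offsets = self._index_lines(label_path)

    @staticmethod
    def _index_lines(path):
        # One pass over the file to record where each line starts.
        offsets = []
        with open(path, 'rb') as f:
            offset = f.tell()
            for line in f:
                offsets.append(offset)
                offset += len(line)
        return offsets

    @staticmethod
    def _read_line(path, offset):
        # Seek directly to the requested line and read only that line.
        with open(path, 'rb') as f:
            f.seek(offset)
            return f.readline().decode().strip()

    def __getitem__(self, index):
        signal_line = self._read_line(self.signal_path, self.signal_offsets[index])
        label_line = self._read_line(self.label_path, self.label_offsets[index])
        signal = np.array([float(v) for v in signal_line.split(',')]).reshape(-1, 1)
        label = np.array([int(d) for d in label_line.split(',')])
        # Apply your padding / BOS / EOS logic here as in your current __getitem__.
        return signal, label

    def __len__(self):
        return len(self.label_offsets)

Reopening the files inside __getitem__ keeps the dataset easy to use with num_workers > 0 in the DataLoader, at the cost of one open/seek per sample.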

The data are stored in a signal file and a label file. In the signal file, each line contains several float values, and in the label file, each line contains several int values. But each line has a different length.

Can you give me an example? Thank you very much.

I think you could try to use pd.read_csv with the chunksize argument to load your data lazily.
Have a look at the documentation for an example.
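
Roughly like this (an untested sketch with a placeholder path and chunk size; it assumes each row parses into a consistent number of columns):

import pandas as pd

# Read the signal file in chunks of 10,000 rows instead of all at once.
# 'signal.txt' and the chunk size are placeholders.
reader = pd.read_csv('signal.txt', header=None, chunksize=10_000)
for chunk in reader:
    signals = chunk.to_numpy(dtype=float)
    # ... process this chunk or feed it to your training loop ...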


Refer to How to use dataset larger than memory?