Sorry for the very late reply.
Thanks to your insightful answer, I was able to solve all the problems I faced!
To deal with the OOM problem, I used np.memmap to load the data directly from storage instead of reading everything into RAM.
Here is the code snippet:
import os
import json

import numpy as np
import torch
from torch.utils.data import Dataset, ConcatDataset, Subset, DataLoader


class PartialDataset(Dataset):
    """
    * Description: custom `Dataset` for `.npy` files of shape (N, C, H, W) (N > 1) grouped by date
        - i.e. mini-batched `.npy` files stored per date
        - therefore, the number of samples, `N`, differs from file to file
    """
    def __init__(self, read_path, date, transform=None):
        """
        * Arguments:
            - read_path (string): path of the `.npy` files
            - date (string): date (yyyymmdd) used as the file name
            - transform (callable, optional): optional transform to be applied on a sample
        """
        self.transform = transform
        self.path = read_path
        self.date = date
        self.data = self.read_memmap(f'{os.path.join(self.path, self.date)}.npy')

    def read_memmap(self, file_name):
        """
        * Description: read an `np.memmap` array from the directory
        * Argument:
            - file_name (string): path of the `.npy` and `.npy.conf` files
        * Output:
            - the whole array, loaded in a memory-efficient manner (np.memmap)
        """
        with open(file_name + '.conf', 'r') as file:
            memmap_configs = json.load(file)
        return np.memmap(file_name, mode='r+', shape=tuple(memmap_configs['shape']),
                         dtype=memmap_configs['dtype'])

    def __getitem__(self, index):
        """
        * Description: index a single sample
        * Argument:
            - index (int): index of the sample
        * Output:
            - input data, output data (torch.Tensor, torch.Tensor)
            - inputs: (Mask (0 - background, 1 - foreground) + input channels, height, width); output: (1, height, width)
        """
        # PATCH_HEIGHT and PATCH_WIDTH are global constants defined elsewhere in the script
        mask = torch.Tensor(self.data[index, 0, :, :]).reshape(1, PATCH_HEIGHT, PATCH_WIDTH)
        inputs = torch.Tensor(self.data[index, 2:4, :, :])
        output = torch.Tensor(self.data[index, 1, :, :]).reshape(1, PATCH_HEIGHT, PATCH_WIDTH)
        if self.transform is not None:
            inputs = self.transform(inputs)
        inputs = torch.cat([mask, inputs], dim=0)  # stay in torch instead of np.concatenate
        return (inputs, output)

    def __len__(self):
        """
        * Description: report the number of samples in the dataset
        * Output:
            - length (int)
        """
        return self.data.shape[0]
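For completeness, read_memmap assumes each .npy file is a raw memmap dump accompanied by a JSON sidecar (*.npy.conf) holding its shape and dtype. A writer counterpart could look roughly like this (the helper name write_memmap is only illustrative, not part of the code above):

import json
import numpy as np

def write_memmap(array, file_name):
    """Illustrative counterpart of `read_memmap`: dump `array` as a raw memmap
    and store its shape/dtype in a `<file_name>.conf` JSON sidecar."""
    mm = np.memmap(file_name, mode='w+', shape=array.shape, dtype=array.dtype)
    mm[:] = array[:]   # copy the data into the memory-mapped file
    mm.flush()         # make sure everything is written to disk
    with open(file_name + '.conf', 'w') as f:
        json.dump({'shape': list(array.shape), 'dtype': str(array.dtype)}, f)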
After reading the separate *.npy files as PartialDataset instances, I stored them in a list like below:
def construct_partial_dataset(read_path, test_date=['20190212', '20190612', '20190912', '20191216'], transform=transform):
    training_list, test_list = [], []
    for path, dirs, files in os.walk(read_path):
        if dirs != []:  # only process leaf directories that directly contain the files
            continue
        for file in sorted(files):
            if '.conf' in file:  # the JSON sidecars are read inside `read_memmap`
                continue
            is_train = file[:8] not in test_date  # file names start with the date (yyyymmdd)
            if is_train:
                training_list.append(PartialDataset(read_path=os.path.join(read_path, 'training'),
                                                    date=file[:8], transform=transform))
            else:
                test_list.append(PartialDataset(read_path=os.path.join(read_path, 'validation'),
                                                date=file[:8], transform=transform))
    return training_list, test_list
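For reference, this assumes a directory layout roughly like the one below (the folder and file names are only illustrative), with READ_PATH as a placeholder for the data root:

# READ_PATH/
# ├── training/
# │   ├── 20190101.npy
# │   ├── 20190101.npy.conf
# │   └── ...
# └── validation/
#     ├── 20190212.npy
#     ├── 20190212.npy.conf
#     └── ...
training_list, test_list = construct_partial_dataset(read_path=READ_PATH, transform=transform)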
After constructing the two lists (training_list and test_list), I followed your advice and used ConcatDataset and Subset to build the final training/test datasets containing all the corresponding arrays.
def concat_list_of_partial_datasets(dataset_list, num_samples=-1):
    """
    * Description: concatenate `PartialDataset` objects into one dataset using `torch.utils.data.ConcatDataset`
    * Arguments:
        - dataset_list (list): list of `PartialDataset` instances
        - num_samples (int): number of randomly drawn samples kept per partial dataset (-1 keeps all samples)
    * Output:
        - a dataset (torch.utils.data.ConcatDataset)
    """
    subsets = []
    for dataset in dataset_list:
        indices = torch.randperm(len(dataset))
        if num_samples > 0:  # only truncate when a positive sample count is requested
            indices = indices[:num_samples]
        subsets.append(Subset(dataset, indices))
    return ConcatDataset(subsets)
Finally, I could use DataLoader to load the data!
training_set = concat_list_of_partial_datasets(dataset_list=training_list, num_samples=8192)
test_set = concat_list_of_partial_datasets(dataset_list=test_list, num_samples=1024)
training_loader = DataLoader(training_set, batch_size=HYPERPARAMS['batch_size'], shuffle=True, num_workers=16)
test_loader = DataLoader(test_set, batch_size=HYPERPARAMS['batch_size'], shuffle=False, num_workers=16)
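Just to sketch how the loaders are consumed (nothing model-specific): the samples are only pulled out of the memmaps when a batch is actually collated, so memory usage stays roughly bounded by the batch size (plus whatever the OS keeps in its page cache).

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

for inputs, targets in training_loader:
    inputs, targets = inputs.to(device), targets.to(device)
    # ... forward pass / loss / backward step as usual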
Since I am not that proficient in Python and PyTorch, I still believe there is a more memory-efficient and better way to handle a situation like mine…!
(Please let me know if my code has any problems…)
Thank you again for your kind and gentle answer.
Have a good day!