I have my data (very large np.arrays) saved on disk as batches with np.memmap. I am reading these batches as follows:
    x1 = np.memmap('path_to_file1', mode='c')
    x2 = np.memmap('path_to_file2', mode='c')
    ...
and combine them with ConcatDataset. I would like to apply some preprocessing to this combined dataset, but I don’t know how I should continue.
The reason I am troubled is that I don’t want to create copies of the arrays. For example, let’s say that I want to apply some standardization. Prior to converting to a ConcatDataset, I calculate the mean (weighted mean) and std (weighted std), then transform the x’s, and finally convert to a ConcatDataset. That is, I am doing the following:
    x1 = np.memmap('path_to_file1', mode='c')
    x2 = np.memmap('path_to_file2', mode='c')
    ...
    mean, std = get_weighted_mean_and_std([x1, x2, ...])
    for x in [x1, x2, ...]:
        x -= mean
        x *= 1/std
    dataset = ConcatDataset([x1, x2, ...])
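The body of get_weighted_mean_and_std is not shown here. As a rough sketch only, such a helper might compute a single pooled mean and std over all elements, weighting each batch by its element count and reading the data in chunks so that no full-size temporary is created. The scalar (not per-feature) statistics and the chunk parameter are assumptions made for illustration.

```python
import numpy as np

def get_weighted_mean_and_std(arrays, chunk=1_000_000):
    """Pooled scalar mean/std over several arrays (hypothetical sketch).

    Each array contributes in proportion to its number of elements.
    Data is read `chunk` elements at a time to limit temporary memory.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    for a in arrays:
        flat = a.reshape(-1)  # a view when the array is contiguous
        for start in range(0, flat.size, chunk):
            # Only this slice is converted to float64, not the whole array.
            block = np.asarray(flat[start:start + chunk], dtype=np.float64)
            n += block.size
            total += block.sum()
            total_sq += np.dot(block, block)
    mean = total / n
    # E[x^2] - E[x]^2; clamp at zero to guard against rounding error.
    var = max(total_sq / n - mean ** 2, 0.0)
    return mean, np.sqrt(var)
```

The sum-of-squares formula is simple but can lose precision when the mean is large relative to the spread; a two-pass or Welford-style update would be a more robust alternative.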
The problem is with the in-place assignments -= and *=, which return copies instead of views.
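One direction that might avoid modifying the memory-mapped arrays at all is to standardize lazily, one item at a time, when the item is accessed. The sketch below is only an illustration under assumptions: StandardizedView is a hypothetical name, the float32 output dtype is an arbitrary choice, and items are assumed to be indexed along the first axis.

```python
import numpy as np

class StandardizedView:
    """Map-style wrapper that standardizes a single item on access.

    Hypothetical sketch: the underlying array (e.g. an np.memmap) is
    never written to; only the requested item is read and transformed.
    """

    def __init__(self, array, mean, std):
        self.array = array
        self.mean = mean
        self.std = std

    def __len__(self):
        return len(self.array)

    def __getitem__(self, idx):
        # Read just this item; the subtraction allocates a small new array.
        item = np.asarray(self.array[idx], dtype=np.float32)
        return (item - self.mean) / self.std
```

Objects like this could then be passed to ConcatDataset in place of the raw arrays, assuming it only relies on __len__ and __getitem__ of its members. The trade-off is that the standardization is recomputed on every access instead of once up front.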