I have my data (very large np.arrays) saved on disk in batches with np.memmap. I am reading these batches as follows:
x1 = np.memmap('path_to_file1', mode='c')
x2 = np.memmap('path_to_file2', mode='c')
...
and combining them with ConcatDataset. I would like to apply some preprocessing to this combined dataset, but I don't know how I should proceed.
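For concreteness, here is a runnable sketch of this setup using tiny throw-away files (the paths, shapes, and dtype are made up for the example; note that np.memmap with mode='c' and no dtype/shape assumes uint8 over the whole file, so the real values have to be passed):

```python
import os
import tempfile
import numpy as np
from torch.utils.data import ConcatDataset

# Tiny stand-ins for the real batch files (names invented for the sketch)
tmpdir = tempfile.mkdtemp()
path1 = os.path.join(tmpdir, "batch1.dat")
path2 = os.path.join(tmpdir, "batch2.dat")
for path, n in ((path1, 5), (path2, 7)):
    m = np.memmap(path, dtype="float32", mode="w+", shape=(n, 3))
    m[:] = 1.0
    m.flush()

# mode='c' is copy-on-write: reads go to the file, writes stay in memory
x1 = np.memmap(path1, dtype="float32", mode="c", shape=(5, 3))
x2 = np.memmap(path2, dtype="float32", mode="c", shape=(7, 3))

# memmap arrays expose __len__ and __getitem__, so ConcatDataset accepts them
dataset = ConcatDataset([x1, x2])
print(len(dataset))  # 12
```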
The reason I am troubled is that I don't want to create copies of the arrays. For example, let's say that I want to apply some standardization. Before converting to a ConcatDataset, I calculate the mean (weighted mean) and std (weighted std), then transform the x's, and finally convert to a ConcatDataset.
That is, I am doing the following:
import numpy as np
from torch.utils.data import ConcatDataset

x1 = np.memmap('path_to_file1', mode='c')
x2 = np.memmap('path_to_file2', mode='c')
...
mean, std = get_weighted_mean_and_std([x1, x2, ...])
for x in [x1, x2, ...]:
    x -= mean
    x *= 1/std
dataset = ConcatDataset([x1, x2, ...])
The problem is the in-place assignments -= and *=, which return copies instead of views.