I have my data (very large np.arrays) saved on disk in batches with np.memmap. I am reading these batches as follows:
x1 = np.memmap('path_to_file1', mode='c')
x2 = np.memmap('path_to_file2', mode='c')
...
and combining them with ConcatDataset. I would like to apply some preprocessing to this combined dataset, but I don't know how I should proceed.
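For concreteness, here is a runnable sketch of this setup using tiny throw-away files (the paths, shapes, and dtype are made up for the example; note that np.memmap with mode='c' and no dtype/shape assumes uint8 over the whole file, so the real values have to be passed):

```python
import os
import tempfile
import numpy as np
from torch.utils.data import ConcatDataset

# Tiny stand-ins for the real batch files (names invented for the sketch)
tmpdir = tempfile.mkdtemp()
path1 = os.path.join(tmpdir, "batch1.dat")
path2 = os.path.join(tmpdir, "batch2.dat")
for path, n in ((path1, 5), (path2, 7)):
    m = np.memmap(path, dtype="float32", mode="w+", shape=(n, 3))
    m[:] = 1.0
    m.flush()

# mode='c' is copy-on-write: reads go to the file, writes stay in memory
x1 = np.memmap(path1, dtype="float32", mode="c", shape=(5, 3))
x2 = np.memmap(path2, dtype="float32", mode="c", shape=(7, 3))

# memmap arrays expose __len__ and __getitem__, so ConcatDataset accepts them
dataset = ConcatDataset([x1, x2])
print(len(dataset))  # 12
```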
The reason I am troubled is that I don't want to create copies of the arrays. For example, let's say that I want to apply some standardization. Before converting to a ConcatDataset, I calculate the mean (weighted mean) and std (weighted std), then transform the x's, and finally convert to a ConcatDataset.
That is, I am doing the following:
import numpy as np
from torch.utils.data import ConcatDataset

x1 = np.memmap('path_to_file1', mode='c')
x2 = np.memmap('path_to_file2', mode='c')
...
mean, std = get_weighted_mean_and_std([x1, x2, ...])
for x in [x1, x2, ...]:
    x -= mean
    x *= 1/std
dataset = ConcatDataset([x1, x2, ...])
The problem is the in-place assignments -= and *=, which return copies instead of views.