Is there a way to pass arguments to a collate_fn(), or otherwise give that function access to information other than ‘batch’?
The use case is this: I have a fairly large, custom data set in need of some sort of normalization. Exactly what normalization will be most effective is not obvious. It would therefore be convenient to use sklearn to construct a bunch of pre-populated scaler objects (say, a maxabs scaler, and a standardization scaler, just for starters) and use a collate_fn to perform the scaling on the fly.
To do this, though, the collate_fn needs some way to get the file location of those scaler files, and how to provide it eludes me.
(The brute force alternative of simply replicating the dataset multiple times, normalizing each one separately, is unpalatable due to the sheer size of the database.)
Depending on what you want, I’d probably try one of the following:
If the normalization is per example, the preferred way is to add it to the dataset and keep track of it there.
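As a rough sketch of the per-example route, the dataset can hold the normalization statistics and apply them in `__getitem__`. The class name `NormalizedDataset` and its `mean`/`std` arguments are made up for illustration, and plain Python lists stand in for tensors; with PyTorch you would normally subclass `torch.utils.data.Dataset`.

```python
class NormalizedDataset:
    """Hypothetical map-style dataset that normalizes each example on access."""

    def __init__(self, samples, labels, mean, std):
        self.samples = samples
        self.labels = labels
        # Statistics are stored on the dataset, so nothing else needs to know them.
        self.mean = mean
        self.std = std

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Standardize one example, feature by feature.
        x = [(v - m) / s for v, m, s in zip(self.samples[idx], self.mean, self.std)]
        return x, self.labels[idx]


dataset = NormalizedDataset(
    samples=[[2.0, 10.0], [4.0, 30.0]],
    labels=[0, 1],
    mean=[3.0, 20.0],
    std=[1.0, 10.0],
)
first_features, first_label = dataset[0]
```

Because the raw samples are never modified, trying a different normalization only means constructing another dataset object over the same data.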
There are various other ways to achieve something similar to what @fmassa suggested (and even more variants when you search for “currying in python”). The lazy person’s way would be using default arguments:
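A minimal sketch of the default-argument trick, under some assumptions: `ToyMaxAbsScaler` is a hypothetical stand-in for a pre-fitted scaler (for example one loaded from disk), `scaled_collate` is an illustrative name, and plain lists are used instead of tensors to keep the example self-contained.

```python
class ToyMaxAbsScaler:
    """Hypothetical stand-in for a pre-fitted scaler object."""

    def __init__(self, max_abs):
        self.max_abs = max_abs

    def transform(self, rows):
        return [[x / self.max_abs for x in row] for row in rows]


scaler = ToyMaxAbsScaler(max_abs=4.0)


# The default value is evaluated once, when the function is defined, so the
# scaler is bound to the function and callers only need to pass `batch`.
def scaled_collate(batch, scaler=scaler):
    features = [sample[0] for sample in batch]
    labels = [sample[1] for sample in batch]
    return scaler.transform(features), labels


batch = [([1.0, -2.0], 0), ([4.0, 0.5], 1)]
scaled, labels = scaled_collate(batch)
```

The function can then be handed over as `DataLoader(dataset, collate_fn=scaled_collate)`. To try a different scaler without redefining the function, `functools.partial(scaled_collate, scaler=other_scaler)` produces an equivalent single-argument callable.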