PyTorch best practices for serialization of Datasets and using YAML config files

ac1d · August 31, 2020, 2:40pm

Hello everyone!

The title is pretty much self-explanatory, but I would nonetheless like to elaborate a bit. Namely, I have two questions:

1. Very often we run multiple experiments on a single dataset, and very often we also have a bunch of preprocessing steps that we want to perform on the data. What is the best practice for serializing data after preprocessing? Is it considered best to serialize the entire torch.utils.data.Dataset object? Is it perhaps better to simply generate a preprocessed text file and read the input from there? Is there any preference when it comes to using pickle/dill? Feel free to elaborate and to answer any related question that I have not asked but that you believe is important.
2. A common practice in Machine Learning is to use YAML config files to specify all sorts of things for your data preprocessing pipeline, your model, your layers, hyperparams etc. Are there any best practices on using YAML config files?

Looking forward to your answers

tom · August 31, 2020, 3:40pm

Ha, nothing like a best practice question for some bikeshedding.

1. I think serializing the dataset object is not a good idea.
- If applicable, I tend to favor some native format for the given datatype. It also comes down to performance considerations (I/O is typically slower than some processing, so e.g. saving a tensor representation instead of a JPEG for image data is not a good idea). I’ve also seen similar things with medical imaging.
- I would avoid pickling. Typical options are using (dicts/lists of) tensors (using torch.save/load), numpy’s format, or, if you feel fancy, something like hdf5.
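
To illustrate the "dict of tensors" option, here is a minimal sketch of caching preprocessed data with torch.save/torch.load and wrapping it in a dataset afterwards. The `preprocess` function, the random stand-in data, the dictionary keys and the file name are all hypothetical and only there to make the example self-contained.

```python
import os
import tempfile

import torch
from torch.utils.data import TensorDataset


def preprocess(raw: torch.Tensor) -> torch.Tensor:
    # Hypothetical preprocessing step: standardize each feature column.
    return (raw - raw.mean(dim=0)) / raw.std(dim=0)


def load_or_build(cache_path: str) -> TensorDataset:
    """Reuse cached tensors if present, otherwise preprocess and cache them."""
    if os.path.exists(cache_path):
        data = torch.load(cache_path)
    else:
        # Random stand-in for real raw data (assumption for this sketch).
        torch.manual_seed(0)
        raw = torch.randn(100, 8)
        labels = torch.randint(0, 2, (100,))
        data = {"features": preprocess(raw), "labels": labels}
        # Save only plain tensors, not the Dataset object itself.
        torch.save(data, cache_path)
    return TensorDataset(data["features"], data["labels"])


cache_file = os.path.join(tempfile.mkdtemp(), "preprocessed.pt")
first = load_or_build(cache_file)   # preprocesses and writes the cache
second = load_or_build(cache_file)  # reads the cache, skips preprocessing
```

Because only tensors are stored, the Dataset class can change between experiments without invalidating the cached file.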
2. I think I’m much too grumpy for that. IMO whether or not yaml is a good format (as opposed to command line parameters, a .py or whatnot) depends on how large the number of parameters is and how generalizable you need it to be. For example, I would not encode model architecture in anything but code unless you really need to vary it a lot across your experiments. But then I might not be the best source for advice on this part.
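
One way to read that split, as a rough sketch: keep a handful of scalar hyperparameters in YAML and leave the architecture in code. This assumes PyYAML is installed; the `TrainConfig` fields and the inline config text are hypothetical, and in practice the text would come from a file.

```python
from dataclasses import dataclass

import yaml  # PyYAML

# Hypothetical config contents, inlined here instead of read from a file.
CONFIG_TEXT = """
learning_rate: 0.001
batch_size: 32
epochs: 10
"""


@dataclass
class TrainConfig:
    learning_rate: float
    batch_size: int
    epochs: int


def load_config(text: str) -> TrainConfig:
    # safe_load only builds plain Python types (dict, list, str, numbers).
    values = yaml.safe_load(text)
    # Unknown or missing keys raise a TypeError here, which catches typos early.
    return TrainConfig(**values)


config = load_config(CONFIG_TEXT)
```

The dataclass gives one place where the expected parameters are listed, so a misspelled key in the YAML fails at load time rather than silently falling back to a default.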

Best regards

Thomas