Introducing the TorchSample package for comprehensive data transforms and sampling

Hi all,

As a new PyTorch user, I found the data sampling and transforms code lacking for my personal use-case. It’s understandable, since the core developers are busy working on the more important stuff. Still, I wanted to quickly build up the available sampling code to the same level as TensorFlow, Keras, etc., and I think I’ve accomplished that with the torchsample package.

The torchsample package is a 3rd-party library that includes code for comprehensive sampling from both in-memory and out-of-memory data, and includes a ton of useful augmentation transforms that apply directly on arbitrary torch tensors (rather than just PIL images).

It also has great support for situations where both the input and target tensors are images (e.g. segmentation datasets), and it supports arbitrary data types. It supersedes the currently available sampling code in the main torchvision/torch codebase.
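
To make the input/target case concrete, here is a minimal sketch of the general idea: draw the random decision once and apply the same operation to both the image and its mask, so they stay aligned. This is not torchsample's actual API. The names `JointRandomApply` and `hflip_rows` are hypothetical, and nested lists stand in for tensors so the snippet runs without any dependencies.

```python
import random


class JointRandomApply:
    """Apply the same op to input and target together, or to neither.

    Hypothetical helper for illustration; not part of torchsample.
    """

    def __init__(self, op, p=0.5, rng=None):
        self.op = op
        self.p = p
        self.rng = rng or random.Random()

    def __call__(self, x, y):
        # Draw once so the input and target receive the same decision.
        if self.rng.random() < self.p:
            return self.op(x), self.op(y)
        return x, y


def hflip_rows(img):
    """Horizontally flip a 2D nested list (stand-in for a tensor flip)."""
    return [row[::-1] for row in img]


image = [[1, 2, 3], [4, 5, 6]]
mask = [[0, 0, 1], [1, 0, 0]]

flip = JointRandomApply(hflip_rows, p=1.0)  # p=1.0 always flips
aug_image, aug_mask = flip(image, mask)
```

With real tensors, the same wrapper should work by passing something like `lambda t: t.flip(-1)` as `op`.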

Take a look here: https://github.com/ncullen93/torchsample

I’ve also written a fairly long tutorial showing how it works for a ton of common scenarios, which can be found in the tutorials folder of the above repository.

It’s my hope that this will kick-start the community-driven development of the sampling code in the main torch and torchvision packages, and serve as reliable and flexible sampling code in the meantime.

NOTE: This package is in no way endorsed by, affiliated with, or otherwise associated with the official PyTorch ecosystem or team.

This is good stuff. I’m going to start focusing more on torchvision while sam and paszke focus on core, and will use parts of your package as inspiration.

Cool! I think there’s a huge need for this stuff… and no framework has quite converged on the optimal solution yet (for example, TensorFlow is now planning to clean up their disaster of an input pipeline: https://github.com/tensorflow/tensorflow/issues/7951). PyTorch has a good opportunity to improve on everyone else’s mistakes from the start…

Another benefit of a common API for sampling is that you would remove the need to create a new data.Dataset subclass for each public dataset and instead would just need to download and pre-process the data… That would greatly increase your productivity in that area and make it easy to augment those datasets.
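
As a rough illustration of that point, a single generic wrapper over already-loaded data could stand in for many per-dataset subclasses. This is only a sketch under my own assumptions: `ArrayDataset` and its parameter names are hypothetical and not taken from torch, torchvision, or torchsample. It relies only on `__len__` and `__getitem__`, which is the map-style protocol that torch.utils.data.DataLoader consumes.

```python
class ArrayDataset:
    """Generic map-style dataset over any indexable inputs and targets.

    Hypothetical class for illustration only.
    """

    def __init__(self, inputs, targets,
                 input_transform=None, target_transform=None):
        if len(inputs) != len(targets):
            raise ValueError("inputs and targets must have the same length")
        self.inputs = inputs
        self.targets = targets
        self.input_transform = input_transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        x, y = self.inputs[idx], self.targets[idx]
        # Transforms are optional and applied per sample at access time.
        if self.input_transform is not None:
            x = self.input_transform(x)
        if self.target_transform is not None:
            y = self.target_transform(y)
        return x, y


# Any pre-processed data works: lists, arrays, or tensors.
ds = ArrayDataset([1, 2, 3], [10, 20, 30], input_transform=lambda x: x * 2)
sample = ds[1]
```

Here the per-dataset work shrinks to downloading and pre-processing; augmentation is just a callable passed in at construction.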

PS: the ImageFolder class seems kinda on an island in torchvision, while the rest of the relevant sampling code (e.g. TensorDataset) lives in the main repo. I love the idea of torchvision for loading public datasets and pre-trained models… but maybe it would make more sense to separate the general purpose sampling code into a third torchvision folder (e.g. models, datasets, sampling) or develop the sampling solely in the main repo… just my very humble opinion of course.

Could you please give me some details on how to install this package? I have conda installed and have created a virtualenv for it. Thanks!

Follow the instructions given here

If you are using pip, just clone, go to the directory with setup.py and run “pip install -e .” - this will read the setup and install the package (this way you can also easily uninstall it if needed)

Just as an update - torchsample has now become nitrain and is available on GitHub at ncullen93/nitrain. It is still focused on high-level tools for augmentation, training, and visualization of medical imaging AI models in PyTorch.