Hi, I’m working on a library similar to torchvision, but for geospatial data: think multi-spectral satellite imagery datasets, transforms, and pre-trained models.
When perusing the torchvision source code, I noticed that there are a lot of inconsistencies due to the long history of development and changing needs of users. For example:
- Datasets return images and targets, or images and masks and classes, or images and dicts with bounding boxes
- Datasets accept transform, target_transform, and/or transforms
- Transforms support PIL Images and/or torch Tensors
- Transforms subclass object, or nn.Module (for TorchScript support)
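To make that last distinction concrete, here is the nn.Module style as I understand it (my own minimal sketch, not actual torchvision code). As far as I can tell, subclassing nn.Module with tensor-annotated `forward` is what makes a transform TorchScript-compatible:

```python
import torch
from torch import nn


class Normalize(nn.Module):
    """Per-channel normalization for a (C, H, W) tensor (illustrative only)."""

    def __init__(self, mean, std):
        super().__init__()
        # Buffers move with .to(device) and are visible to TorchScript
        self.register_buffer("mean", torch.as_tensor(mean).view(-1, 1, 1))
        self.register_buffer("std", torch.as_tensor(std).view(-1, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return (x - self.mean) / self.std


# Because it is an nn.Module with typed forward, it can be scripted:
scripted = torch.jit.script(Normalize([0.5, 0.5], [0.25, 0.25]))
out = scripted(torch.ones(2, 4, 4))  # (1.0 - 0.5) / 0.25 == 2.0 per element
```

A plain `object` transform with a `__call__` method can do the same math, but (as I understand it) cannot be passed through `torch.jit.script`, which seems to be the main trade-off.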
My question is: if you were going to write a library like torchvision from scratch, which of these are state-of-the-art best practices, and which are just left over for backwards compatibility? Is it best practice to write all new transforms as subclasses of nn.Module, or is there some advantage to plain object transforms? Since I’m working with multi-spectral imagery, PIL Images won’t work at all, so I’ll likely write all transforms for torch Tensors. If I want consistency between datasets and transforms across tasks (object detection, instance segmentation, etc.), should all datasets return (and all transforms accept) dicts with keys for the possible components (image, mask, bounding boxes, class labels, etc.)?
I’m also curious what other issues the torchvision developers have faced over the years and how they solved them. Would love to meet over a video call.