We sometimes train models using annotations from multiple datasets. Merging several datasets into one and building dataloaders over the result takes a lot of effort and many, many for loops. I only recently found that organizing datasets as SQL tables and doing merges/queries greatly reduces the amount of code I have to write, which probably saved a lot of my hair.
In fact, a COCO image annotation like
{"license": 5, "file_name": "COCO_train2014_000000057870.jpg", "coco_url": "https://images.cocodataset.org/train2014/COCO_train2014_000000057870.jpg", "height": 480, "width": 640, "date_captured": "2013-11-14 16:28:13", "flickr_url": "https://farm4.staticflickr.com/3153/2970773875_164f0c0b83_z.jpg", "id": 57870}, …
can be represented as a SQL table with fields "license", "file_name", "coco_url", "height", "width", etc. Annotations from two different datasets can then be merged with a SQL join on image names/ids, and a dataloader can simply process and return rows of the joined table.
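As a minimal sketch of this idea, using Python's built-in sqlite3: the table names, the "captions" fields, and the caption text below are hypothetical stand-ins for a second dataset's annotations; only the image record mirrors the COCO entry shown above.

```python
import sqlite3

# In-memory database holding one table per annotation source.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images (id INTEGER PRIMARY KEY, file_name TEXT, "
    "height INTEGER, width INTEGER)"
)
conn.execute("CREATE TABLE captions (image_id INTEGER, caption TEXT)")

# One row per COCO "images" entry.
conn.execute(
    "INSERT INTO images VALUES (?, ?, ?, ?)",
    (57870, "COCO_train2014_000000057870.jpg", 480, 640),
)
# Made-up annotation standing in for a second dataset.
conn.execute(
    "INSERT INTO captions VALUES (?, ?)",
    (57870, "a person riding a bike"),
)

# The merge is a single SQL join instead of nested for loops.
rows = conn.execute(
    "SELECT i.file_name, i.height, i.width, c.caption "
    "FROM images AS i JOIN captions AS c ON i.id = c.image_id"
).fetchall()

class SQLDataset:
    """Minimal map-style dataset over the joined rows; PyTorch's
    DataLoader accepts any object with __len__ and __getitem__."""
    def __init__(self, rows):
        self.rows = rows
    def __len__(self):
        return len(self.rows)
    def __getitem__(self, i):
        return self.rows[i]

ds = SQLDataset(rows)
```

A real dataloader would decode the image file and tokenize the caption in `__getitem__`, but the joining logic stays a one-line SQL query.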
I think PyTorch could implement something like "torch.db" for this. Traditional databases can't handle GPU tensors efficiently, so an opportunity for PyTorch is to enable fast joins of tables containing tensors and to optimally manage access across GPU memory, RAM, and disk.