We sometimes train models using annotations from multiple datasets. Merging several datasets into one and building dataloaders over the result takes a lot of effort and many, many for loops. I only recently found that organizing datasets as SQL tables and doing merges/queries greatly reduces the amount of code I have to write, which probably saved a lot of my hair.
In fact, a COCO image annotation like
{"license": 5, "file_name": "COCO_train2014_000000057870.jpg", "coco_url": "https://images.cocodataset.org/train2014/COCO_train2014_000000057870.jpg", "height": 480, "width": 640, "date_captured": "2013-11-14 16:28:13", "flickr_url": "https://farm4.staticflickr.com/3153/2970773875_164f0c0b83_z.jpg", "id": 57870}, …
can be represented as a SQL table with fields "license", "file_name", "coco_url", "height", "width", etc. Annotations from two different datasets can then be merged with a SQL join on image names/ids, and a dataloader can simply process and return rows of the joined table.
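As a minimal sketch of this idea, using Python's built-in sqlite3: the table names, the "captions" fields, and the caption text below are hypothetical stand-ins for a second dataset's annotations; only the image record mirrors the COCO entry shown above.

```python
import sqlite3

# In-memory database holding one table per annotation source.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images (id INTEGER PRIMARY KEY, file_name TEXT, "
    "height INTEGER, width INTEGER)"
)
conn.execute("CREATE TABLE captions (image_id INTEGER, caption TEXT)")

# One row per COCO "images" entry.
conn.execute(
    "INSERT INTO images VALUES (?, ?, ?, ?)",
    (57870, "COCO_train2014_000000057870.jpg", 480, 640),
)
# Made-up annotation standing in for a second dataset.
conn.execute(
    "INSERT INTO captions VALUES (?, ?)",
    (57870, "a person riding a bike"),
)

# The merge is a single SQL join instead of nested for loops.
rows = conn.execute(
    "SELECT i.file_name, i.height, i.width, c.caption "
    "FROM images AS i JOIN captions AS c ON i.id = c.image_id"
).fetchall()

class SQLDataset:
    """Minimal map-style dataset over the joined rows; PyTorch's
    DataLoader accepts any object with __len__ and __getitem__."""
    def __init__(self, rows):
        self.rows = rows
    def __len__(self):
        return len(self.rows)
    def __getitem__(self, i):
        return self.rows[i]

ds = SQLDataset(rows)
```

A real dataloader would decode the image file and tokenize the caption in `__getitem__`, but the joining logic stays a one-line SQL query.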
I think PyTorch could implement something like "torch.db" for this. Traditional databases can't handle GPU tensors efficiently, so an opportunity for PyTorch is to enable fast joins of tables containing tensors and to optimally manage access across GPU memory, RAM, and disk.