SQLite => Custom Dataset

I have a massive dataset on disk (far too big for main memory) for which I’m trying to create a custom Dataset class. It is a SQLite database (the Reddit May2015 comments dataset, if you’re familiar with it).

Unfortunately, as far as I can tell, the SQLite database lacks a primary key for some reason, which makes the `__getitem__` query non-trivial; otherwise I’d just do something like “SELECT col FROM table WHERE rowid=%s”. My current idea is to use Apache Spark (PySpark) and wrap it in a Spark DataFrame via JDBC, but this feels a little roundabout.
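For reference, here is roughly what I’d want the rowid-based version to look like (a minimal sketch only; the table name `May2015` and column `body` are guesses for the Reddit dump, and it assumes rowids are contiguous, i.e. no deleted rows; ordinary SQLite tables expose an implicit rowid unless created WITHOUT ROWID, so this may work even without a declared primary key):

```python
import sqlite3
from torch.utils.data import Dataset

class SQLiteCommentsDataset(Dataset):
    """Map-style Dataset backed by a SQLite file, indexed via the implicit rowid."""

    def __init__(self, db_path, table="May2015", column="body"):
        self.db_path = db_path
        self.table = table
        self.column = column
        self.conn = None  # opened lazily so each DataLoader worker gets its own handle
        conn = sqlite3.connect(db_path)
        self.length = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        conn.close()

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.conn is None:
            self.conn = sqlite3.connect(self.db_path)
        # rowid is 1-based; use a parameterized query instead of string formatting
        row = self.conn.execute(
            f"SELECT {self.column} FROM {self.table} WHERE rowid = ?", (idx + 1,)
        ).fetchone()
        return row[0]
```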

Does anyone have experience loading massive SQLite databases in PyTorch?

You can refer to the pytorch-sqlite project.