I have a massive dataset on disk (far too big for main memory) for which I'm trying to create a custom `Dataset` class. It is a SQLite database (the Reddit May2015 comments dataset, if you're familiar with that).
Unfortunately, as far as I can tell, the table lacks a primary key for some reason, which makes the `__getitem__` query non-trivial; otherwise I'd just do something like `SELECT col FROM table WHERE rowid = ?`. My current idea is to use Apache Spark (pyspark) and wrap the database in a Spark DataFrame via JDBC, but that feels a little roundabout.
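For concreteness, here's the kind of `Dataset` I had in mind, assuming the table still has SQLite's implicit rowid (i.e. it wasn't created `WITHOUT ROWID`) and that rowids are contiguous (no rows were ever deleted). The table and column names below are placeholders, not the actual schema:

```python
import sqlite3
from torch.utils.data import Dataset

class RedditCommentsDataset(Dataset):
    """Sketch: index an on-disk SQLite table by its implicit rowid.

    Assumes the table was not created WITHOUT ROWID and that no rows were
    ever deleted (so rowids run contiguously from 1 to COUNT(*)).
    The default table/column names are placeholders, not the real schema.
    """

    def __init__(self, db_path, table="May2015", column="body"):
        self.db_path = db_path
        self.table = table
        self.column = column
        self.conn = None  # opened lazily so each DataLoader worker gets its own handle
        conn = sqlite3.connect(db_path)
        self._len = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        conn.close()

    def __len__(self):
        return self._len

    def __getitem__(self, idx):
        if self.conn is None:
            self.conn = sqlite3.connect(self.db_path)
        # rowid is 1-based in SQLite, hence idx + 1
        row = self.conn.execute(
            f"SELECT {self.column} FROM {self.table} WHERE rowid = ?",
            (idx + 1,),
        ).fetchone()
        return row[0]
```

The connection is opened lazily inside `__getitem__` so that each `DataLoader` worker process ends up with its own handle, since SQLite connections shouldn't be shared across processes.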
Does anyone have experience loading massive SQLite databases in PyTorch?