I’m trying to keep things in a postgres database, because - well, it’s complicated.
Is there anyone who’s done this in an efficient manner with the DataLoader and Dataset classes? I’m relatively proficient at Google-Fu, and no dice so far.
I really would prefer not to have to export from postgres to numpy arrays or csvs, but it seems that those are the best ways I can do this.
Please, someone prove me wrong if you can.
Would it be possible to query your data using some kind of index? If so, you could fetch the sample from your Postgres database in the __getitem__ method of your Dataset and convert it to a numpy array using e.g. warp_prism. From there you could transform the numpy array to a tensor with torch.from_numpy.
I haven’t tried it and am absolutely no database expert, but it could be a potential way.
Let me know if I’m completely missing some crucial aspects.
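A minimal sketch of that idea. I’ve used stdlib sqlite3 here so it runs standalone; for the real thing you’d swap in a psycopg2 connection (and optionally warp_prism for the numpy conversion). The table name samples and its columns (id, label, feature-as-float32-blob) are made up for the example:

```python
import sqlite3

import numpy as np
import torch
from torch.utils.data import Dataset


class DBDataset(Dataset):
    """Fetches one row per __getitem__ straight from the database."""

    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path)
        # take the length from the table itself, so __len__ matches the data
        self.length = self.conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # assumes a table samples(id INTEGER, label INTEGER, feature BLOB)
        label, feature = self.conn.execute(
            "SELECT label, feature FROM samples WHERE id = ?", (idx,)
        ).fetchone()
        # frombuffer gives a read-only view, so copy before from_numpy
        x = torch.from_numpy(np.frombuffer(feature, dtype=np.float32).copy())
        y = torch.tensor(label)
        return x, y
```

With psycopg2 the placeholder syntax changes from ? to %s, but the Dataset structure is the same.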
Actually… that might work. I need to think about it.
No… wait. It will work. Thanks. Brilliant. I should have thought of that myself.
Just thought I’d share, since running so many single-record queries can be a performance bottleneck.
I found a considerable speed-up when querying my dataset by simply returning the index from the __getitem__ function.
The list handed to a custom collate function can then be converted to a tuple and used in a WHERE ... IN query, which returns the whole batch of records in a single query.
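That trick can be sketched as follows, again with stdlib sqlite3 standing in for Postgres so the snippet runs on its own (with psycopg2 you’d pass a tuple to a WHERE id IN %s query instead of building ? placeholders). The samples table and its columns are assumptions for the example:

```python
import sqlite3

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class BatchedDBDataset(Dataset):
    """__getitem__ only hands back the index; collate runs one query per batch."""

    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path)
        self.length = self.conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]

    def __len__(self):
        return self.length

    def __getitem__(self, idx: int) -> int:
        return idx  # no per-sample query here

    def collate(self, indices: list) -> tuple:
        # one query for the whole batch; with psycopg2 this would be
        # "WHERE id IN %s" with (tuple(indices),) as the parameter
        placeholders = ",".join("?" * len(indices))
        rows = self.conn.execute(
            f"SELECT label, feature FROM samples WHERE id IN ({placeholders})",
            indices,
        ).fetchall()
        labels = torch.tensor([r[0] for r in rows])
        feats = torch.stack(
            [torch.from_numpy(np.frombuffer(r[1], dtype=np.float32).copy()) for r in rows]
        )
        return feats, labels


# usage: DataLoader collects the indices and hands them to our collate
# loader = DataLoader(ds, batch_size=256, collate_fn=ds.collate)
```

Note that IN does not guarantee the rows come back in the order of the indices, so if sample order matters within a batch you’d need to reorder the results.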
I ran into this issue too. I made two dataloaders. The first dataloader checks the length of the table you are querying in
__init__, and its
__getitem__ spits out a large chunk of row IDs from the database. The second dataloader is initialized with these indices as an
__init__ parameter, grabs all of those rows from the DB, and stores them in
self.X. You can then batch off that second loader to train your model.
Still interested in other solutions, but this gets rid of the problem of running a SELECT statement on every single item.
# indices is an array of row IDs from SQL (maybe 1,000,000 at a time)
for indices in partition_loader:
    # second loader initialized within this loop w/ those indices
    loader = data.DataLoader(local_dataset(indices, **params), **loader_params)
    for batch_x, batch_y in loader:
        ...  # train on the batch
I think that this would probably be a cleaner way to do it than the one we settled on. Possibly more efficient as well.
So I obtained batches directly from the psql db with psycopg2 using a single dataloader and dataset.
I was using a connection pool in my instantiation of the dataset object.
I have since switched to fast5 files and pandas for my training datasets for performance reasons, but I prefer the data wrangling capabilities of SQL. It would be nice if training your model this way were more efficient.
def __getitem__(self, idx: int) -> int:
    return idx  # the actual query happens in collate

def collate(self, indices: list) -> tuple:
    query = """SELECT label, feature FROM dataset WHERE index IN %s"""
    result = None
    conn = self.conn_pool.getconn()
    try:
        cursor = conn.cursor()
        cursor.execute(query, (tuple(indices),))
        result = cursor.fetchall()
    except Error as conn_pool_error:
        print(conn_pool_error)
    finally:
        self.conn_pool.putconn(conn)
    if result is None:
        raise Exception('problem in obtaining batches')
    # format batches here
    return result
I found it easier not to use Dataset and DataLoader at all and to rely on async instead. Here’s a writeup (with code): https://firstname.lastname@example.org/using-postgres-as-a-dataloader-with-pytorch-bba0d5cbe1fa
Is there a way to wrap this in a DataLoader? PyTorch Lightning requires the use of DataLoaders.