Best way to work with Spark-generated datasets?


(Mikhail Osckin) #1

What is the best way to work with Spark-generated output files? Right now I'm using Spark for raw-data vectorization, so I just convert the output Parquet files into TFRecords and load them with the TensorFlow dataset machinery. But it's not clear how to do something similar with PyTorch, because the `Dataset.__getitem__` method takes an `item_idx`, and most big-data tools and formats (Parquet, for example) don't support efficient random access by row index.
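To make the mismatch concrete, this is roughly the contract a map-style PyTorch dataset imposes (a minimal sketch; the in-memory sample list is just a stand-in for real data):

```python
from torch.utils.data import Dataset

class MapStyleExample(Dataset):
    """Map-style datasets must answer arbitrary index lookups,
    which assumes the underlying storage supports cheap random access."""

    def __init__(self, samples):
        self.samples = samples            # stand-in: a list of tensors in memory

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, item_idx):      # the DataLoader calls this with arbitrary indices
        return self.samples[item_idx]
```

Columnar formats like Parquet are laid out for sequential scans over row groups, not per-row lookups, which is why this interface is awkward for Spark output.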


(Lian Jiang) #2

Any help? There seems to be a gap between Spark and PyTorch: we can use Spark for preprocessing, but then we need to convert the output (e.g. Parquet files) into a format PyTorch can consume. Thanks.


(Mikhail Osckin) #3

Seems like the only way to do this right now is to write your own `Dataset` and `DataLoader`, for example an `IterableDataset` that streams the Parquet files instead of indexing into them (rough sketch below).
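Something along these lines, assuming pyarrow is available and that the Spark output has `features` and `label` columns (both names are assumptions; adapt them to your schema):

```python
import pyarrow.parquet as pq
import torch
from torch.utils.data import IterableDataset, DataLoader

class ParquetIterableDataset(IterableDataset):
    """Streams rows out of Parquet files batch by batch,
    sidestepping the random-access requirement of __getitem__."""

    def __init__(self, file_paths, read_batch_size=1024):
        self.file_paths = file_paths
        self.read_batch_size = read_batch_size

    def __iter__(self):
        for path in self.file_paths:
            parquet_file = pq.ParquetFile(path)
            for batch in parquet_file.iter_batches(batch_size=self.read_batch_size):
                df = batch.to_pandas()
                for row in df.itertuples(index=False):
                    # 'features'/'label' column names are assumptions
                    yield (torch.tensor(row.features, dtype=torch.float32),
                           torch.tensor(row.label))

# The DataLoader re-batches the per-sample stream as usual.
loader = DataLoader(ParquetIterableDataset(["part-00000.parquet"]), batch_size=32)
```

One caveat: with `num_workers > 0` every worker replays the full file list, so the part-files need to be sharded across workers.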


(Sandro Cavallari) #5

Hi all,

any update on this?
I'm very interested, but there is no useful information at this link either: https://docs.databricks.com/applications/deep-learning/pytorch.html

Apparently there is no built-in support for loading a distributed DataFrame into PyTorch in a Spark-like fashion.
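In the meantime, the custom-dataset route above can at least be parallelized across DataLoader workers by sharding the Parquet part-files with `get_worker_info()`. A rough sketch (the `read_parquet_rows` helper is hypothetical, e.g. the pyarrow loop from the earlier sketch):

```python
from torch.utils.data import IterableDataset, get_worker_info

class ShardedParquetDataset(IterableDataset):
    """Splits a list of Parquet part-files across DataLoader workers,
    so num_workers > 0 doesn't make every worker read every file."""

    def __init__(self, file_paths):
        self.file_paths = sorted(file_paths)

    def __iter__(self):
        info = get_worker_info()
        if info is None:                  # single-process loading
            paths = self.file_paths
        else:                             # worker k takes every num_workers-th file
            paths = self.file_paths[info.id::info.num_workers]
        for path in paths:
            yield from read_parquet_rows(path)  # hypothetical reader (see earlier sketch)
```

It's not Spark-level distribution, but it keeps the reading parallel and out of the training loop.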