Best way to work with Spark-generated datasets?

What is the best way to work with Spark-generated output files? Right now I'm using Spark for raw data vectorization, so I just convert the output Parquet files into TFRecords and load them with the tf.data Dataset machinery. But it's not clear how I can do something similar with PyTorch, because the Dataset.__getitem__ method requires an item index, and most big-data tools and formats (Parquet, for example) don't handle lookups by item index well.


Any help? There seems to be a gap between Spark and PyTorch. We can use Spark for preprocessing, but then we need to convert its output (e.g. Parquet files) into a format that is friendly to PyTorch. Thanks.

Seems like the only way to do this right now is to write your own Dataset and DataLoader.

Hi all,

Any update on this?
I'm very interested, but there is no useful information at this link either.

Apparently there is no support for loading a distributed dataframe into PyTorch in a Spark-like fashion.

Hello,

I am running into the same challenge right now: I am working with Spark data and need to feed it into a PyTorch model.

I am wondering whether a solution to this common problem has appeared after all this time?