Best way to work with Spark-generated datasets?


(Mikhail Osckin) #1

What is the best way to work with Spark-generated output files? Right now I'm using Spark for raw-data vectorization, so I just convert the output Parquet files into TFRecords and load them with the tf.data machinery. But it's not clear how I can do something similar with PyTorch, because the Dataset.__getitem__ method requires an item index, while most big-data tools and formats (for example, Parquet) don't support efficient random access by index.
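For context, here is the interface mismatch in code form: a map-style torch.utils.data.Dataset has to answer "give me row i" for an arbitrary i. A naive sketch over a Parquet file with pyarrow (the file path and the assumption of numeric columns are placeholders for illustration) can only do that by reading the whole file into memory up front:

import pyarrow.parquet as pq
import torch
from torch.utils.data import Dataset

class NaiveParquetDataset(Dataset):
    """Map-style access over Parquet: works, but only by loading
    the entire file eagerly -- impractical for Spark-scale output."""

    def __init__(self, path):
        self.table = pq.read_table(path)  # reads every row group at once

    def __len__(self):
        return self.table.num_rows

    def __getitem__(self, idx):
        # Random access by row index -- exactly the pattern that
        # row-group-oriented formats like Parquet are not built for.
        row = self.table.slice(idx, 1).to_pylist()[0]
        return {k: torch.tensor(v) for k, v in row.items()}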


(Lian Jiang) #2

Any help? There seems to be a gap between Spark and PyTorch. We can use Spark for preprocessing, but then the output (e.g., in Parquet format) needs to be converted into something PyTorch can consume. Thanks.


(Mikhail Osckin) #3

Seems like the only way to do this right now is to write your own Dataset and DataLoader.
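A rough sketch of what that custom Dataset could look like, assuming the Spark output is a directory of Parquet part files readable with pyarrow (the glob pattern, batch size, and the "features"/"label" column names are placeholders). An IterableDataset sidesteps the item_idx problem entirely: it streams row groups sequentially, which is the access pattern Parquet is good at, and shards whole files across DataLoader workers so each row is yielded exactly once:

import glob

import pyarrow.parquet as pq
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class ParquetIterableDataset(IterableDataset):
    """Streams rows from Spark-written Parquet part files sequentially,
    so no random access by index is ever needed."""

    def __init__(self, pattern):
        self.files = sorted(glob.glob(pattern))  # e.g. "out/part-*.parquet"

    def __iter__(self):
        worker = get_worker_info()
        files = self.files
        if worker is not None:
            # Shard whole files across DataLoader workers so workers
            # never read overlapping data.
            files = files[worker.id::worker.num_workers]
        for path in files:
            pf = pq.ParquetFile(path)
            # iter_batches streams record batches without loading the
            # whole file into memory.
            for batch in pf.iter_batches(batch_size=1024):
                for row in batch.to_pylist():
                    # Assumed column names -- adjust to your schema.
                    yield (torch.tensor(row["features"], dtype=torch.float32),
                           torch.tensor(row["label"]))

loader = DataLoader(ParquetIterableDataset("out/part-*.parquet"),
                    batch_size=32, num_workers=4)

Sharding by whole files keeps each worker doing purely sequential I/O, which matches how Spark writes its output; the trade-off versus a map-style Dataset is that you give up global shuffling (you'd shuffle within a buffer, or shuffle upstream in Spark).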