Best way to work with spark-generated datasets?


(Mikhail Osckin) #1

What is the best way to work with Spark-generated output files? Right now I'm using Spark for raw data vectorization, and I just convert the output Parquet files into TFRecords and load them with the tf.data machinery. But it's not clear how I can do something similar with PyTorch, because the `Dataset.__getitem__` method requires an item index, while most big-data tools and formats (Parquet, for example) don't handle random per-item access by index well.
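
One direction I'm aware of is PyTorch's `IterableDataset`, which sidesteps `__getitem__` entirely and just streams records. Here's a minimal sketch of what I mean, reading Parquet parts sequentially with pyarrow (the file pattern, column names, and batch size are placeholders, not anything specific to my setup):

```python
import glob

import pyarrow.parquet as pq
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class ParquetIterableDataset(IterableDataset):
    """Streams rows out of a set of Parquet files instead of indexing into them."""

    def __init__(self, file_pattern, columns=None, batch_size=1024):
        self.files = sorted(glob.glob(file_pattern))
        self.columns = columns
        self.batch_size = batch_size

    def __iter__(self):
        # Shard whole files across DataLoader workers so each file
        # is read by exactly one worker.
        info = get_worker_info()
        files = (self.files if info is None
                 else self.files[info.id::info.num_workers])
        for path in files:
            pf = pq.ParquetFile(path)
            # Read the file in chunks; Parquet is fast at sequential scans.
            for batch in pf.iter_batches(batch_size=self.batch_size,
                                         columns=self.columns):
                data = batch.to_pydict()
                for i in range(batch.num_rows):
                    # Convert each row's columns to tensors.
                    yield {k: torch.as_tensor(v[i]) for k, v in data.items()}


# Hypothetical usage: "features" and "label" are assumed column names.
ds = ParquetIterableDataset("part-*.parquet", columns=["features", "label"])
loader = DataLoader(ds, batch_size=32, num_workers=2)
```

Is streaming like this the right direction, or is there a more standard tool for getting Spark output into PyTorch?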