What is the best way to work with Spark-generated output files? Right now I'm using Spark for raw data vectorization, so I just convert the output Parquet files into TFRecords and load them with the TF Dataset machinery. But it's not clear how I can do something similar with PyTorch, because the `Dataset.__getitem__` method requires an item index, and most big-data tools and formats (Parquet, for example) don't handle random, index-based access well.
Any help? There seems to be a gap between Spark and PyTorch. We can use Spark for preprocessing, but then we need to convert its output (e.g. in Parquet format) into a format friendly to PyTorch. Thanks.
Seems like the only way to do this right now is to write your own Dataset and DataLoader.
Any update on this?
I’m very interested too, but there is no useful information at this link either: https://docs.databricks.com/applications/deep-learning/pytorch.html
Apparently there is no support for loading a distributed DataFrame into PyTorch in a Spark-like fashion.