Best way to work with Spark-generated datasets?


(Mikhail Osckin) #1

What is the best way to work with Spark-generated output files? Right now I'm using Spark for raw-data vectorization, so I just convert the output Parquet files into TFRecords and load them with the TensorFlow dataset machinery. But it's not clear how to do something similar with PyTorch, because the `Dataset.__getitem__` method takes an `item_idx`, and most big-data tools and formats (Parquet, for example) don't support efficient random access by row index.
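To make the mismatch concrete, this is roughly the contract a map-style PyTorch dataset imposes (a minimal sketch; the in-memory sample list is just a stand-in for real data):

```python
from torch.utils.data import Dataset

class MapStyleExample(Dataset):
    """Map-style datasets must answer arbitrary index lookups,
    which assumes the underlying storage supports cheap random access."""

    def __init__(self, samples):
        self.samples = samples            # stand-in: a list of tensors in memory

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, item_idx):      # the DataLoader calls this with arbitrary indices
        return self.samples[item_idx]
```

Columnar formats like Parquet are laid out for sequential scans over row groups, not per-row lookups, which is why this interface is awkward for Spark output.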


(Lian Jiang) #2

Any help? There seems to be a gap between Spark and PyTorch: we can use Spark for preprocessing, but then we need to convert the output (e.g. Parquet files) into a format PyTorch can consume. Thanks.


(Mikhail Osckin) #3

Seems like the only way to do this right now is to write your own `Dataset` and `DataLoader`, for example an `IterableDataset` that streams the Parquet files instead of indexing into them (rough sketch below).
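Something along these lines, assuming pyarrow is available and that the Spark output has `features` and `label` columns (both names are assumptions; adapt them to your schema):

```python
import pyarrow.parquet as pq
import torch
from torch.utils.data import IterableDataset, DataLoader

class ParquetIterableDataset(IterableDataset):
    """Streams rows out of Parquet files batch by batch,
    sidestepping the random-access requirement of __getitem__."""

    def __init__(self, file_paths, read_batch_size=1024):
        self.file_paths = file_paths
        self.read_batch_size = read_batch_size

    def __iter__(self):
        for path in self.file_paths:
            parquet_file = pq.ParquetFile(path)
            for batch in parquet_file.iter_batches(batch_size=self.read_batch_size):
                df = batch.to_pandas()
                for row in df.itertuples(index=False):
                    # 'features'/'label' column names are assumptions
                    yield (torch.tensor(row.features, dtype=torch.float32),
                           torch.tensor(row.label))

# The DataLoader re-batches the per-sample stream as usual.
loader = DataLoader(ParquetIterableDataset(["part-00000.parquet"]), batch_size=32)
```

One caveat: with `num_workers > 0` every worker replays the full file list, so the part-files need to be sharded across workers.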


(Sandro Cavallari) #5

Hi all,

any update on this?
I'm very interested, but there is no useful information at this link either: https://docs.databricks.com/applications/deep-learning/pytorch.html

Apparently there is no built-in support for loading a distributed DataFrame into PyTorch in a Spark-like fashion.
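In the meantime, the custom-dataset route above can at least be parallelized across DataLoader workers by sharding the Parquet part-files with `get_worker_info()`. A rough sketch (the `read_parquet_rows` helper is hypothetical, e.g. the pyarrow loop from the earlier sketch):

```python
from torch.utils.data import IterableDataset, get_worker_info

class ShardedParquetDataset(IterableDataset):
    """Splits a list of Parquet part-files across DataLoader workers,
    so num_workers > 0 doesn't make every worker read every file."""

    def __init__(self, file_paths):
        self.file_paths = sorted(file_paths)

    def __iter__(self):
        info = get_worker_info()
        if info is None:                  # single-process loading
            paths = self.file_paths
        else:                             # worker k takes every num_workers-th file
            paths = self.file_paths[info.id::info.num_workers]
        for path in paths:
            yield from read_parquet_rows(path)  # hypothetical reader (see earlier sketch)
```

It's not Spark-level distribution, but it keeps the reading parallel and out of the training loop.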