Greetings,
I have been working on a project to implement distributed training and data parallelization for a deep learning object detection model. I have been using PySpark, PyTorch, and SparkTorch, but unfortunately the SparkTorch setup has not been working properly.
Could anyone provide resources or insights into these questions?
- Is the use of SparkTorch (or similar tools such as TensorFlowOnSpark or Horovod) actually necessary for distributed training in this task?
- I have successfully loaded the necessary files from HDFS as a PySpark DataFrame. Could I potentially bypass SparkTorch and feed this dataset directly into a PyTorch deep learning model?
- If the above is feasible, could you guide me through converting a PySpark DataFrame into a format that PyTorch can consume?
- If we can indeed bypass SparkTorch, would the training data still be parallelized, and would the training itself still be distributed?
- In this scenario, what role would SparkTorch (or similar tools) typically play?
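To make the third question concrete, here is a minimal sketch of the conversion I have in mind, assuming I first collect the Spark DataFrame to the driver with `toPandas()`. The pandas DataFrame below stands in for that collected result, and the column names (`feature_a`, `feature_b`, `label`) are invented purely for illustration:

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for `spark_df.toPandas()`, i.e. the Spark DataFrame collected
# onto the driver. Columns here are hypothetical.
pdf = pd.DataFrame({
    "feature_a": [0.1, 0.2, 0.3, 0.4],
    "feature_b": [1.0, 2.0, 3.0, 4.0],
    "label":     [0, 1, 0, 1],
})

# Convert the numeric columns to PyTorch tensors via NumPy.
features = torch.from_numpy(
    pdf[["feature_a", "feature_b"]].to_numpy(dtype=np.float32))
labels = torch.from_numpy(pdf["label"].to_numpy(dtype=np.int64))

# Wrap everything in a Dataset/DataLoader so a standard
# single-node PyTorch training loop can consume it in batches.
dataset = TensorDataset(features, labels)
loader = DataLoader(dataset, batch_size=2, shuffle=True)

for batch_features, batch_labels in loader:
    # Each batch: features of shape (2, 2), labels of shape (2,)
    print(batch_features.shape, batch_labels.shape)
```

My concern with this approach (and the reason for my last two questions) is that `toPandas()` pulls the entire dataset onto the driver, so the training loop itself would no longer be distributed.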
For additional context, I initially attempted to use the latest version of PySpark to take advantage of the built-in TorchDistributor (`pyspark.ml.torch.distributor`, available since PySpark 3.4). However, due to certain constraints, I am currently working with PySpark 2.4 and PyTorch.