Distributed Training with PyTorch

I’m working on a project that uses PyTorch to build an object detection model from satellite imagery, and my immediate goal is to train this model in a distributed fashion using PySpark.
While I have found several tutorials and examples on image classification, I’m having trouble translating them to my use case.
Specifically, I believe I need to load the images and annotation files with PySpark and then convert them into a format PyTorch can consume for object detection. I’d welcome any advice or pointers to tutorials or examples that would help me build the model.
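On the conversion step: one concrete detail that trips people up is box format. COCO-style annotation files store boxes as `[x, y, width, height]`, while torchvision's detection models expect corner format `[x1, y1, x2, y2]` in the target dict. A minimal, dependency-free helper (the example values are made up):

```python
def coco_box_to_corners(box):
    """Convert a COCO-style [x, y, w, h] box to [x1, y1, x2, y2] corners."""
    x, y, w, h = box
    return [x, y, x + w, y + h]

# Example: a 30x40 box anchored at (10, 20).
print(coco_box_to_corners([10, 20, 30, 40]))  # -> [10, 20, 40, 60]
```

This conversion is needed regardless of whether the files come from HDFS, Spark, or a local directory.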
During my search, I came across resources on SparkTorch, pyspark.ml.torch.distributor, and Horovod. However, I’ve been having trouble installing Horovod, so I’d appreciate guidance on that as well.
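For reference, the `pyspark.ml.torch.distributor` route usually looks like the sketch below (PySpark >= 3.4). The model choice, worker counts, and `num_classes` are placeholder assumptions, not part of the original question:

```python
# Hedged sketch: distributing an existing PyTorch training loop with
# pyspark.ml.torch.distributor.TorchDistributor. Placeholder values throughout.

def train_fn(num_epochs):
    # Imports live inside the function so they are resolved on each Spark worker.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    import torchvision

    # TorchDistributor sets up the usual torch.distributed environment
    # variables, so a normal process-group init works here.
    dist.init_process_group("nccl" if torch.cuda.is_available() else "gloo")

    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
        weights=None, num_classes=2  # assumption: 1 object class + background
    )
    model = DDP(model)  # on GPU you would also pass device_ids=[local_rank]

    # ... build a Dataset/DataLoader over your images and annotations,
    # then run a standard training loop for num_epochs ...

    dist.destroy_process_group()


def launch_on_spark():
    # Run this on the Spark driver.
    from pyspark.ml.torch.distributor import TorchDistributor

    distributor = TorchDistributor(num_processes=2, local_mode=False, use_gpu=True)
    distributor.run(train_fn, 10)
```

The key point is that `TorchDistributor` only launches and wires up the processes; the training loop itself is ordinary DistributedDataParallel PyTorch.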

One alternative to PySpark is Ray. Ray is widely used for distributed training, so there are plenty of guides online you can consult to see whether it fits your needs.
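To give a feel for it, Ray Train's `TorchTrainer` plays roughly the role that `TorchDistributor` plays on Spark. A minimal sketch (the model, worker count, and `num_classes` below are assumptions for illustration, Ray 2.x API):

```python
# Hedged sketch of distributed PyTorch training with Ray Train (Ray 2.x).

def train_loop_per_worker(config):
    # Imports inside the function so they resolve on each Ray worker.
    import torchvision
    from ray.train.torch import prepare_model

    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
        weights=None, num_classes=config["num_classes"]
    )
    # prepare_model wraps the model in DDP and moves it to the right device.
    model = prepare_model(model)

    # ... standard PyTorch training loop over your DataLoader ...


def launch_with_ray():
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"num_classes": 2},  # assumption: 1 class + background
        scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    )
    return trainer.fit()
```

As with the Spark distributor, the training loop stays plain PyTorch; Ray handles process launch and DDP wiring.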

I have experience building object detection models with PyTorch and TensorFlow, but I’m new to data parallelism and distributed training, so I have a few questions to clarify my understanding.

I currently have my image and annotation files stored in HDFS (Hadoop Distributed File System) directories. Initially I assumed I would need PySpark to load the data for data-parallel, distributed training of deep learning models. However, after reading through some Ray examples, I realized that Ray does not rely on Spark.

Here are my questions:

  1. Does Ray fulfill a similar role to Spark?
  2. Do you believe it’s unnecessary to use PySpark even if our files are stored in HDFS?
  3. When using Ray, is it necessary for me to have stored my files in HDFS?

I don’t use Spark or HDFS myself, so my input will be limited.

  1. Ray is lower-level than Spark; you can even run Spark on top of Ray. Spark is recommended when database-style operations are required. See this discussion: https://discuss.ray.io/t/what-is-the-difference-between-ray-and-spark/1578

  2. You can load from HDFS when using Ray. See this discussion: https://discuss.ray.io/t/ray-distributed-load-from-hdfs/6043

  3. It is not necessary. You can use regular PyTorch `DataLoader`s, torchvision datasets such as `ImageFolder`, and so on, against a local filesystem.
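To make point 2 concrete: Ray Data can read directly from an `hdfs://` URI (it goes through PyArrow's HDFS support under the hood). A minimal sketch, where the namenode host/port and the paths are placeholders, not real values:

```python
# Hedged sketch: loading images and annotation files from HDFS with Ray Data.
# "namenode:8020" and the /data/satellite/... paths are placeholder assumptions.

def load_from_hdfs():
    import ray

    # read_images yields a Dataset of decoded image arrays; annotation files
    # (e.g. COCO JSON or VOC XML) can be read as raw bytes and parsed later.
    images = ray.data.read_images("hdfs://namenode:8020/data/satellite/images")
    annotations = ray.data.read_binary_files(
        "hdfs://namenode:8020/data/satellite/annotations"
    )
    return images, annotations
```

The same calls work with local paths or `s3://` URIs, which is why HDFS is optional when using Ray.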