ARM processor offloading

Hi, I’m new to PyTorch.
I’ve been wondering whether there is any reference or project, ongoing or already finished, on offloading tasks to an ARM processor.
Here is why I ask.
As far as I know, target devices such as GPUs and FPGAs are used to offload the computation of NN models.
These target devices are assumed to be connected to the motherboard via PCIe.

My project is about offloading the computation workload of an NN model to an ARM processor for now, and to DSPs in the future.
So I’m working on building PyTorch so that it detects an ARM processor on the PCIe bus and includes a library for executing offloaded work on that processor.
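
To make the detection step concrete, here is a rough sketch of the kind of check I have in mind. It is only an illustration and not part of PyTorch: it assumes a Linux host that exposes PCI devices under `/sys/bus/pci/devices`, and the function name `find_pci_devices` and the vendor ID `0x1234` are placeholders I made up (the real ID would depend on the board vendor).

```python
from pathlib import Path


def find_pci_devices(vendor_id, sysfs_root="/sys/bus/pci/devices"):
    """Return PCI addresses of devices whose vendor ID equals vendor_id.

    Hypothetical helper: assumes the Linux sysfs layout, where each device
    directory contains a 'vendor' file holding a hex string such as '0x10de'.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        # Not Linux, or sysfs is not mounted: report nothing found.
        return []

    matches = []
    for dev in sorted(root.iterdir()):
        try:
            vendor = int((dev / "vendor").read_text().strip(), 16)
        except (OSError, ValueError):
            # Skip entries without a readable, well-formed vendor file.
            continue
        if vendor == vendor_id:
            matches.append(dev.name)
    return matches


if __name__ == "__main__":
    # 0x1234 is a placeholder vendor ID, not a real ARM board.
    print(find_pci_devices(0x1234))
```

This only tells me whether the board is visible on the bus. Actually dispatching PyTorch operators to it would still need a driver and a backend on top, which is the part I am asking about.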

In summary, my questions are:

  1. Are there any projects or references I can look at on offloading a task/computation to an ARM processor over a PCIe interface?
  2. If not, what about NEON? I can find some NEON code paths in the PyTorch source code. So I wonder whether it is possible to offload a task to a NEON-capable ARM processor on a board that is connected to the motherboard via PCIe.