One approach…
Start two Python programs in separate interpreters to avoid the GIL.
Process 1
- Put the tensor on cuda:0 and compute the output.
- Serialize the output and push it to a shared Redis database (sketch below).
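Here is a minimal sketch of the producer side, assuming a local Redis server; the queue name `activations` and the stand-in `stage1` layer are mine, not part of any particular setup:

```python
# Process 1: run the first stage on cuda:0 and push the result to Redis.
import io

import redis
import torch

r = redis.Redis(host="localhost", port=6379)
stage1 = torch.nn.Linear(512, 512).to("cuda:0")  # stand-in for the first model half

x = torch.randn(32, 512, device="cuda:0")
with torch.no_grad():
    out = stage1(x)

# Serialize the tensor (moved to CPU first) and push it onto a Redis list.
buf = io.BytesIO()
torch.save(out.cpu(), buf)
r.rpush("activations", buf.getvalue())
```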
Process 2
- The consumer picks the output up from the database and pushes it to cuda:1.
- The consumer runs the next step of the calculation (sketch below).
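And the matching consumer sketch, popping from the same hypothetical `activations` queue:

```python
# Process 2: pop a serialized tensor from Redis and continue on cuda:1.
import io

import redis
import torch

r = redis.Redis(host="localhost", port=6379)
stage2 = torch.nn.Linear(512, 512).to("cuda:1")  # stand-in for the second model half

# blpop blocks until an item arrives; it returns (queue_name, payload).
_, payload = r.blpop("activations")
out = torch.load(io.BytesIO(payload), map_location="cuda:1")

with torch.no_grad():
    result = stage2(out)
print(result.shape)
```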
If you need to send gradients back for backprop, you can store and reload them the same way.
That’s one way… not easy, though. I spent easily a month just trying to distribute calculations over multiple processes.
If you can pull it off… then it’s an awesome skill.
Also, there is the Ray project (https://github.com/ray-project/ray), a unified framework for scaling AI and Python applications, with a core distributed runtime and a set of AI libraries for accelerating ML workloads.
I tried using it. It had great promise, but ended up being a bit too new at the time. It might be a bit more mature now.
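For comparison, a minimal Ray sketch of the same two-stage split; the placeholder stages are mine, and it assumes a machine (or cluster) with two GPUs available:

```python
# Two remote tasks, each pinned to one GPU; Ray handles the data movement.
import ray
import torch

ray.init()

@ray.remote(num_gpus=1)
def stage1(x):
    return (x.cuda() * 2).cpu()  # placeholder for the first half of the model

@ray.remote(num_gpus=1)
def stage2(x):
    return (x.cuda() + 1).cpu()  # placeholder for the second half

x = torch.randn(32, 512)
# Passing stage1's ObjectRef straight into stage2 chains the two tasks.
result = ray.get(stage2.remote(stage1.remote(x)))
print(result.shape)
```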