Writing DataPipe code feels like writing in a new language in which every statement transforms one DataPipe into another, combines several DataPipes into one, or splits one into several. So in theory, if all the DataPipes are PyTorch built-ins, one could analyze the “graph” of the DataPipe code, translate it into something similar to TorchScript, and execute it in C++ in a multithreaded manner. I was wondering whether the PyTorch devs have already explored this idea.
Thanks for asking. We are considering jittable DataPipes, but for now we haven’t explored the option of moving DataPipe itself to C++.
We might be able to JIT a chain of map functions such as
datapipe.map(fn1).map(fn2).map(fn3) into a single function call, so that those functions could also be JIT-compiled together.
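To make the fusion idea concrete, here is a minimal sketch in plain Python. It is not how DataPipe or TorchScript does it: `fuse_maps`, `fn1`, `fn2`, and `fn3` are hypothetical names made up for illustration, and the built-in `map` stands in for `datapipe.map`. It only shows that a chain of maps is equivalent to one map over the composed function, which is the single callable a JIT compiler would then get to optimize.

```python
def fuse_maps(*fns):
    """Compose fns left to right into one callable (hypothetical helper)."""
    def fused(x):
        for fn in fns:
            x = fn(x)
        return x
    return fused


# Hypothetical per-sample transforms standing in for fn1, fn2, fn3.
def fn1(x):
    return x + 1


def fn2(x):
    return x * 2


def fn3(x):
    return x - 3


data = range(5)

# Three chained maps: every sample passes through three separate map stages.
chained = list(map(fn3, map(fn2, map(fn1, data))))

# One fused map: every sample passes through a single stage.
fused = list(map(fuse_maps(fn1, fn2, fn3), data))
```

Both lists hold the same values. The fused version gives a compiler one function to work with instead of three separate stages.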
Also note that it’s kind of hard to run a data pipeline in a multithreaded manner and still get deterministic results.
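A small sketch of where the non-determinism comes from, using only the standard library's `concurrent.futures` rather than DataPipe itself. The helper names (`slow_square`, `ordered_parallel_map`, `unordered_parallel_map`) are hypothetical. When results are collected as workers finish, the output order depends on thread timing. Collecting them in submission order keeps the order stable, but a slow sample then holds up every sample behind it.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_square(x):
    # Simulate per-sample work whose duration varies from call to call.
    time.sleep(random.uniform(0, 0.01))
    return x * x


def ordered_parallel_map(fn, items, workers=4):
    # Executor.map yields results in input order, so the output is deterministic.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))


def unordered_parallel_map(fn, items, workers=4):
    # as_completed yields futures in completion order, which depends on timing.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fn, x) for x in items]
        return [f.result() for f in as_completed(futures)]


ordered = ordered_parallel_map(slow_square, range(20))
unordered = unordered_parallel_map(slow_square, range(20))
```

`ordered` always matches the sequential result. `unordered` contains the same values, but its order can differ between runs.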
We are open to any new ideas for making DataPipe more efficient.