GPU Pipelines Scripting

Hi guys.

Is there a way (and are there materials/docs) to extend torch functionality by writing operations directly at the GPU-instruction level? Or does this have some special term I could google?

Or maybe torch allows writing something similar to SQL transactions/query batches/macros, so that a defined calculation pipeline would be processed in a single GPU pipeline without switching device context?

E.g. something like

with torch.trx('mps'):
   cat = torch.cat([a,b,c])
   return cat.sum()

Thank you

I don’t fully understand your question, so please correct me if I’ve misunderstood it.

You can write custom operators as described in this tutorial, which allows you to write CUDA code directly if needed.
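As a rough illustration of the custom-operator route, here is a minimal sketch using torch.library to define an op and register a kernel for a specific backend (the namespace, op name, and kernel are made up for this example; a real CUDA kernel would be registered under the "CUDA" dispatch key instead):

```python
import torch

# Define a custom op in an illustrative "myops" namespace.
lib = torch.library.Library("myops", "DEF")
lib.define("double(Tensor x) -> Tensor")

# A toy CPU kernel; a CUDA implementation would be registered separately.
def double_cpu(x):
    return x * 2

lib.impl("double", double_cpu, "CPU")

# The op is now callable through the dispatcher like a built-in.
print(torch.ops.myops.double(torch.tensor([1.0, 2.0])))  # tensor([2., 4.])
```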

I guess you might be interested in the torch.device context manager, which will execute all operations defined inside the context on the specified device.
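For example, a minimal sketch of the torch.device context manager (using "cpu" here so it runs anywhere; swap in "cuda" or "mps" if available):

```python
import torch

# Inside the context, factory functions allocate tensors on the given
# device, so the whole chain of ops runs there.
with torch.device("cpu"):
    a = torch.randn(4)
    b = torch.randn(4)
    total = torch.cat([a, b]).sum()

print(total.device)  # cpu
```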

Sure thing. I don’t fully understand it myself; it’s just a rough vision. :)

In my understanding, each single operator is executed on a specific device via a call from the CPU process to that particular device. Because the device architecture is specialized for tensor operations, they run much faster there. The computation flow then returns to the CPU process, e.g.

composition = torch.cat([a, b])
total = composition.sum()

In this example, cat and sum are executed on the tensors’ device, but to chain these operations we switch back to the CPU in between (to the Python code itself) to compose the flow.

So my question was whether it is possible, in theory, to process both operations in a single command on the device (say CUDA), without the intermediate round trip back to the CPU. My thought was that this should increase overall performance.
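A rough sketch of what "a single command" could look like today, assuming TorchScript matches the intent: scripting captures the whole chain as one graph, so the runtime can execute it without bouncing through the Python interpreter between ops (torch.compile in PyTorch 2.x is a newer alternative with similar goals):

```python
import torch

# TorchScript compiles the function into a single graph; the chained
# cat + sum are dispatched from the scripted runtime, not from Python
# line by line.
@torch.jit.script
def cat_sum(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return torch.cat([a, b]).sum()

print(cat_sum(torch.ones(3), torch.ones(2)))  # tensor(5.)
```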

Great. This seems related to what I’ve been looking for. I’ll dive into it. Thanks.

Do I understand it right that each device (CUDA, MPS, etc.) has its own code instructions, so that in the case of custom operators each one has to be reimplemented separately per device?

Thank you.