Using asyncio while waiting on GPU

This is more on the inference side of things, but while I am passing an image through a network and waiting on the GPU, I would like to get a head start on performing the CPU-bound tasks for the next image. Would Python's asyncio be a path to go down? If so, can someone help me get started? Here is the pseudo code.

async def process_frame(frame):
    return forward(frame)  # GPU-bound inference

async def main():
    frame = next_frame()  # CPU bound
    while frame is not None:
        data = await process_frame(frame)  # GPU-bound inference
        frame = next_frame()  # CPU bound

CUDA calls are already executed asynchronously, so as long as you don't manually wait for the return value of this computation, your CPU code will continue executing until a synchronization point is reached (e.g. print(output)).
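As a minimal sketch of that behavior (the model and tensor sizes below are placeholders, not anything from this thread), the launch returns almost immediately and the CPU is free until something forces a synchronization:

import time
import torch

model = torch.nn.Linear(4096, 4096).cuda()   # placeholder model (assumption)
x = torch.randn(256, 4096, device="cuda")

t0 = time.perf_counter()
out = model(x)                               # kernels are queued; the call returns almost immediately
t1 = time.perf_counter()                     # the CPU is free here, e.g. to prepare the next frame
print(f"launch took {t1 - t0:.6f}s")

print(out.sum().item())                      # .item() copies to the CPU and forces a synchronization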


Does this mean that training a NN should not block other functions in that single thread?

I tried it and, unfortunately, it does block.

Do you know of, and could you provide, an example where training is executed with asyncio's await syntax and does not block other tasks?

Thanks a lot!

I haven’t used asyncio, since the CUDA calls are asynchronous by default.
You should see overlapping compute if your CPU is fast enough to schedule the kernels and can run ahead. If that's not the case (e.g. if the GPU workload is too small) you would be CPU-bound and the kernel execution won't overlap.

How can I check that?

You can check it with a profiler such as Nsight Systems and look e.g. for gaps between the kernel executions on the GPU timeline.
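As a rough sketch, you could either capture a timeline with Nsight Systems and inspect the GPU rows in the GUI, or get a quick per-kernel summary from inside the script with torch.profiler (the model below is just a placeholder):

# One common Nsight Systems invocation (then open the report in the GUI):
#   nsys profile -o my_report python my_script.py
# Alternatively, torch.profiler gives a quick look from inside the script:
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(4096, 4096).cuda()   # placeholder model (assumption)
x = torch.randn(256, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        y = model(x)
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))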

Is there a way to use asyncio to return control to the loop when waiting for synchronization points?

I’m not familiar enough with asyncio to give a clear answer. :confused:

In our research we ended up wrapping PyTorch in a ThreadPool with a single thread, which lets asyncio free up the current MainThread to run other asyncio tasks (like data downloads) while PyTorch is computing.

import asyncio
import functools
from concurrent import futures

_POOL = futures.ThreadPoolExecutor(max_workers=1)

def compute_wrapper(data):
    model = thread_context.get("model")  # look up (or create) the model in the thread context
    return model.forward(data)

# ... inside the asyncio loop ...
async def compute_event(data_url):
    data = await http_client.get(data_url)  # async HTTP client
    result = await asyncio.get_event_loop().run_in_executor(
        _POOL, functools.partial(compute_wrapper, data))
    return result


# ... somewhere in the async code ...
async def main():
    coros = [compute_event(data_url) for data_url in data_list]
    results = await asyncio.gather(*coros)
    return results

This code performs len(data_list) concurrent downloads on the asyncio main thread and runs the forward passes on the single model without blocking that thread while waiting for the PyTorch result: the thread that waits on PyTorch is the one in the ThreadPool, so the main thread is free to keep downloading more data.

If the download time is larger than the compute time you will see a noticeable improvement in the speed of your script. If the compute time is much, much lower than your download time, the result will be about the same as simply waiting for PyTorch to compute while downloading.

Note: there is actually no way to guarantee that all PyTorch code is written in a thread-safe way, so we ended up creating a single thread that interacts with the model. If you can load a different model for each thread, you can increase the thread pool size. In any case you won't see any speed improvement from loading multiple models per thread while sharing the same GPU; multiple GPUs can improve things.
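If you do go the one-model-per-thread route, a minimal sketch using threading.local (this is an assumption on my side, not the thread_context helper used above, and the model is a placeholder) could look like this:

import threading
import torch
from concurrent import futures

_LOCAL = threading.local()
_POOL = futures.ThreadPoolExecutor(max_workers=2)   # one model per worker thread

def _get_model():
    # lazily create one model per thread; placeholder architecture (assumption)
    if not hasattr(_LOCAL, "model"):
        _LOCAL.model = torch.nn.Linear(4096, 10).cuda().eval()
    return _LOCAL.model

def compute_wrapper(data):
    model = _get_model()
    with torch.no_grad():
        return model(data)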

This code is good if you work with FastAPI, ASGI, aiopika (RabbitMQ), aiokafka, aiompq, or other async environments where the number of concurrent IO operations is really high and must be shared with the CPU-bound work done by PyTorch. But in a normal research, scripting and/or development setting this only complicates the work.