Measuring inference performance for multi-tasking on a single GPU

Hi all,

I want to measure the GPU inference time for multi-tasking on a single GPU. I am using the code below:

with torch.no_grad():
    starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
    starter.record()
    output = model(input_batch)
    ender.record()

    torch.cuda.synchronize()
    curr_time = starter.elapsed_time(ender)

However, torch.cuda.synchronize seems to synchronize the whole GPU, while I want to measure the GPU inference time for each task separately. How can I synchronize each task individually and measure its actual GPU inference time?

Any help is appreciated.

torch.cuda.synchronize accepts a device argument as seen in the docs. Also the Event object provides a synchronize() method in case you want to use it.
PS: you can post code snippets by wrapping them into three backticks ```, which makes debugging easier.
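As a rough sketch of the Event.synchronize() approach (the model and input below are just placeholders for your own), you only block until the end event has completed rather than synchronizing the whole device:

```python
import torch
import torch.nn as nn

# Placeholder model and input; substitute your own model / input_batch.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU()).cuda().eval()
input_batch = torch.randn(8, 3, 224, 224, device="cuda")

with torch.no_grad():
    starter = torch.cuda.Event(enable_timing=True)
    ender = torch.cuda.Event(enable_timing=True)
    starter.record()
    output = model(input_batch)
    ender.record()
    # Wait only for the end event instead of synchronizing the entire device.
    ender.synchronize()
    curr_time = starter.elapsed_time(ender)  # in milliseconds
```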

Thanks for the reply. torch.cuda.synchronize accepts a device argument, but since I co-run multiple tasks on the same single GPU, there is only one device. I am not sure torch.cuda.synchronize helps under this co-running condition.

As for the Event object, it seems that torch.cuda.Event.synchronize() does not work here.

with torch.no_grad():
    starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
    starter.record()
    output = model(input_batch)
    ender.record()
    torch.cuda.synchronize()
    curr_time = starter.elapsed_time(ender)

I’m not sure how you are “co-running”, but in case you are using streams you might want to sync them.
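If that's the case, something along the lines of this sketch might work; model_a, model_b, and input_batch are placeholders for whatever tasks you co-run, and each task records its events on its own torch.cuda.Stream:

```python
import torch
import torch.nn as nn

# Placeholder models and input; substitute the tasks you actually co-run.
model_a = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU()).cuda().eval()
model_b = nn.Sequential(nn.Conv2d(3, 32, 3), nn.ReLU()).cuda().eval()
input_batch = torch.randn(8, 3, 224, 224, device="cuda")

stream_a = torch.cuda.Stream()
stream_b = torch.cuda.Stream()

timings = {}
with torch.no_grad():
    for name, model, stream in [("task_a", model_a, stream_a),
                                ("task_b", model_b, stream_b)]:
        starter = torch.cuda.Event(enable_timing=True)
        ender = torch.cuda.Event(enable_timing=True)
        # Launch each task on its own stream; record() uses the current stream.
        with torch.cuda.stream(stream):
            starter.record()
            model(input_batch)
            ender.record()
        timings[name] = (starter, ender)

# Wait only on each task's own end event, then read its elapsed time.
for name, (starter, ender) in timings.items():
    ender.synchronize()
    print(name, starter.elapsed_time(ender), "ms")
```

Note that kernels from the two streams still share the same GPU resources, so the measured times reflect the co-running (interfered) execution rather than each task in isolation.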

Thanks for your help!