How to find the cause of function runtime instability

I have a simple feature extraction function. It transfers results computed on the GPU to the CPU, then combines them with one additional ID to form a set of IDs (approximately 256k). Finally, it extracts the corresponding rows from the “feat” data. During testing, I found that the runtime of this function is highly unstable: in my printout, it varies from 0.005s to 0.026s. I would like to understand the cause of this and explore possible optimizations to stabilize the runtime.

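import time

def featMerge(self, nodeids):
    # NOTE: temp_merge_id is not defined in the posted snippet; it is
    # presumably an index tensor allocated elsewhere.
    toCPUTime = time.time()
    # Device-to-host copy: blocks until previously queued GPU work is done.
    nodeids = nodeids.to(device='cpu')
    print("to CPU time {}s".format(time.time() - toCPUTime))
    catTime = time.time()
    # Slot 0 holds the additional ID; the remaining slots take the node IDs.
    temp_merge_id[1:] = nodeids
    print("cat time {}s".format(time.time() - catTime))
    featTime = time.time()
    # Gather the feature rows for all merged IDs.
    test = self.feats[temp_merge_id]
    print("feat merge {}s".format(time.time() - featTime))
    print("all merge {}s".format(time.time() - toCPUTime))
    return test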

CUDA operations are executed asynchronously, while the D2H (device-to-host) transfer is blocking. Without explicit synchronization, a blocking call absorbs the execution time of all previously queued CUDA work, so the reported timings are invalid.
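A minimal sketch of timing with explicit synchronization follows. The helper name `timed` and the random `nodeids` tensor are assumptions made for illustration and are not part of the original code; the snippet falls back to the CPU when no GPU is available.

```python
import time
import torch

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds).

    Hypothetical helper: synchronizes before and after the call so that
    pending GPU work is not attributed to the timed step.
    """
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        torch.cuda.synchronize()  # wait for previously queued kernels
    start = time.perf_counter()
    out = fn(*args)
    if use_cuda:
        torch.cuda.synchronize()  # wait for the timed work itself
    return out, time.perf_counter() - start

device = "cuda" if torch.cuda.is_available() else "cpu"
# Stand-in for the IDs computed on the GPU (about 256k in the post).
nodeids = torch.randint(0, 1_200_000, (256_000,), device=device)

cpu_ids, elapsed = timed(lambda t: t.to("cpu"), nodeids)
print("to CPU time {:.6f}s".format(elapsed))
```

With the first synchronization in place, the time spent by earlier kernels is no longer counted as part of the copy, which is the likely source of the 0.005s to 0.026s spread.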

Thanks a lot for the reply, it helped me a lot. I would also like to ask: is there a better way to extract the features for the specified indices from feat? In my project, this is the runtime bottleneck.

test = self.feats[temp_merge_id]

Indexing is the right operation to get the features. Did you properly synchronize the code and narrow down the bottleneck to this op? If so, how slow is it compared to other operations?
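One way to measure the indexing op in isolation is `torch.utils.benchmark.Timer`, which takes care of CUDA synchronization. This is only a sketch: the tensor sizes and the feature width of 16 are made-up, scaled-down placeholders rather than values from the thread.

```python
import torch
from torch.utils.benchmark import Timer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical, scaled-down stand-ins for self.feats and temp_merge_id.
feats = torch.randn(120_000, 16, device=device)
ids = torch.randint(0, feats.shape[0], (25_600,), device=device)

timer = Timer(stmt="feats[ids]", globals={"feats": feats, "ids": ids})
measurement = timer.timeit(20)  # run the statement 20 times

print("feats[ids]: median {:.3f} ms".format(measurement.median * 1e3))
```

Running the same measurement for the other steps gives numbers that can be compared directly.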

Yes, after analysis, the bottleneck is indeed at this point. My other operations take approximately 0.02 seconds in total, while the feature extraction step takes about 0.03 seconds (looking up 256,000 positions, possibly with duplicates, among 1.2 million indices). Compared to other existing implementations, my self-built solution is much slower. I’m not sure whether there is anything here that can be optimized.
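For reference, here is a sketch of a few equivalent ways to write the gather, including one that deduplicates the indices first. The sizes and names are hypothetical and scaled down. None of these variants is guaranteed to be faster than plain indexing, so each one should be timed with proper synchronization on the real data.

```python
import torch

torch.manual_seed(0)

# Hypothetical, scaled-down stand-ins for the 1.2M-row feature table
# and the 256k (possibly duplicated) IDs described above.
num_rows, feat_dim, num_ids = 12_000, 16, 2_560
feats = torch.randn(num_rows, feat_dim)
ids = torch.randint(0, num_rows, (num_ids,))

# 1) Plain advanced indexing, as in the original code.
baseline = feats[ids]

# 2) index_select along dim 0, writing into a preallocated buffer so the
#    output tensor can be reused across calls.
out = torch.empty(num_ids, feat_dim)
torch.index_select(feats, 0, ids, out=out)

# 3) Deduplicate first: gather each distinct row once, then expand with
#    the inverse mapping. This can help only when duplicates are frequent
#    and reading from feats is the expensive part.
unique_ids, inverse = torch.unique(ids, return_inverse=True)
dedup = feats[unique_ids][inverse]

print(torch.equal(baseline, out), torch.equal(baseline, dedup))
```

All three produce the same result, so the choice between them comes down to measured speed on the actual workload.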