Why do intra-op parallelism threads still consume CPU after the computation is done?

I call optimizer.step() in Python on the CPU. By default, PyTorch uses 20 intra-op threads on my server. After optimizer.step() the program calls time.sleep(3). While optimizer.step() is running, CPU usage rises to ~1900%, and the step takes ~50 ms. After optimizer.step() returns, however, CPU usage stays at ~1900% for another ~450 ms before dropping to 0.

If I call torch.set_num_threads(5), CPU usage rises to ~500% when optimizer.step() starts. But after the computation is done, CPU usage stays at ~400-500% for about 8x the computation time.
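For anyone who wants to reproduce this, here is a minimal sketch of the measurement. The model, tensor sizes, and choice of Adam are placeholders I made up for illustration, not my real workload, and the helper names (`timed`, `cpu_percent_over`) are hypothetical. CPU usage is estimated as process CPU time divided by wall-clock time, so 100% means one fully busy core.

```python
import time

NUM_THREADS = None  # e.g. 5 to mimic torch.set_num_threads(5); None keeps the default


def timed(fn):
    """Run fn() and return (wall time in ms, process-wide CPU usage in percent)."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    fn()
    wall1, cpu1 = time.perf_counter(), time.process_time()
    wall = wall1 - wall0
    return wall * 1000.0, 100.0 * (cpu1 - cpu0) / wall


def cpu_percent_over(seconds):
    """Sleep for `seconds` and return process-wide CPU usage over that window."""
    _, pct = timed(lambda: time.sleep(seconds))
    return pct


def main():
    try:
        import torch  # imported here so the helpers above stay stdlib-only
    except ImportError:
        print("PyTorch is not installed; skipping the reproduction.")
        return

    if NUM_THREADS is not None:
        torch.set_num_threads(NUM_THREADS)
    print("intra-op threads:", torch.get_num_threads())

    # Placeholder workload (assumption): one large linear layer with Adam.
    model = torch.nn.Linear(4096, 4096)
    optimizer = torch.optim.Adam(model.parameters())
    model(torch.randn(64, 4096)).sum().backward()

    step_ms, step_pct = timed(optimizer.step)
    print(f"optimizer.step(): {step_ms:.1f} ms at {step_pct:.0f}% CPU")

    # The main thread only sleeps from here on, so any CPU time counted
    # in these windows comes from other threads in the process.
    for i in range(10):
        print(f"sleep window {i}: {cpu_percent_over(0.1):.0f}% CPU")


if __name__ == "__main__":
    main()
```

The sleep windows after the step are where the leftover CPU usage should show up, since the main thread itself is idle there.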

Why do the intra-op parallelism threads keep consuming CPU after the computation is done? Is it clean-up, or some asynchronous operation?