CPU/GPU usage rate sometimes drops to 0 on a 32G V100

Philokey · February 19, 2021, 7:33am

I ran into a strange problem on a machine with a 32G V100 that does not happen on a Titan Xp or a 16G V100. When running the following code, CPU usage sometimes drops to 0. At the same time, the time cost of aten::addmm rises to more than 100 ms, whereas it is far lower under normal circumstances. You can get the trace.json generated by the profiler here

I am using PyTorch 1.7.1+cu101, and no other programs are running on this machine when I test the following code. I wonder why this happens.

Thank you very much.

import time
import torch
import torch.nn as nn
import torch.autograd.profiler as profiler

# Use a single intra-op thread so each forward pass runs on one CPU core.
torch.set_num_threads(1)
mlp = nn.Linear(1000, 1000)  # CPU model; its forward pass calls aten::addmm
a = torch.rand([10, 1000])

print(time.strftime("%Y-%m-%d %H:%M:%S"), 'start')
t0 = time.perf_counter()
with profiler.profile(record_shapes=True) as prof:
    while True:
        t1 = time.perf_counter()
        out = mlp(a)
        t2 = time.perf_counter()
        # Report any forward pass that takes longer than 50 ms.
        if t2 - t1 > 0.05:
            print(time.strftime("%Y-%m-%d %H:%M:%S"), t2 - t1)
        # Stop after 120 seconds.
        if t2 - t0 > 120:
            break
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
prof.export_chrome_trace("trace.json")
print(time.strftime("%Y-%m-%d %H:%M:%S"), 'finish')
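
To find the slow aten::addmm calls in trace.json without scrolling through the trace viewer, something like the sketch below might help. It assumes the file follows the Chrome Trace Event Format, where complete events have "ph" set to "X" and carry "ts" and "dur" in microseconds; I have not verified that every profiler version writes exactly these fields. The helper names slow_events and load_trace and the 50 ms default threshold are my own illustrative choices, not part of PyTorch.

```python
import json


def load_trace(path):
    """Load a Chrome trace file from disk."""
    with open(path) as f:
        return json.load(f)


def slow_events(trace, name="aten::addmm", min_dur_us=50_000):
    """Return sorted (ts, dur) pairs, in microseconds, for slow events.

    Assumes Chrome Trace Event Format: complete events have ph == "X"
    and carry "ts" (start time) and "dur" (duration) in microseconds.
    """
    # A trace is either a bare list of events or a dict with "traceEvents".
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    hits = []
    for ev in events:
        if not isinstance(ev, dict):
            continue
        if (
            ev.get("ph") == "X"
            and ev.get("name") == name
            and ev.get("dur", 0) >= min_dur_us
        ):
            hits.append((ev.get("ts", 0), ev["dur"]))
    return sorted(hits)
```

Calling slow_events(load_trace("trace.json")) after the script above finishes should list the start time and duration of each aten::addmm call that took at least 50 ms, which can then be compared with the timestamps printed by the loop.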