torch.from_numpy().cuda() performance anomaly

Hi, when testing torch.from_numpy(input).float().cuda(), I found that performance deteriorates badly on one or a few iterations across repeated runs. What could be the reason?

The test code is as follows (to rule out other effects, I added several extra synchronizations):

import numpy as np
import torch
import time

def torch_from_numpy_test(input):
    torch.cuda.synchronize()
    T1 = time.perf_counter()
    input_torch = torch.from_numpy(input).float().cuda()
    torch.cuda.synchronize()
    T2 = time.perf_counter()
    print('torch.from_numpy time: %s ms' % ((T2 - T1)*1000))
    out = input_torch.cpu().numpy().astype(np.float32)
    torch.cuda.synchronize()       
    return out

random_array = np.random.rand(512, 384, 18).astype(np.float32)
def funA():
    all_peaks = []
    for part in range(18):
        map_ori = random_array[:, :, part]
        torch.cuda.synchronize()
        one_heatmap = torch_from_numpy_test(map_ori)
        torch.cuda.synchronize()
        all_peaks.append(one_heatmap)
    print(all_peaks[0][32][32])
funA()

The performance of the test is as follows:

torch.from_numpy time: 7.923923432826996 ms
torch.from_numpy time: 0.6886869668960571 ms
torch.from_numpy time: 0.5026236176490784 ms
torch.from_numpy time: 0.4269257187843323 ms
torch.from_numpy time: 0.42704492807388306 ms
torch.from_numpy time: 0.4085637629032135 ms
torch.from_numpy time: 0.3131367266178131 ms
torch.from_numpy time: 0.30875951051712036 ms
torch.from_numpy time: 0.30804798007011414 ms
torch.from_numpy time: 0.29971450567245483 ms
torch.from_numpy time: 0.29381364583969116 ms
torch.from_numpy time: 0.2989359200000763 ms
torch.from_numpy time: 0.7551312446594238 ms
torch.from_numpy time: 0.29600411653518677 ms
torch.from_numpy time: 0.292457640171051 ms
torch.from_numpy time: 75.8533850312233 ms
torch.from_numpy time: 0.6768964231014252 ms
torch.from_numpy time: 0.5467236042022705 ms

As you can see, the third-to-last iteration (75.85 ms) is dramatically slower than the rest.
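Since the slowdowns are sporadic, it may be easier to spot them by collecting every per-iteration time and summarizing, rather than reading individual prints. A minimal sketch (the time_repeated helper and the CPU-only stand-in workload are my own additions, not part of the original test):

```python
import time

import numpy as np

def time_repeated(fn, iters=18):
    """Run fn() `iters` times and return each iteration's wall time in ms."""
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1000.0)
    return times

# Stand-in workload (CPU-only so the sketch runs anywhere): copy one
# (512, 384) float32 heatmap slice, mirroring the shapes used above.
arr = np.random.rand(512, 384).astype(np.float32)
times = time_repeated(lambda: arr.copy())
print("min %.3f / median %.3f / max %.3f ms"
      % (min(times), float(np.median(times)), max(times)))
```

A large gap between the median and the max is a quick indicator that isolated outliers, rather than a uniformly slow transfer, are the problem.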

I have ruled out interference from CPU and GPU frequency scaling. The NVIDIA GPUs I tested on are L20, L40, and Tesla V100.

I cannot reproduce the issue locally and see:

torch.from_numpy time: 0.17070200010493863 ms
torch.from_numpy time: 0.16841700016811956 ms
torch.from_numpy time: 0.16823700025270227 ms
torch.from_numpy time: 0.1740280004014494 ms
torch.from_numpy time: 0.1677860000199871 ms
torch.from_numpy time: 0.16776600023149513 ms
torch.from_numpy time: 0.1684970002315822 ms
torch.from_numpy time: 0.16803600010462105 ms
torch.from_numpy time: 0.16790599966043374 ms
torch.from_numpy time: 0.16854699970281217 ms
torch.from_numpy time: 0.18444699981046142 ms
torch.from_numpy time: 0.16969900025287643 ms
torch.from_numpy time: 0.1677959999142331 ms
torch.from_numpy time: 0.17140300042228773 ms
torch.from_numpy time: 0.1729460000206018 ms
torch.from_numpy time: 0.21386299977166345 ms
torch.from_numpy time: 0.16987000071821967 ms
torch.from_numpy time: 0.16843699995661154 ms
0.53998953

after adding warmup iterations. Without warmup, I see the slower first iteration, which is expected:

torch.from_numpy time: 12.98495599985472 ms
torch.from_numpy time: 0.4250469992257422 ms
torch.from_numpy time: 0.2027099999395432 ms
torch.from_numpy time: 0.18392499987385236 ms
torch.from_numpy time: 0.1791260001482442 ms
torch.from_numpy time: 0.17840399959823117 ms
torch.from_numpy time: 0.1780439997673966 ms
torch.from_numpy time: 0.2653379997354932 ms
torch.from_numpy time: 0.18182100029662251 ms
torch.from_numpy time: 0.16730400056985673 ms
torch.from_numpy time: 0.16778500139480457 ms
torch.from_numpy time: 0.16728400078136474 ms
torch.from_numpy time: 0.16740399951231666 ms
torch.from_numpy time: 0.20811000104004052 ms
torch.from_numpy time: 0.16985900037980173 ms
torch.from_numpy time: 0.16719299856049474 ms
torch.from_numpy time: 0.16699300067557488 ms
torch.from_numpy time: 0.16727399997762404 ms
0.53998953
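For reference, warming up before timing can be sketched roughly like this (an illustration only; the warmup count of 10 and the CPU fallback are my assumptions, not the exact code used above):

```python
import numpy as np
import torch

# Fall back to CPU so the sketch also runs on machines without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = np.random.rand(512, 384).astype(np.float32)

# Untimed warmup iterations absorb one-time costs (CUDA context creation,
# caching-allocator growth, lazy initialization) before any measurement.
for _ in range(10):
    _ = torch.from_numpy(x).float().to(device)
if device == "cuda":
    torch.cuda.synchronize()
# ... run the timed loop here, after the warmup has completed ...
```

The point of the warmup is to push all one-time setup work out of the measured region, so the timed loop only sees steady-state transfer cost.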

@ptrblck Hi, which NVIDIA GPU model did you test on? Could you test it on an L20/L40/Tesla V100?

Or on a few more models?

In addition, can the behavior I observed occur in principle?

I used a 3090 before. The issue is also not reproducible on an L40:

torch.from_numpy time: 42.23870858550072 ms
torch.from_numpy time: 0.4154574126005173 ms
torch.from_numpy time: 0.3297366201877594 ms
torch.from_numpy time: 0.35751238465309143 ms
torch.from_numpy time: 0.32346323132514954 ms
torch.from_numpy time: 0.3537144511938095 ms
torch.from_numpy time: 0.3529563546180725 ms
torch.from_numpy time: 0.35276636481285095 ms
torch.from_numpy time: 0.3412459045648575 ms
torch.from_numpy time: 0.3423839807510376 ms
torch.from_numpy time: 0.34457631409168243 ms
torch.from_numpy time: 0.3435257822275162 ms
torch.from_numpy time: 0.32216496765613556 ms
torch.from_numpy time: 0.34396350383758545 ms
torch.from_numpy time: 0.3284439444541931 ms
torch.from_numpy time: 0.31896308064460754 ms
torch.from_numpy time: 0.359976664185524 ms
torch.from_numpy time: 0.32056495547294617 ms
0.19517606
torch.from_numpy time: 0.23299269378185272 ms
torch.from_numpy time: 0.3405846655368805 ms
torch.from_numpy time: 0.22905319929122925 ms
torch.from_numpy time: 0.34337304532527924 ms
torch.from_numpy time: 0.22673234343528748 ms
torch.from_numpy time: 0.3509838134050369 ms
torch.from_numpy time: 0.23114122450351715 ms
torch.from_numpy time: 0.3410652279853821 ms
torch.from_numpy time: 0.2221241593360901 ms
torch.from_numpy time: 0.34335628151893616 ms
torch.from_numpy time: 0.2222638577222824 ms
torch.from_numpy time: 0.2362225204706192 ms
torch.from_numpy time: 0.3552064299583435 ms
torch.from_numpy time: 0.3424752503633499 ms
torch.from_numpy time: 0.22837333381175995 ms
torch.from_numpy time: 0.347774475812912 ms
torch.from_numpy time: 0.22511184215545654 ms
torch.from_numpy time: 0.3429148346185684 ms
0.19517606