Why is dtype conversion in Torch so slow compared to numpy?

Background:

We noticed a major performance issue when running training on a server compared to running it locally. Investigation revealed that Torch’s dtype conversion of an image (3, 160, 120) from uint8 to float32 was up to 500 times slower than numpy’s. Why is this, and is it a bug?

Locally it’s “only” around 2x slower, but on the GPU server it’s almost 100x slower, and it gets even worse when using a Docker image with torch preinstalled. All conversions were done on the CPU.

The only difference between the fast and the slow variant is using torch.Tensor(img_array.astype(np.float32)) (fast) instead of torch.Tensor(img_array).to(torch.float32) (slow).
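As a sanity check, here is a minimal sketch showing that both variants produce the same float32 tensor, so the timings compare equivalent work. The third variant, torch.from_numpy(img_array).to(torch.float32), is not part of the measurements in this post. It is included only as an assumption about another common way to write the same conversion.

```python
import numpy as np
import torch

# Same random uint8 image as in the test code
img_array = np.random.randint(0, 256, size=(160, 120, 3), dtype="uint8")

# Variant measured as slow: conversion happens on the torch side
a = torch.Tensor(img_array).to(torch.float32)
# Variant measured as fast: conversion happens in numpy first
b = torch.Tensor(img_array.astype(np.float32))
# Additional variant (assumption, not benchmarked in this post)
c = torch.from_numpy(img_array).to(torch.float32)

# dtype right after the constructor, before the explicit .to() call
print(torch.Tensor(img_array).dtype)
print(a.dtype, b.dtype, c.dtype)
print(torch.equal(a, b), torch.equal(a, c))
```

If the first print already shows float32, the uint8 to float32 copy takes place inside the torch.Tensor(...) constructor, and the trailing .to(torch.float32) has nothing left to convert.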

Benchmarks (torch time / numpy time, as computed in the test code below):

  • 11th Gen Intel(R) Core™ i7-11850H @ 2.50GHz: x1.85
  • Intel Xeon Processor (Cascadelake): x88
  • Intel Xeon Processor (Cascadelake) with torch preinstalled in a docker image: x420

Torch version:

  • 2.9.1+cu128 / 2.9.0+cu128

Further information:

  • Using torchvision.transforms.v2.ToDtype() did not make a difference
  • All conversions were done on device='cpu'

Test code:

# %%
import timeit

from PIL import Image
import torch
import numpy as np
from torchvision.transforms import v2

print(torch.get_default_device()) # should be 'cpu'
print(torch.__version__)

# %%
# Random image array
img_array = np.random.randint(0, 256, size=(160, 120, 3), dtype="uint8")
print(img_array.shape, img_array.dtype, img_array.min(), img_array.max())

N = 1000  # number of timed repetitions

# %%
torch_time = timeit.timeit("torch.Tensor(img_array).to(torch.float32)", number=N, globals=globals())
print(torch_time)

# %%
np_time = timeit.timeit("torch.Tensor(img_array.astype(np.float32))", number=N, globals=globals())
print(np_time)

# %%
print(torch_time / np_time)

# %%
# Use this to get your cpu info on linux
# !lscpu

Further benchmarks using the code snippet above are welcome.
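One variant that might help narrow this down is to repeat the torch-side timing with different intra-op thread counts. This is only a sketch and rests on an untested assumption, namely that threading could play a role on the many-core server. Nothing measured in this post confirms that. The helper time_conversion is a hypothetical name introduced for illustration, and N is reduced to keep the run short on slow machines.

```python
import timeit

import numpy as np
import torch

img_array = np.random.randint(0, 256, size=(160, 120, 3), dtype="uint8")
N = 100  # fewer repetitions than in the main benchmark


def time_conversion(num_threads):
    """Time the torch-side conversion with a given intra-op thread count."""
    previous = torch.get_num_threads()
    torch.set_num_threads(num_threads)
    try:
        return timeit.timeit(
            lambda: torch.Tensor(img_array).to(torch.float32), number=N
        )
    finally:
        # Restore the original setting so later code is unaffected
        torch.set_num_threads(previous)


default_threads = torch.get_num_threads()
# Compare a single thread against whatever torch uses by default
results = {n: time_conversion(n) for n in sorted({1, default_threads})}

for n, seconds in results.items():
    print(f"threads={n}: {seconds / N * 1e6:.1f} us per conversion")
```

If the single-threaded run is much faster on the server, that would point towards thread scheduling rather than the conversion kernel itself. If both runs are equally slow, this assumption can be dropped.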

Here is some profiling information obtained with this code snippet:

from torch.profiler import profile, ProfilerActivity

# Profile 1000 torch-side conversions on the CPU
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    [torch.Tensor(img_array).to(torch.float32) for _ in range(1000)]

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

For local cpu:

-----------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------  
               aten::to         3.97%       2.701ms        98.81%      67.141ms      33.570us          2000  
         aten::_to_copy         7.08%       4.810ms        94.84%      64.440ms      64.440us          1000  
            aten::copy_        80.34%      54.592ms        80.34%      54.592ms      54.592us          1000  
    aten::empty_strided         7.41%       5.038ms         7.41%       5.038ms       5.038us          1000  
       aten::lift_fresh         1.19%     808.544us         1.19%     808.544us       0.809us          1000  
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 67.949ms

For server cpu:

-----------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------  
               aten::to         0.11%      24.915ms        99.97%       22.183s      11.091ms          2000  
         aten::_to_copy         0.63%     140.687ms        99.86%       22.158s      22.158ms          1000  
            aten::copy_        98.68%       21.898s        98.68%       21.898s      21.898ms          1000  
    aten::empty_strided         0.54%     119.391ms         0.54%     119.391ms     119.391us          1000  
       aten::lift_fresh         0.03%       7.046ms         0.03%       7.046ms       7.046us          1000  
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 22.190s

For server cpu (with preinstalled torch image):

-----------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------  
               aten::to         0.06%      21.806ms        99.98%       35.551s      17.776ms          2000  
         aten::_to_copy         0.38%     135.256ms        99.91%       35.530s      35.530ms          1000  
            aten::copy_        99.32%       35.319s        99.32%       35.319s      35.319ms          1000  
    aten::empty_strided         0.21%      75.573ms         0.21%      75.573ms      75.573us          1000  
       aten::lift_fresh         0.02%       8.566ms         0.02%       8.566ms       8.566us          1000  
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 35.560s
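
In all three tables aten::copy_ accounts for most of the time. The small calculation below puts the per-call averages side by side. The values are copied from the "CPU time avg" column of the aten::copy_ rows above, converted to microseconds. The dictionary keys are just labels chosen for this sketch.

```python
# Average aten::copy_ time per call in microseconds, taken from the tables above
copy_avg_us = {
    "local": 54.592,
    "server": 21898.0,           # 21.898 ms
    "server (docker)": 35319.0,  # 35.319 ms
}

baseline = copy_avg_us["local"]
# Slowdown of each environment relative to the local machine
slowdown = {name: t / baseline for name, t in copy_avg_us.items()}

for name, factor in slowdown.items():
    print(f"{name}: {factor:.0f}x")
```

This prints roughly 401x for the server and 647x for the server with the preinstalled torch image, relative to the local machine. These are ratios of profiled copy_ times between machines, so they are not directly comparable to the torch/numpy ratios in the benchmark list.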