We noticed a major performance issue when running training on a server compared to running it locally. Investigation revealed that PyTorch's dtype conversion of an image (3, 160, 120) from uint8 to float32 was up to 500 times slower than NumPy's. Why is this, and is it a bug?
Locally it is "only" around 2x slower, but on the GPU server it is almost 100x slower, and even worse when using a Docker image with torch preinstalled. All conversions were done on the CPU.
The only difference between the two variants is that the fast one converts with NumPy first, torch.Tensor(img_array.astype(np.float32)), while the slow one leaves the conversion to torch, torch.Tensor(img_array).to(torch.float32).
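One detail that may matter when reading the two variants: as far as I know, the legacy torch.Tensor(...) constructor always returns a tensor of the default dtype (float32 unless it has been changed), so in the slow variant the uint8 to float32 conversion would already happen inside torch.Tensor(img_array), and the trailing .to(torch.float32) would be a no-op. The following is a minimal sketch to check this, assuming the default dtype is untouched and using a random stand-in array instead of a real image:

```python
import numpy as np
import torch

# Stand-in for the real image (assumption: same shape and dtype as in the test code)
img_array = np.random.randint(0, 256, size=(160, 120, 3), dtype="uint8")

legacy = torch.Tensor(img_array)                        # legacy constructor, converts to the default dtype
via_numpy = torch.Tensor(img_array.astype(np.float32))  # NumPy does the conversion first
shared = torch.from_numpy(img_array)                    # keeps uint8 and shares memory with the array

print(legacy.dtype)                     # dtype after the legacy constructor
print(shared.dtype)                     # dtype after torch.from_numpy
print(torch.equal(legacy, via_numpy))   # both variants should hold the same values
```

If the first print shows float32, the cost measured for the slow variant sits in the constructor call rather than in .to(), which would match the profiles showing 2000 calls to aten::to but only 1000 calls to aten::_to_copy.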
Benchmarks (slowdown of the torch variant relative to the NumPy variant):
11th Gen Intel(R) Core™ i7-11850H @ 2.50GHz: x1.85
Intel Xeon Processor (Cascadelake): x88
Intel Xeon Processor (Cascadelake) with torch preinstalled in a docker image: x420
Torch version:
2.9.1+cu128 / 2.9.0+cu128
Further information:
Using torchvision.transforms.v2.ToDtype did not make a difference
All conversions were done on device='cpu'
Test code:
# %%
import timeit
from PIL import Image
import torch
import numpy as np
from torchvision.transforms import v2
print(torch.get_default_device()) # should be 'cpu'
print(torch.__version__)
# %%
# Random image array
img_array = np.random.randint(0, 256,size=(160,120,3), dtype="uint8")
print(img_array.shape, img_array.dtype, img_array.min(), img_array.max())
N=1000
# %%
# Torch variant: pass the uint8 array to torch and let torch produce the float32 tensor
torch_time = timeit.timeit("torch.Tensor(img_array).to(torch.float32)", number=N, globals=globals())
print(torch_time)
# %%
# NumPy variant: convert to float32 with NumPy first, then wrap the result in a tensor
np_time = timeit.timeit("torch.Tensor(img_array.astype(np.float32))", number=N, globals=globals())
print(np_time)
# %%
print(torch_time / np_time)
# %%
# Use this to get your cpu info on linux
# !lscpu
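For anyone rerunning the benchmark, it may help to time a third variant as well. The sketch below adds torch.from_numpy(img_array).to(torch.float32), which is not part of the original test code; it is included on the assumption that it separates the cost of the legacy torch.Tensor constructor from the cost of the dtype conversion itself. N is set lower than the 1000 used above only to keep the run short on slow machines.

```python
import timeit

import numpy as np
import torch

img_array = np.random.randint(0, 256, size=(160, 120, 3), dtype="uint8")
N = 200  # fewer repetitions than the original benchmark, to keep the run short

# The first two entries are the variants from the test code; the third is an added comparison.
variants = {
    "torch.Tensor(...).to": lambda: torch.Tensor(img_array).to(torch.float32),
    "numpy astype": lambda: torch.Tensor(img_array.astype(np.float32)),
    "from_numpy(...).to": lambda: torch.from_numpy(img_array).to(torch.float32),
}

# Total time in seconds for N calls of each variant
timings = {name: timeit.timeit(fn, number=N) for name, fn in variants.items()}

baseline = timings["numpy astype"]
for name, seconds in timings.items():
    print(f"{name:>22}: {seconds:.4f}s ({seconds / baseline:.2f}x vs numpy)")
```

If the third variant is as slow as the first, the slowdown is likely in torch's conversion kernel; if it is fast, the legacy constructor path is the more likely suspect.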
Further benchmarks using the code snippet above are welcome.
Here is some profiling information, collected with this code snippet:
from torch.profiler import profile, ProfilerActivity, record_function
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    # with record_function("to"):
    [torch.Tensor(img_array).to(torch.float32) for n in range(1000)]
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
For local cpu:
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------
                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------
               aten::to         3.97%       2.701ms        98.81%      67.141ms      33.570us          2000
         aten::_to_copy         7.08%       4.810ms        94.84%      64.440ms      64.440us          1000
            aten::copy_        80.34%      54.592ms        80.34%      54.592ms      54.592us          1000
    aten::empty_strided         7.41%       5.038ms         7.41%       5.038ms       5.038us          1000
       aten::lift_fresh         1.19%     808.544us         1.19%     808.544us       0.809us          1000
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 67.949ms
For server cpu:
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------
                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------
               aten::to         0.11%      24.915ms        99.97%       22.183s      11.091ms          2000
         aten::_to_copy         0.63%     140.687ms        99.86%       22.158s      22.158ms          1000
            aten::copy_        98.68%       21.898s        98.68%       21.898s      21.898ms          1000
    aten::empty_strided         0.54%     119.391ms         0.54%     119.391ms     119.391us          1000
       aten::lift_fresh         0.03%       7.046ms         0.03%       7.046ms       7.046us          1000
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 22.190s
For server cpu (with preinstalled torch image):
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------
                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------
               aten::to         0.06%      21.806ms        99.98%       35.551s      17.776ms          2000
         aten::_to_copy         0.38%     135.256ms        99.91%       35.530s      35.530ms          1000
            aten::copy_        99.32%       35.319s        99.32%       35.319s      35.319ms          1000
    aten::empty_strided         0.21%      75.573ms         0.21%      75.573ms      75.573us          1000
       aten::lift_fresh         0.02%       8.566ms         0.02%       8.566ms       8.566us          1000
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 35.560s
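In both server profiles nearly all of the time is spent in aten::copy_, the operation that performs the element-wise conversion. One hypothesis, which the data above does not confirm, is that intra-op threading is involved, for example if torch starts more threads than the process or container is allowed to use. A rough sketch to test this compares the timing at the default thread count with a single thread; torch.get_num_threads and torch.set_num_threads are existing torch functions, everything else here is illustrative:

```python
import os
import timeit

import numpy as np
import torch

img_array = np.random.randint(0, 256, size=(160, 120, 3), dtype="uint8")
N = 200  # fewer repetitions than the original benchmark, to keep the run short


def time_conversion():
    """Total seconds for N runs of the slow variant from the test code."""
    return timeit.timeit(lambda: torch.Tensor(img_array).to(torch.float32), number=N)


default_threads = torch.get_num_threads()
print("torch intra-op threads:", default_threads)
print("os.cpu_count():", os.cpu_count())
# CPUs this process may actually run on (Linux only, relevant inside containers)
if hasattr(os, "sched_getaffinity"):
    print("usable CPUs:", len(os.sched_getaffinity(0)))

default_time = time_conversion()

torch.set_num_threads(1)                 # force single-threaded intra-op execution
single_time = time_conversion()
torch.set_num_threads(default_threads)   # restore the previous setting

print(f"default threads: {default_time:.4f}s, single thread: {single_time:.4f}s")
```

If the single-thread run is much faster on the server, thread configuration would be a plausible cause; if both runs are equally slow, the explanation lies elsewhere.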