PyTorch tensor constructor speed vs numpy

So I was comparing the performance of the PyTorch tensor constructor to the numpy array constructor.

For PyTorch

import time
import numpy as np
import torch

total_time = 0.0
iterations = 10000
with torch.inference_mode():
    for _ in range(iterations):
        data = np.random.normal(0, 1, (1000, 10)).tolist()  # use numpy for RNG,
        # but convert back to python since we're benchmarking the constructor
        t1 = time.time()
        thing = torch.tensor(data, dtype=torch.float64)
        total_time += time.time() - t1
print(total_time / iterations)

I get 0.0003917569875717163 s per iteration.

For numpy

total_time = 0.0
iterations = 10000
for _ in range(iterations):
    data = np.random.normal(0, 1, (1000, 10)).tolist()
    t1 = time.time()
    thing = np.array(data, dtype=np.float64)
    total_time += time.time() - t1
print(total_time / iterations)

0.0002772393465042114 s. A small difference, not too concerning; maybe PyTorch tensors just carry a little more overhead.

I remembered that when PyTorch tensors are constructed from numpy arrays with torch.from_numpy, torch shares the underlying memory instead of allocating a new block and copying. Out of curiosity, I benchmarked constructing a numpy array and then constructing a tensor out of that.
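As a quick sanity check that from_numpy really does share memory rather than copy (a minimal sketch, values illustrative):

import numpy as np
import torch

arr = np.zeros(3, dtype=np.float64)
t = torch.from_numpy(arr)  # wraps arr's buffer, no copy
t[0] = 1.0
print(arr)  # [1. 0. 0.] -- the numpy array sees the change

Anyway, the benchmark: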

total_time = 0.0
iterations = 10000
with torch.inference_mode():
    for _ in range(iterations):
        data = np.random.normal(0, 1, (1000, 10)).tolist()  # use numpy for RNG,
        # but convert back to python since we're benchmarking the constructor
        t1 = time.time()
        thing = torch.from_numpy(np.array(data, dtype=np.float64))
        total_time += time.time() - t1
print(total_time / iterations)

0.00027820470333099365 s. That seems really weird: apparently the performance difference isn’t in the overhead of the Python object, but at the memory allocation/copy layer. In any case, if torch tensor construction is that much slower, shouldn’t torch just call numpy to create the array in memory first by default?


I don’t understand this sentence. Are you suggesting PyTorch should not copy the data if the user is explicitly creating a new tensor, but should internally defer to from_numpy?
In my opinion it would be a breaking change causing a lot of silent issues, since the original numpy array would now be manipulated by inplace ops.
If you want to share the memory, use from_numpy explicitly.

I’m asking why not replace torch.tensor(some_python_iterable) with torch.from_numpy(np.array(some_python_iterable)) if the latter is more performant?

I’m talking about constructing new arrays/tensors. We can discard references to the numpy array since it would just be an intermediate quantity.
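Concretely, I mean something like this hypothetical helper:

import numpy as np
import torch

def tensor_from_list(data, dtype=np.float64):
    # hypothetical helper: let numpy do the list -> array copy,
    # then wrap the result with from_numpy so no second copy happens;
    # the intermediate ndarray is discarded, so nothing else holds a reference to it
    return torch.from_numpy(np.array(data, dtype=dtype))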

Are you asking for use cases where you would not use from_numpy?
One example would be if you don’t want to manipulate the original numpy array via inplace operations. Besides that, you are free to use it.

No, my main question is why is numpy array construction so much faster than tensor construction?

My secondary question is: if it’s faster to construct a numpy array and then use torch.from_numpy(np.array(data)), why not always do that? Inplace operations are irrelevant because all references to the numpy array are discarded immediately; it serves no purpose other than as an intermediate quantity.

In none of your benchmarks are you comparing numpy vs. PyTorch tensor construction.
The first one copies data from numpy to PyTorch, the second one copies data from numpy to numpy (or could be a no-op as I don’t know how numpy handles it internally), and the last one copies metadata around.

Again, you are comparing data copies to metadata manipulation.

Are you sure? The data is in the form of a python list when the calls are being made. As I wrote

    data = np.random.normal(0, 1, (1000, 10)).tolist()  # Use numpy for RNG
    # but convert back to python since we're benchmarking constructor

Are you saying that numpy and PyTorch somehow know that the variable references data that was converted from a numpy array, and can copy from the original memory location?

The data has to be copied otherwise you would be able to manipulate the original data inplace:

import numpy as np
import torch

data = np.random.normal(0, 1, (3,)).tolist()
x = torch.tensor(data, dtype=torch.float64)

print(data)
# [-1.2352113632799868, 1.3052671736570356, -1.344748132474519]
print(x)
# tensor([-1.2352,  1.3053, -1.3447], dtype=torch.float64)

x[0] += 1000.

print(data)
# [-1.2352113632799868, 1.3052671736570356, -1.344748132474519]
print(x)
# tensor([998.7648,   1.3053,  -1.3447], dtype=torch.float64)

If no copy is triggered where would data live?
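You can also see which constructor copies by comparing the underlying pointers (a small check, assuming a float64 array):

import numpy as np
import torch

arr = np.random.normal(0, 1, (3,))
shared = torch.from_numpy(arr)
copied = torch.tensor(arr)

print(shared.data_ptr() == arr.ctypes.data)  # True -- same buffer
print(copied.data_ptr() == arr.ctypes.data)  # False -- torch.tensor made a copy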

No, I meant to say I don’t believe you are profiling the right workload based on your claims.
If you want to profile the actual generation of random values, you should profile np.random.randn vs. torch.randn; PyTorch is faster on my system:

3.3374714851379394e-05 # torch
0.00013933465480804442 # numpy

On the other hand you are seeing that reusing the same data via from_numpy is faster than copying data into a new tensor, which I would expect to see.
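Roughly timed like this (a sketch; exact numbers will vary by machine):

import time
import numpy as np
import torch

iterations = 10000

t1 = time.time()
for _ in range(iterations):
    torch.randn(1000, 10)
print((time.time() - t1) / iterations)  # torch

t1 = time.time()
for _ in range(iterations):
    np.random.randn(1000, 10)
print((time.time() - t1) / iterations)  # numpy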

The data has to be copied otherwise you would be able to manipulate the original data inplace:

If you want to profile the actual generation of random values, you should profile np.random.randn vs. torch.randn; PyTorch is faster on my system:

I didn’t mean to suggest that the data isn’t being copied. I’m also not trying to benchmark random number generation. I was responding to this

The first one copies data from numpy to PyTorch, the second one copies data from numpy to numpy

What I am trying to benchmark is

copying data from native python lists into a pytorch tensor

vs

copying data from native python lists into a numpy array

vs

copying data from native python lists into a numpy array, then constructing a tensor that references the same block of memory
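i.e., roughly (a sketch with the same shapes and dtypes as my benchmarks above):

import numpy as np
import torch

data = np.random.normal(0, 1, (1000, 10)).tolist()  # plain python lists

# 1) python list -> pytorch tensor (copies the data)
a = torch.tensor(data, dtype=torch.float64)

# 2) python list -> numpy array (copies the data)
b = np.array(data, dtype=np.float64)

# 3) python list -> numpy array (copies), then a tensor wrapping that array's memory
c = torch.from_numpy(np.array(data, dtype=np.float64))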

The reason I’m benchmarking this is to minimize the latency impact of my python data preprocessing step during inference*.

On the other hand you are seeing that reusing the same data via from_numpy is faster than copying data into a new tensor, which I would expect to see.

What’s confusing to me is that doing both steps, creating the numpy array from a python list and then reusing it in a tensor, ends up cheaper than just constructing the tensor from the list directly.

    t1 = time.time()
    thing = torch.from_numpy(np.array(data, dtype=np.float64))
    total_time += time.time() - t1

To reiterate, this last benchmark includes both the time to construct the numpy array from the list and also the time to construct the pytorch tensor on top, but is faster than just constructing the tensor directly.
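If I wanted to pin down where the time actually goes, I could split the two steps inside the loop (untested sketch; np_time and wrap_time are accumulators initialized to 0.0 before the loop):

    t1 = time.time()
    arr = np.array(data, dtype=np.float64)  # python list -> numpy array (the actual copy)
    np_time += time.time() - t1

    t1 = time.time()
    thing = torch.from_numpy(arr)  # wrap the array, no data copy
    wrap_time += time.time() - t1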

*Of course, if I were really serious about latency, I’d do preprocessing in a fast language and interact with the model via TorchScript, but for various reasons, I can’t do that in my setting.