We encountered a performance problem in our code today. After some debugging, it turned out to be related to the NumPy array passed to PyTorch's as_tensor function. If we make a copy of the NumPy array before passing it in, the code runs much faster than when passing the original array.
Here is a short Python script that demonstrates the behavior:
import torch
import numpy as np
from time import time

for i in range(100):
    # create image (numpy array)
    s = 1000
    x = np.zeros([s, s, 3])
    # crop image
    x = x[:, 1:-1]
    # create batch
    x = x[None].transpose(0, 3, 1, 2)
    # every 2nd iteration, create copy of numpy array
    create_copy = bool(i % 2)
    if create_copy:
        x = x.copy()
    # copy to gpu, apply some op, copy back
    t0 = time()
    x = torch.as_tensor(x).to('cuda')
    x = x + 1
    x = x.to('cpu')
    t1 = time()
    print(f"{'copy' if create_copy else 'no copy'}: {1000*(t1-t0):.1f}ms")
The output looks something like this; the iterations that make a copy are consistently much faster:
...
no copy: 23.9ms
copy: 13.9ms
...
While creating the copy seems to work around the issue, I would still be interested in what is going on here. I could not locate the cause. Is it due to the view created by the slice operator? Or due to the non-contiguous memory?
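For what it's worth, the slicing and transpose steps above do leave the array non-contiguous, which can be verified directly via NumPy's memory-layout flags. A minimal check (reusing the same array construction as the script above):

```python
import numpy as np

# Rebuild the array exactly as in the script above.
s = 1000
x = np.zeros([s, s, 3])
x = x[:, 1:-1]                     # slicing returns a view, not a copy
x = x[None].transpose(0, 3, 1, 2)  # transpose only permutes the strides

# The view is non-contiguous; copy() rewrites the data into contiguous memory.
print(x.flags['C_CONTIGUOUS'])         # False
print(x.copy().flags['C_CONTIGUOUS'])  # True
```

So both suspicions point at the same thing: the slice creates a view, and together with the transpose the result is a non-contiguous array, whereas `x.copy()` materializes it into one contiguous buffer before it reaches `torch.as_tensor`.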