x = torch.randn(1000, 1000)
y = x ** 2
Can the **2 operation be sped up by running it on the GPU?
And what about other similar basic Python operations, such as x[y > 0.2, :]?
Yes. But if you only want to perform a single, simple operation on the GPU, you will lose the speed advantage to the cost of transferring the data between host and device.
In [1]: import torch
In [2]: x = torch.randn(1000, 1000)
In [3]: %timeit x**2
7.33 ms ± 1.46 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [4]: x_gpu = x.to(torch.device('cuda:1'))
In [5]: %timeit x_gpu**2
48.8 µs ± 591 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
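To make the transfer cost visible, here is a minimal sketch that times the whole round trip, including the copies to and from the GPU (the device name cuda:0 is an assumption; substitute whichever device you have):

import time
import torch

x = torch.randn(1000, 1000)
device = torch.device('cuda:0')

_ = (x.to(device) ** 2).cpu()  # warm-up, so CUDA init is not timed
torch.cuda.synchronize()

start = time.perf_counter()
for _ in range(100):
    y = (x.to(device) ** 2).cpu()  # copy in, square, copy back
torch.cuda.synchronize()
print('%.3f ms per round trip' % ((time.perf_counter() - start) / 100 * 1e3))

For an operation this cheap, the copies typically dominate, which is where the speed advantage goes.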
EDIT: And if you just care about simple linear algebra operations on the CPU, I would recommend using NumPy over PyTorch.
In [6]: import numpy as np
In [7]: x_np = np.array(x)
In [8]: %timeit x_np**2
364 µs ± 9.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
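Side note: np.array(x) above makes a full copy of the tensor's data. For CPU tensors, x.numpy() and torch.from_numpy() share memory with the tensor instead, so the conversion itself is essentially free. A small sketch:

import numpy as np
import torch

x = torch.randn(1000, 1000)
x_np = x.numpy()        # shares memory with x, no copy
x_np[0, 0] = 42.0
print(x[0, 0])          # prints tensor(42.) -- same underlying buffer
x_back = torch.from_numpy(x_np)  # zero-copy in the other direction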
Three quick comments:
- You absolutely need to add torch.cuda.synchronize() to time CUDA ops.
- While NumPy does have an edge over PyTorch on the CPU for some operations, in this specific example I’d imagine that you have NumPy compiled with a better BLAS than PyTorch or something. For me, the timings are very similar, with PyTorch having an edge.
- Gross slowness of PyTorch relative to NumPy is probably worth filing as a bug report. (My personal limit is a factor of 5-10 or so, but others are more ambitious.)
Best regards
Thomas
- You absolutely need to add torch.cuda.synchronize() to time CUDA ops.
Good point. I actually get exactly the same results, though, when I run it as
%timeit x_gpu**2; torch.cuda.synchronize()
Probably a coincidence.
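For what it’s worth, a more robust way to time GPU kernels is with CUDA events, which sidesteps the question of where the host-side synchronize has to go. A sketch (assuming a CUDA device is available):

import torch

x_gpu = torch.randn(1000, 1000, device='cuda')
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()  # make sure nothing else is still in flight
start.record()
for _ in range(10000):
    y = x_gpu ** 2
end.record()
torch.cuda.synchronize()  # wait until both events have been recorded

ms_total = start.elapsed_time(end)  # milliseconds between the two events
print('%.1f us per loop' % (ms_total / 10000 * 1e3))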
I’d imagine that you have NumPy compiled with a better BLAS than PyTorch or something.
Could be. I actually just switched over to using conda for PyTorch vs. compiling it myself, as I found the conda build is ~20% faster for some reason. The conda builds use blas 1.0 and mkl 2019.1, as far as I can see from what’s being downloaded and installed. (Installing pytorch via conda from the pytorch channel will also install NumPy automatically; I’m not sure whether they are built against the same blas and mkl versions, though, as NumPy is probably fetched from conda’s main repo whereas the PyTorch binaries may have been compiled against something different.)
conda install pytorch torchvision cuda100 -c pytorch
Downloading and Extracting Packages
mkl_fft-1.0.6 | 196 KB | ##################################### | 100%
jpeg-9b | 248 KB | ##################################### | 100%
cryptography-2.4.2 | 607 KB | ##################################### | 100%
libgfortran-ng-7.3.0 | 1.3 MB | ##################################### | 100%
sqlite-3.26.0 | 1.9 MB | ##################################### | 100%
conda-4.5.12 | 1.0 MB | ##################################### | 100%
numpy-base-1.15.4 | 4.2 MB | ##################################### | 100%
blas-1.0 | 6 KB | ##################################### | 100%
mkl-2019.1 | 204.6 MB | ##################################### | 100%
ninja-1.8.2 | 1.3 MB | ##################################### | 100%
numpy-1.15.4 | 47 KB | ##################################### | 100%
cuda100-1.0 | 2 KB | ##################################### | 100%
certifi-2018.11.29 | 146 KB | ##################################### | 100%
freetype-2.9.1 | 822 KB | ##################################### | 100%
openssl-1.1.1a | 5.0 MB | ##################################### | 100%
torchvision-0.2.1 | 37 KB | ##################################### | 100%
libtiff-4.0.9 | 567 KB | ##################################### | 100%
python-3.7.1 | 36.4 MB | ##################################### | 100%
libpng-1.6.35 | 335 KB | ##################################### | 100%
olefile-0.46 | 48 KB | ##################################### | 100%
intel-openmp-2019.1 | 885 KB | ##################################### | 100%
pytorch-1.0.0 | 657.4 MB | ##################################### | 100%
pillow-5.3.0 | 595 KB | ##################################### | 100%
mkl_random-1.0.2 | 405 KB | ##################################### | 100%
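One way to check what each library was actually built against is to print their build configurations. A sketch (np.show_config() is a long-standing NumPy API; torch.__config__.show() may not exist on older PyTorch builds, hence the guard):

import numpy as np
import torch

np.show_config()  # NumPy's build-time BLAS/LAPACK configuration

# PyTorch's equivalent; guarded, since the helper may be missing on
# older builds.
if hasattr(torch, '__config__'):
    print(torch.__config__.show())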
PS: I don’t think PyTorch is slower than a regular NumPy build; rather, I suspect the conda team does some Intel-specific optimizations, so this particular conda NumPy build may simply be faster.