Hi all,

I found something strange: pin_memory() and clone() somehow use a huge amount of CPU time. I suspect the tensor copy operation is very inefficient on AMD CPUs. If I force it to use fewer threads with OMP_NUM_THREADS=1, CPU usage comes back down.

Here's an example using pin_memory():
import os, time, threading
import numpy as np
import torch

use_pin = True
run_on_separate_thread = False

X = torch.zeros(1).to('cuda')  # Initialize CUDA.

def run():
    S = torch.zeros(1).to('cuda')
    for _ in range(1000):
        A = np.zeros(4 * 1024 * 1024, dtype=np.float32)
        B = torch.from_numpy(A)
        if use_pin: B = B.pin_memory()
        B = B.to('cuda', non_blocking=True)
        S += B.sum()
    print('Sum = ', S)  # should be zero

usertime0, systime0, _, _, T0 = os.times()
if run_on_separate_thread:
    thr = threading.Thread(target=run)
    thr.start()
    thr.join()
else:
    run()
usertime1, systime1, _, _, T1 = os.times()
print('user %7.3f | sys %7.3f | real %7.3f' %
      (usertime1 - usertime0, systime1 - systime0, T1 - T0))
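(Aside for anyone reusing the measurement: the user-vs-real comparison above generalizes into a small stdlib-only helper. The `cpu_times` name is mine, not part of any library. Because os.times() sums CPU time across all threads, `user` can legitimately exceed `real` when several cores are busy, which is exactly the signature of the spinning in the numbers below.)

```python
import os

def cpu_times(fn):
    """Run fn() and return (user, sys, real) seconds it consumed.

    os.times() reports CPU time accumulated over ALL threads of the
    process, so a user figure several times larger than real means
    multiple cores were busy-working (or busy-waiting) in parallel.
    """
    u0, s0, _, _, t0 = os.times()
    fn()
    u1, s1, _, _, t1 = os.times()
    return u1 - u0, s1 - s0, t1 - t0

user, sys_t, real = cpu_times(lambda: sum(i * i for i in range(10**6)))
print('user %7.3f | sys %7.3f | real %7.3f' % (user, sys_t, real))
```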
Interestingly, when I run the code in a separate thread, the problem somehow becomes much worse (I ran each case three times):
If we call run() directly:
Without pin_memory():
user 3.520 | sys 0.280 | real 3.820
user 3.580 | sys 0.220 | real 3.820
user 3.520 | sys 0.320 | real 3.850
With pin_memory():
user 30.620 | sys 0.150 | real 3.880
user 30.720 | sys 0.100 | real 3.890
user 30.660 | sys 0.170 | real 3.890
With pin_memory(), OMP_NUM_THREADS=1:
user 4.050 | sys 0.020 | real 4.090
user 4.040 | sys 0.020 | real 4.080
user 4.070 | sys 0.020 | real 4.100
If we call run() on a separate thread:
Without pin_memory():
user 3.510 | sys 0.290 | real 3.820
user 3.550 | sys 0.270 | real 3.840
user 3.570 | sys 0.240 | real 3.820
With pin_memory():
user 40.080 | sys 27.930 | real 4.320
user 41.980 | sys 31.910 | real 4.760
user 43.390 | sys 27.080 | real 4.500
With pin_memory(), OMP_NUM_THREADS=1:
user 4.100 | sys 0.030 | real 4.140
user 4.090 | sys 0.020 | real 4.130
user 4.130 | sys 0.020 | real 4.170
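(To reproduce the OMP_NUM_THREADS=1 runs: as far as I know, the OpenMP runtime reads this variable when it initializes, so it must be in the environment before Python starts; exporting it from the launching shell is the reliable way.)

```shell
# Export before starting Python; setting the variable from inside an
# interpreter whose OpenMP runtime has already initialized typically
# has no effect on the existing thread pool.
export OMP_NUM_THREADS=1
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```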
I’m using a Ryzen 1700X, a GTX 1080, Python 3.7.4, and PyTorch 1.3.0 (I also tried building PyTorch from the latest source and got similar results). Is this one of those cases where the underlying Intel MKL library doesn’t play nicely with AMD CPUs? Or am I doing something wrong?
For now, I think I’ll just skip pin_memory() altogether (and be careful never to copy tensors from CPU to CPU), but it would be nice if a fix became available. In my code, it was using so much CPU that it became the bottleneck, even though everything after that point was running on CUDA.