Copying CPU tensor on AMD CPU uses up all the CPU

Hi all,

I found something strange: pin_memory() and clone() somehow use a huge amount of CPU time. It looks like the tensor copy operation is very inefficient on AMD CPUs. If I force it to use fewer threads with OMP_NUM_THREADS=1, CPU usage comes down.
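(For reference, the environment variable has to be set before the interpreter starts; torch.set_num_threads() should be the runtime equivalent, as far as I understand. The script name below is just a placeholder.)

# Set before Python starts, e.g.:
#   OMP_NUM_THREADS=1 python benchmark.py
# or limit PyTorch's intra-op thread pool at runtime:
import torch
torch.set_num_threads(1)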

Here’s an example using pin_memory().

import os, threading
import numpy as np
import torch

use_pin = True
run_on_separate_thread = False

X = torch.zeros(1).to('cuda')  # Initialize CUDA.

def run():
    S = torch.zeros(1).to('cuda')
    for _ in range(1000):
        A = np.zeros(4 * 1024 * 1024, dtype=np.float32)
        B = torch.from_numpy(A)
        if use_pin: B = B.pin_memory()       # copy into page-locked (pinned) host memory
        B = B.to('cuda', non_blocking=True)  # H2D copy; only truly asynchronous when B is pinned
        S += B.sum()

    print('Sum = ', S)  # should be zero

usertime0, systime0, _, _, T0 = os.times()

if run_on_separate_thread:
    thr = threading.Thread(target=run)
    thr.start()
    thr.join()
else:
    run()

usertime1, systime1, _, _, T1 = os.times()

print('user %7.3f | sys %7.3f | real %7.3f' %
      (usertime1 - usertime0, systime1 - systime0, T1 - T0))

Interestingly, when I run the code on a separate thread, the problem becomes much worse (I ran each case three times):

If we call run() directly:
    Without pin_memory():
        user   3.520 | sys   0.280 | real   3.820
        user   3.580 | sys   0.220 | real   3.820
        user   3.520 | sys   0.320 | real   3.850
 
    With pin_memory():
        user  30.620 | sys   0.150 | real   3.880
        user  30.720 | sys   0.100 | real   3.890
        user  30.660 | sys   0.170 | real   3.890

    With pin_memory(), OMP_NUM_THREADS=1:
        user   4.050 | sys   0.020 | real   4.090
        user   4.040 | sys   0.020 | real   4.080
        user   4.070 | sys   0.020 | real   4.100

If we call run() on a separate thread:
    Without pin_memory():
        user   3.510 | sys   0.290 | real   3.820
        user   3.550 | sys   0.270 | real   3.840
        user   3.570 | sys   0.240 | real   3.820

    With pin_memory():
        user  40.080 | sys  27.930 | real   4.320
        user  41.980 | sys  31.910 | real   4.760
        user  43.390 | sys  27.080 | real   4.500

    With pin_memory(), OMP_NUM_THREADS=1:
        user   4.100 | sys   0.030 | real   4.140
        user   4.090 | sys   0.020 | real   4.130
        user   4.130 | sys   0.020 | real   4.170

I’m using a Ryzen 1700X, a GTX 1080, Python 3.7.4, and PyTorch 1.3.0 (I also tried building PyTorch from the latest source and got a similar result). Is this one of those cases where the underlying Intel MKL library doesn’t play nicely with AMD CPUs? Or am I doing something wrong?

For now, I think I’ll just skip pin_memory() altogether (and be careful never to copy a tensor from CPU to CPU), but it would be nice if a fix were available. In my code, pin_memory() was using so much CPU that it became the bottleneck, even though everything after it was running on CUDA.
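Concretely, the workaround is just the use_pin = False path from the script above: keep the source tensor in ordinary pageable memory and do a plain blocking copy.

import numpy as np
import torch

A = np.zeros(4 * 1024 * 1024, dtype=np.float32)
B = torch.from_numpy(A).to('cuda')  # pageable source, so the copy is synchronous; non_blocking=True would not help here anyway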


We had a similar issue in the past, which should be fixed with this PR.
Unfortunately, I don’t have an AMD CPU for testing, but I can run your script on an Intel one.

Hi, thanks for the info!

However, I still see the slowdown even with the latest version. On closer inspection, it looks like PR #25111 (the one you mentioned) is concerned with torch.utils.data and not the tensor API?

…So maybe the solution could be to add similar logic to torch.Tensor.pin_memory()?
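In the meantime, a user-level approximation of that idea might look like the sketch below (a hypothetical helper, not the actual fix; it only assumes the extra CPU time comes from the intra-op thread pool):

import torch

def pin_memory_single_threaded(t):
    # Temporarily limit PyTorch's intra-op threads around pin_memory(),
    # then restore the previous setting.
    prev = torch.get_num_threads()
    torch.set_num_threads(1)
    try:
        return t.pin_memory()
    finally:
        torch.set_num_threads(prev)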


That’s a good catch! Let me reproduce this issue and have a look at it.