How can I reduce the runtime of tensor.cpu()?

I’m running prediction over 5 documents and checking the profiling output. Each document may have 400+ pages. In total I have around 5,500 documents to process, and the prediction pipeline takes several hours to execute.

CPU RAM → more than 100 GB available
GPU → 2 NVIDIA Tesla V100s, 32 GB memory each.

Even for 5 documents, it takes 520+ seconds to execute.

Here is the profiling output:

205403359 function calls (192726646 primitive calls) in 525.353 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      257  300.185    1.168  300.185    1.168 {method 'cpu' of 'torch._C._TensorBase' objects}
        5   22.183    4.437   25.969    5.194 base.py:867(dump)
       27   22.127    0.820   22.127    0.820 {built-in method gc.collect}
       34   14.393    0.423   14.406    0.424 {built-in method _pickle.load}
     1067   14.166    0.013   14.166    0.013 {method 'normal_' of 'torch._C._TensorBase' objects}
     5722   11.909    0.002   12.199    0.002 <frozen importlib._bootstrap_external>:914(get_data)
    69413   10.954    0.000   10.954    0.000 {built-in method posix.stat}
     1766    9.109    0.005    9.109    0.005 {method 'uniform_' of 'torch._C._TensorBase' objects}
      386    7.326    0.019    7.326    0.019 {built-in method numpy.concatenate}
     1989    6.604    0.003    6.604    0.003 {method 'cuda' of 'torch._C._TensorBase' objects}
   145759    6.180    0.000    6.180    0.000 {method 'findall' of 're.Pattern' objects}
 25863123    5.718    0.000   13.354    0.000 {built-in method builtins.isinstance}
     2384    5.379    0.002    5.379    0.002 {method 'copy_' of 'torch._C._TensorBase' objects}
    11924    5.325    0.000    5.325    0.000 {built-in method tensor}
11243096/72898    4.779    0.000   21.038    0.000 mixins.py:114(_build)
      950    3.833    0.004    3.835    0.004 {built-in method io.open}
 22511107    3.731    0.000    4.914    0.000 {built-in method _abc._abc_instancecheck}
        5    3.572    0.714    3.572    0.714 {built-in method _pickle.dump}
     2383    3.346    0.001    3.346    0.001 {method '_set_from_file' of 'torch._C.FloatStorageBase' objects}
       83    3.283    0.040    3.283    0.040 {method 'execute' of 'sybpydb.Cursor' objects}
 11206162    3.023    0.000   19.845    0.000 mixins.py:133(<genexpr>)

Is there any way I can speed this up? Especially this one:

{method 'cpu' of 'torch._C._TensorBase' objects}

I tried to use pin_memory=True in the DataLoader, but it didn’t make any difference. I’d really appreciate your suggestions.
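For reference, the pin_memory attempt was roughly along these lines (a simplified sketch; the dataset, model, and batch size here are placeholders, not the actual pipeline):

import torch
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=8, num_workers=4, pin_memory=True)

for batch in loader:
    # pinned memory only speeds up the host-to-device copy here
    batch = batch.cuda(non_blocking=True)
    with torch.no_grad():
        out = model(batch)
    # this device-to-host copy is the .cpu() call showing up in the profile
    preds = out.cpu()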

CUDA operations are executed asynchronously. If you don’t synchronize the code, blocking operations (such as the .cpu() call) can accumulate the execution time of all previously queued CUDA kernels, which is why .cpu() looks so expensive in your profile.
To get a better picture I would recommend taking a look at the timeline via the PyTorch Profiler or Nsight Systems.
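As a minimal sketch (assuming a model and an input batch that already live on the GPU; the names below are placeholders, not taken from your pipeline), the PyTorch Profiler can be used like this:

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        out = model(batch)      # placeholder for the prediction step
        preds = out.cpu()       # the device-to-host copy seen in the cProfile output

# Accumulated view, sorted by the time actually spent on the GPU
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

# Timeline view, can be opened in chrome://tracing or Perfetto
prof.export_chrome_trace("trace.json")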

Thank you! I’m trying to run the PyTorch Profiler as you suggested.
In the meantime, can you give me more insight into where I should synchronize the code (is it where I create the DataLoader?)? I’m quite lost there.

Generally, if you want to profile CUDA operations, you have to synchronize the code before starting and before stopping the timer, as seen here:

import time
import torch

torch.cuda.synchronize()   # wait for all previously queued GPU work to finish
t0 = time.time()

out = gpu_op(input)        # the GPU operation you want to time

torch.cuda.synchronize()   # make sure gpu_op has actually finished before stopping the timer
t1 = time.time()
print(f"elapsed: {t1 - t0:.6f} s")

Also, warmup iterations should be added, and to stabilize the timings you should average the time spent in this operation over multiple iterations, as in the sketch below.
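A rough sketch of that pattern (gpu_op and input are the same placeholders as above):

import time
import torch

# Warmup: lets CUDA finish lazy initialization and kernel caching
for _ in range(10):
    out = gpu_op(input)

torch.cuda.synchronize()
t0 = time.time()

n_iters = 100
for _ in range(n_iters):
    out = gpu_op(input)

torch.cuda.synchronize()
print(f"avg time per iteration: {(time.time() - t0) / n_iters:.6f} s")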

With that being said, if you want to profile a specific op, you could use torch.utils.benchmark, which already performs these steps for you (warmup, syncs, etc.).
I personally prefer to look at a timeline when profiling entire scripts, as I usually cannot easily spot the bottleneck in accumulated timings.
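For example, a minimal sketch benchmarking a device-to-host copy of a random CUDA tensor (the tensor and its shape are just placeholders standing in for the model output):

import torch
import torch.utils.benchmark as benchmark

x = torch.randn(1024, 1024, device="cuda")  # placeholder tensor

timer = benchmark.Timer(
    stmt="x.cpu()",
    globals={"x": x},
)

# blocked_autorange handles warmup and CUDA synchronization, and picks the number of runs
print(timer.blocked_autorange())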

Thank you for the response! I’m exploring this tutorial for more in-depth information. I will post an update as soon as I have some results.