Is torch.tensor(x, device='cuda') blocking?

During a profiling session, I got the impression that torch.tensor(x, device='cuda') is blocking. That is, the Python script only carries on once the data has actually arrived on the GPU. If an experiment launches many small kernels, it can then be difficult to refill the GPU queue and bring GPU utilization back up.

If torch.tensor is indeed blocking, one should instead use torch.tensor(x, pin_memory=True).to('cuda', non_blocking=True) to avoid the synchronization point.
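
Concretely, this is the pattern I have in mind (just a minimal sketch of what I would try, not something I have verified to help):

import numpy as np
import torch

x = np.random.randn(1000, 1000).astype(np.float32)   # host-side data, just for illustration

# what I suspect is blocking: the host waits until the data has arrived on the GPU
a = torch.tensor(x, device='cuda')

# what I would use instead: stage the data in pinned (page-locked) host memory,
# then copy it to the GPU asynchronously with respect to the host
b = torch.tensor(x, pin_memory=True).to('cuda', non_blocking=True)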

I have not been able to reproduce this behaviour in a toy example, and I don’t know where to look in the PyTorch code. If someone savvy could provide some clarification, that would be great.


Hi Nicolas!

This is to be expected (at least by me). If x resides on the cpu, then the
cpu will be busy and “block” until it is done with its part of copying the data
to the gpu.

But I believe that torch.tensor() is non-blocking in the sense that it will
return before the gpu is done with whatever work it has to do. However,
this non-blocking behavior is hidden because the cpu data-transfer time is
the bottleneck compared to any pure gpu work (at least on my test system).

The non-blocking behavior becomes apparent when x itself resides on the
gpu (or presumably on a second gpu).

Here is a timing script:

import torch
print (torch.__version__)
print (torch.version.cuda)
print (torch.cuda.get_device_name())

import time

import warnings
warnings.filterwarnings ('ignore')   # clean up output

for  source_device in ('cpu', 'cuda'):
    print ('source_device:', source_device)
    
    for  iMeg in (100, 200, 400, 800):
        if  source_device == 'cuda' and iMeg > 400:   # avoid out of memory
            break
        t_source = torch.randn (iMeg, 1000, 1000, device = source_device)
        
        # warmup
        tc0 = torch.tensor (t_source, device = 'cuda')
        tc0 = None
        
        torch.cuda.synchronize()   # make sure gpu is ready
        t0 = time.time()
        tc1 = torch.tensor (t_source, device = 'cuda')
        t1 = time.time()
        print ('iMeg:', iMeg, '  t_nosync: ', t1 - t0)
        
        torch.cuda.synchronize()
        tc1 = None
        
        torch.cuda.synchronize()   # make sure gpu is ready
        t0 = time.time()
        tc2 = torch.tensor (t_source, device = 'cuda')
        torch.cuda.synchronize()   # wait for torch.tensor() to actually finish
        t1 = time.time()
        print ('iMeg:', iMeg, '  t_sync:   ', t1 - t0)
        
        tc2 = None

        t_source = None

And here is its output:

2.0.1
11.8
GeForce GTX 1050 Ti
source_device: cpu
iMeg: 100   t_nosync:  0.06376481056213379
iMeg: 100   t_sync:    0.06362557411193848
iMeg: 200   t_nosync:  0.12702536582946777
iMeg: 200   t_sync:    0.1270275115966797
iMeg: 400   t_nosync:  0.2543323040008545
iMeg: 400   t_sync:    0.25426554679870605
iMeg: 800   t_nosync:  0.5085372924804688
iMeg: 800   t_sync:    0.5087063312530518
source_device: cuda
iMeg: 100   t_nosync:  3.981590270996094e-05
iMeg: 100   t_sync:    0.008202552795410156
iMeg: 200   t_nosync:  3.6716461181640625e-05
iMeg: 200   t_sync:    0.016355037689208984
iMeg: 400   t_nosync:  3.719329833984375e-05
iMeg: 400   t_sync:    0.03267312049865723

The fact that the gpu → gpu “nosync” timings are much shorter than the
analogous “sync” timings shows that torch.tensor (x, device = 'cuda')
returns asynchronously when x resides on the gpu.
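
As an aside, one could also bracket the gpu → gpu copy with cuda events, which time the work on the gpu stream itself rather than on the host. A minimal sketch (not part of the run above):

import torch

t_source = torch.randn (100, 1000, 1000, device = 'cuda')

start = torch.cuda.Event (enable_timing = True)
end = torch.cuda.Event (enable_timing = True)

torch.cuda.synchronize()
start.record()                                    # marks the start on the current stream
tc = torch.tensor (t_source, device = 'cuda')     # the gpu -> gpu copy (raises the same UserWarning)
end.record()                                      # marks the end on the current stream
torch.cuda.synchronize()                          # wait so that elapsed_time() is valid
print ('gpu -> gpu copy time (ms):', start.elapsed_time (end))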

Best.

K. Frank

“the cpu data-transfer time is the bottleneck compared to any pure gpu work (at least on my test system).”

That’s exactly my problem: it did not seem to be the case on a compute node with an A100 and a fairly large model, but just like you, I cannot reproduce it locally.
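
For reference, this is roughly the check I plan to run on the compute node (a sketch using torch.profiler; the tensor size is a placeholder):

import numpy as np
import torch
from torch.profiler import profile, ProfilerActivity

x = np.random.randn(64, 3, 224, 224).astype(np.float32)   # placeholder host batch

# warm up the CUDA context so that first-call overhead does not dominate
torch.tensor(x, device='cuda')
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        t = torch.tensor(x, device='cuda')
    torch.cuda.synchronize()

# if torch.tensor() really blocks, the cpu-side time of the copy should be
# roughly as large as its cuda-side time in this table
print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=10))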