How could I copy data from CPU to GPU asynchronously?

Hi,

My test code is like this:

import time

import torch

# warm-up copies so CUDA context creation is not measured below
a = torch.randn(16, 3, 16, 16)
a1 = a.cuda(1)
a2 = a.cuda(2)
a3 = a.cuda(3)
a4 = a.cuda(4)
torch.cuda.synchronize()

a = torch.randn(16, 3, 1204, 1024)
t1 = time.time()
a1 = a.cuda(1)
a2 = a.cuda(2)
a3 = a.cuda(3)
a4 = a.cuda(4)
t2 = time.time()
torch.cuda.synchronize()
print(t2 - t1)

a = torch.randn(16, 3, 1204, 1024)
t3 = time.time()
a1 = a.cuda(1)
t4 = time.time()
torch.cuda.synchronize()
print(t4 - t3)

The results are:

0.18216228485107422
0.03782033920288086

This means that copying one CPU tensor to a single GPU is faster than copying the same tensor to multiple GPUs; the copies seem to run one after another. How could I make the copies run in parallel?

You would have to synchronize before starting and before stopping the timer, while your current code stops the timer first and synchronizes only afterwards.
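For reference, a minimal sketch of that timing pattern (my own example, reusing the devices and tensor shape from the snippet above, not code from the original reply): synchronize every device before starting the timer, launch the copies, synchronize again, and only then stop the timer.

import time

import torch

a = torch.randn(16, 3, 1204, 1024)

# drain any previously queued work before starting the timer
for dev in (1, 2, 3, 4):
    torch.cuda.synchronize(dev)

t1 = time.time()
a1 = a.cuda(1)
a2 = a.cuda(2)
a3 = a.cuda(3)
a4 = a.cuda(4)

# make sure the copies have actually finished before stopping the timer
for dev in (1, 2, 3, 4):
    torch.cuda.synchronize(dev)
t2 = time.time()
print(t2 - t1)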

Hi,

Will this line, torch.cuda.synchronize(), work for the synchronization?

Yes, this will synchronize the default device.
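As a small side note (my own sketch, not part of the original reply): torch.cuda.synchronize() without arguments only waits for the current (default) device, so to wait for all of the GPUs used above you would synchronize each one by passing a device index.

import torch

torch.cuda.synchronize()     # waits for work queued on the current (default) device
torch.cuda.synchronize(1)    # waits for work queued on cuda:1 only

# wait for all four GPUs used in the snippets above
for dev in (1, 2, 3, 4):
    torch.cuda.synchronize(dev)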

So what is the most efficient way to copy one tensor to multiple GPUs in parallel?

The copy code looks alright, but your profiling code is still wrong, since you are stopping the timer without a preceding synchronization. Did you create new profiles and check the time?
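As a sketch of what an asynchronous copy could look like (my own example, not from the original reply): a host-to-device copy can only run asynchronously with respect to the CPU if the source tensor lives in pinned (page-locked) memory, so the host tensor is pinned before the non-blocking copies are issued.

import torch

# pinned (page-locked) host memory is required for a host-to-device copy
# to be asynchronous with respect to the CPU
a = torch.randn(16, 3, 1204, 1024).pin_memory()

# non_blocking=True queues each copy and returns immediately,
# so the transfers to the four devices are issued without waiting for each other
a1 = a.cuda(1, non_blocking=True)
a2 = a.cuda(2, non_blocking=True)
a3 = a.cuda(3, non_blocking=True)
a4 = a.cuda(4, non_blocking=True)

# synchronize each device before using the results (or before stopping a timer)
for dev in (1, 2, 3, 4):
    torch.cuda.synchronize(dev)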

Hi,

I changed my code like this:

import time
from concurrent.futures import ThreadPoolExecutor

import torch

executor = ThreadPoolExecutor(max_workers=8)


a = torch.randn(16, 3, 16, 16)
a1 = a.cuda(1)
a2 = a.cuda(2)
a3 = a.cuda(3)
a4 = a.cuda(4)

a = torch.randn(16, 3, 1204, 1024)
def move(device_id):
    # each worker thread copies the shared tensor to its own device
    # and waits for that device to finish
    data = a.cuda(device_id)
    torch.cuda.synchronize(device_id)
    return data

torch.cuda.synchronize(1)
torch.cuda.synchronize(2)
torch.cuda.synchronize(3)
torch.cuda.synchronize(4)
t1 = time.time()
#  a1 = a.cuda(1)
#  a2 = a1.cuda(2)
#  a3 = a1.cuda(3)
#  a4 = a1.cuda(4)
#  a1 = a.cuda(1, non_blocking=True)
#  a2 = a.cuda(2, non_blocking=True)
#  a3 = a.cuda(3, non_blocking=True)
#  a4 = a.cuda(4, non_blocking=True)
a1, a2, a3, a4 = list(executor.map(move, [1, 2, 3, 4]))
torch.cuda.synchronize(1)
torch.cuda.synchronize(2)
torch.cuda.synchronize(3)
torch.cuda.synchronize(4)
t2 = time.time()
print(t2 - t1)

print(a1.device)
print(a2.device)
print(a3.device)
print(a4.device)

a = torch.randn(16, 3, 1204, 1024)
t3 = time.time()
a1 = a.cuda(1)
torch.cuda.synchronize(1)
t4 = time.time()
print(t4 - t3)

And the output is:

0.06666159629821777
cuda:1
cuda:2
cuda:3
cuda:4
0.02723860740661621

It seems that the operation still does not run in parallel. How could I refine this, please?