How could I copy data from CPU to GPU asynchronously?

Hi,

My test code is like this:

import time

import torch

# warm-up copies so CUDA context creation is not measured below
a = torch.randn(16, 3, 16, 16)
a1 = a.cuda(1)
a2 = a.cuda(2)
a3 = a.cuda(3)
a4 = a.cuda(4)
torch.cuda.synchronize()

a = torch.randn(16, 3, 1204, 1024)
t1 = time.time()
a1 = a.cuda(1)
a2 = a.cuda(2)
a3 = a.cuda(3)
a4 = a.cuda(4)
t2 = time.time()
torch.cuda.synchronize()
print(t2 - t1)

a = torch.randn(16, 3, 1204, 1024)
t3 = time.time()
a1 = a.cuda(1)
t4 = time.time()
torch.cuda.synchronize()
print(t4 - t3)

The results are:

0.18216228485107422
0.03782033920288086

This means that copying one CPU tensor to a single GPU is faster than copying the same tensor to multiple GPUs; the copies seem to run one after another. How could I make the copies run in parallel?

You would have to synchronize before starting and before stopping the timer, while your current code stops the timer first and synchronizes only afterwards.
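For reference, a minimal sketch of that timing pattern (my own example, reusing the devices and tensor shape from the snippet above, not code from the original reply): synchronize every device before starting the timer, launch the copies, synchronize again, and only then stop the timer.

import time

import torch

a = torch.randn(16, 3, 1204, 1024)

# drain any previously queued work before starting the timer
for dev in (1, 2, 3, 4):
    torch.cuda.synchronize(dev)

t1 = time.time()
a1 = a.cuda(1)
a2 = a.cuda(2)
a3 = a.cuda(3)
a4 = a.cuda(4)

# make sure the copies have actually finished before stopping the timer
for dev in (1, 2, 3, 4):
    torch.cuda.synchronize(dev)
t2 = time.time()
print(t2 - t1)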

Hi,

Will this line, torch.cuda.synchronize(), work for the synchronization?

Yes, this will synchronize the default device.
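As a small side note (my own sketch, not part of the original reply): torch.cuda.synchronize() without arguments only waits for the current (default) device, so to wait for all of the GPUs used above you would synchronize each one by passing a device index.

import torch

torch.cuda.synchronize()     # waits for work queued on the current (default) device
torch.cuda.synchronize(1)    # waits for work queued on cuda:1 only

# wait for all four GPUs used in the snippets above
for dev in (1, 2, 3, 4):
    torch.cuda.synchronize(dev)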

So what is the most efficient way to copy one tensor to multiple GPUs in parallel?

The copy code looks alright, but your profiling code is still wrong, since you are stopping the timer without a preceding synchronization. Did you create new profiles and check the time?
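As a sketch of what an asynchronous copy could look like (my own example, not from the original reply): a host-to-device copy can only run asynchronously with respect to the CPU if the source tensor lives in pinned (page-locked) memory, so the host tensor is pinned before the non-blocking copies are issued.

import torch

# pinned (page-locked) host memory is required for a host-to-device copy
# to be asynchronous with respect to the CPU
a = torch.randn(16, 3, 1204, 1024).pin_memory()

# non_blocking=True queues each copy and returns immediately,
# so the transfers to the four devices are issued without waiting for each other
a1 = a.cuda(1, non_blocking=True)
a2 = a.cuda(2, non_blocking=True)
a3 = a.cuda(3, non_blocking=True)
a4 = a.cuda(4, non_blocking=True)

# synchronize each device before using the results (or before stopping a timer)
for dev in (1, 2, 3, 4):
    torch.cuda.synchronize(dev)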

Hi,

I changed my code like this:

import time
from concurrent.futures import ThreadPoolExecutor

import torch

executor = ThreadPoolExecutor(max_workers=8)


a = torch.randn(16, 3, 16, 16)
a1 = a.cuda(1)
a2 = a.cuda(2)
a3 = a.cuda(3)
a4 = a.cuda(4)

a = torch.randn(16, 3, 1204, 1024)
def move(device_id):
    # each worker thread copies the shared tensor to its own device
    # and waits for that device to finish
    data = a.cuda(device_id)
    torch.cuda.synchronize(device_id)
    return data

torch.cuda.synchronize(1)
torch.cuda.synchronize(2)
torch.cuda.synchronize(3)
torch.cuda.synchronize(4)
t1 = time.time()
#  a1 = a.cuda(1)
#  a2 = a1.cuda(2)
#  a3 = a1.cuda(3)
#  a4 = a1.cuda(4)
#  a1 = a.cuda(1, non_blocking=True)
#  a2 = a.cuda(2, non_blocking=True)
#  a3 = a.cuda(3, non_blocking=True)
#  a4 = a.cuda(4, non_blocking=True)
a1, a2, a3, a4 = list(executor.map(move, [1, 2, 3, 4]))
torch.cuda.synchronize(1)
torch.cuda.synchronize(2)
torch.cuda.synchronize(3)
torch.cuda.synchronize(4)
t2 = time.time()
print(t2 - t1)

print(a1.device)
print(a2.device)
print(a3.device)
print(a4.device)

a = torch.randn(16, 3, 1204, 1024)
t3 = time.time()
a1 = a.cuda(1)
torch.cuda.synchronize(1)
t4 = time.time()
print(t4 - t3)

And the output is:

0.06666159629821777
cuda:1
cuda:2
cuda:3
cuda:4
0.02723860740661621

It seems that the operation still does not run in parallel. How could I refine this, please?