Using the GPU takes even longer for broadcasting in PyTorch

print("CUDA available: ", torch.cuda.is_available())
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

CUDA available: True

x = torch.tensor([1, 2, 3]).to(device)  # GPU tensor (if CUDA is available)
y = torch.tensor([1, 2, 3])             # CPU tensor
%%timeit -n 10000
z = x + 1

10000 loops, best of 3: 15.6 µs per loop

%%timeit -n 10000
z = y + 1

10000 loops, best of 3: 5.19 µs per loop

Note that CUDA operations are executed asynchronously, so you would need to time the operations in a manual loop and call torch.cuda.synchronize() before starting and before stopping the timer.
Also, the workload is tiny, so the kernel launch time might even be larger than the actual execution time.
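
A minimal sketch of the same measurement using CUDA events instead of wall-clock time, assuming a CUDA-capable device is available (the iteration count and tensor are just illustrative):

import torch

x = torch.tensor([1, 2, 3], device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()   # make sure previously queued work is done
start.record()
for _ in range(10000):
    z = x + 1              # the broadcast op being timed
end.record()
torch.cuda.synchronize()   # wait for the recorded events to complete

print(start.elapsed_time(end) / 10000, "ms per iteration")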

Do you mean something like this?

import time

torch.cuda.synchronize(device=device)  # before starting the timer
begin = time.time()

for i in range(1000000):
    z = x + 1

torch.cuda.synchronize(device=device)  # before ending the timer
end = time.time()

total = (end - begin) / 1000000; total

1.4210415363311767e-05

begin = time.time()

for i in range(1000000):
    z = y + 1  # no synchronization needed, y is a CPU tensor

end = time.time()

total = (end - begin) / 1000000; total

5.057124853134155e-06

The GPU still takes longer, so should I carry out broadcasting operations on the CPU when using tensors with relatively small shapes?

Like I said, it depends on the workload.
If you would like to use only 3 numbers and your tensor is already on the CPU, just use the CPU.
Here is a quick comparison:

import time
import torch


def time_gpu(size, nb_iter):
    # time a broadcast add on the GPU, synchronizing around the timed region
    x = torch.randn(size).to('cuda')
    torch.cuda.synchronize()  # before starting the timer
    begin = time.time()

    for i in range(nb_iter):
        z = x + 1

    torch.cuda.synchronize()  # before ending the timer
    end = time.time()
    total = (end - begin) / nb_iter
    return total


def time_cpu(size, nb_iter):
    # time the same broadcast add on the CPU (runs synchronously)
    y = torch.randn(size)
    begin = time.time()
    for i in range(nb_iter):
        z = y + 1

    end = time.time()
    total = (end - begin) / nb_iter
    return total


sizes = [10**p for p in range(9)]  # tensor sizes from 1 to 10**8 elements

times_gpu = []
for size in sizes:
    times_gpu.append(time_gpu(size, 10000))

times_cpu = []
for size in sizes:
    times_cpu.append(time_cpu(size, 1000))


import numpy as np
import matplotlib.pyplot as plt

fig, axarr = plt.subplots(2, 1)

axarr[0].plot(np.log10(np.array(sizes)),
              np.array(times_gpu),
              label='gpu')
axarr[0].plot(np.log10(np.array(sizes)),
              np.array(times_cpu),
              label='cpu')
axarr[0].legend()
axarr[0].set_xlabel('log10(size)')
axarr[0].set_ylabel('time in seconds')

axarr[1].plot(np.log10(np.array(sizes)),
              np.log10(np.array(times_gpu)),
              label='gpu')
axarr[1].plot(np.log10(np.array(sizes)),
              np.log10(np.array(times_cpu)),
              label='cpu')
axarr[1].legend()
axarr[1].set_xlabel('log10(size)')
axarr[1].set_ylabel('log10(time)')

[Plot: GPU vs. CPU broadcast time as a function of tensor size, in linear and log scale]

As you can see, the broadcast operation has an approximately constant overhead on the GPU up to a size of about 10**5 elements.
For larger tensors, the GPU is an order of magnitude faster.
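
To check the crossover for a specific size, the two helper functions above can be reused directly; the exact numbers will depend on your hardware:

size = 10**7  # a size well past the constant-overhead regime
print('gpu:', time_gpu(size, 100))
print('cpu:', time_cpu(size, 100))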
