GPU vs CPU speed in torch.histc

I have two functions for calculating image entropy. The first (ImageEntropy_gpu) uses torch.histc and runs on the GPU (NVIDIA 2070); the second (ImageEntropy_cpu) uses NumPy and runs on the CPU (i7-7800X):

def ImageEntropy_gpu(img):
    sz = img.numel()
    hist_probability = torch.histc(img.view(-1), bins=256) / sz
    nonzero_probability = hist_probability[hist_probability > 0]
    entropy = -torch.sum(torch.mul(nonzero_probability, torch.log2(nonzero_probability))).item()
    return round(entropy, 4)

def ImageEntropy_cpu(img):
    marg = np.histogramdd(np.ravel(img), bins=256)[0] / img.size
    marg = list(filter(lambda p: p > 0, np.ravel(marg)))
    entropy = -np.sum(np.multiply(marg, np.log2(marg)))
    return round(entropy, 4)
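As a quick sanity check (my own sketch, not part of the original post), the two implementations should return the same entropy for the same image. This assumes integer-valued pixels; 0 and 255 are forced into the data so NumPy and torch pick identical histogram ranges:

```python
import numpy as np
import torch

# Random "image" with integer pixel values; force 0 and 255 to be present
# so both libraries infer the same data-dependent histogram range.
img_np = np.random.randint(0, 256, (64, 64)).astype(np.float32)
img_np.flat[0], img_np.flat[1] = 0.0, 255.0

# NumPy/CPU entropy, as in ImageEntropy_cpu
marg = np.histogramdd(np.ravel(img_np), bins=256)[0] / img_np.size
marg = marg[marg > 0]
entropy_cpu = round(float(-np.sum(marg * np.log2(marg))), 4)

# torch entropy, as in ImageEntropy_gpu (run on CPU here; move the tensor
# with .cuda() to exercise the GPU path)
t = torch.from_numpy(img_np)
hist_probability = torch.histc(t.view(-1), bins=256) / t.numel()
nonzero_probability = hist_probability[hist_probability > 0]
entropy_torch = round(-torch.sum(nonzero_probability * torch.log2(nonzero_probability)).item(), 4)

print(entropy_cpu, entropy_torch)  # the two values should agree
```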
The CPU version (ImageEntropy_cpu) runs 20 times FASTER than the GPU version (ImageEntropy_gpu)!
Can someone explain why this happens? Is something wrong in these two functions?

The performance will depend on the workload you are deploying to the device.
While small workloads will be faster on the CPU due to the kernel launch latency on the GPU, you should see a speedup for bigger sizes.
Using this code:

import time
import torch

nb_iters = 100

for device in ['cpu', 'cuda']:
    for s in torch.logspace(1, 8, steps=8).int():
        img = torch.randint(0, 256, (s,), device=device).float()
        t0 = time.time()
        for _ in range(nb_iters):
            out = torch.histc(img, bins=256)
        t1 = time.time()
        print('device {}, size {}, time {}'.format(
            device, s, (t1 - t0)/nb_iters))

I see these results on an Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz and a V100:

device cpu, size 10, time 8.225440979003906e-06
device cpu, size 100, time 7.455348968505859e-06
device cpu, size 1000, time 1.646280288696289e-05
device cpu, size 10000, time 0.00010733842849731445
device cpu, size 100000, time 0.0010244464874267578
device cpu, size 1000000, time 0.010178050994873046
device cpu, size 10000000, time 0.10184217929840088
device cpu, size 100000000, time 1.036203260421753
device cuda, size 10, time 0.00018221616744995117
device cuda, size 100, time 0.00018065452575683594
device cuda, size 1000, time 0.00018761157989501953
device cuda, size 10000, time 0.00019278764724731446
device cuda, size 100000, time 0.00019084692001342774
device cuda, size 1000000, time 0.00026309967041015625
device cuda, size 10000000, time 0.0009484291076660156
device cuda, size 100000000, time 0.007037463188171386
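One caveat worth adding: CUDA kernels launch asynchronously, so timing them with time.time() alone can under-report the actual kernel runtime. A sketch of the same timing loop with explicit synchronization (my addition, assuming the same torch.histc call):

```python
import time
import torch

def time_histc(img, nb_iters=100):
    # torch.cuda.synchronize() blocks until all queued GPU work is done;
    # without it, the wall-clock measurement around asynchronous CUDA
    # launches mostly captures launch overhead, not kernel time.
    if img.is_cuda:
        torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(nb_iters):
        out = torch.histc(img, bins=256)
    if img.is_cuda:
        torch.cuda.synchronize()
    return (time.time() - t0) / nb_iters

# CPU example; pass device='cuda' to torch.randint to benchmark the GPU path
img = torch.randint(0, 256, (10000,)).float()
print(time_histc(img))
```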

Thanks, your code and reply are perfect.
I didn't know about the "kernel launch latency on the GPU", i.e. the total latency of launching kernels on the GPU plus the data-transfer latency between the CPU and GPU.
Thank you :wave:
