Torch.sum does not benefit from parallelism?

This is my time taken (seconds) vs. batch size for 1000 summation operations on an NVIDIA A100-SXM4-80GB:

Actual times:


This is my code:

import torch
import time

for bs in [4, 8, 16, 32, 64, 128, 256]:
    tensor = torch.randn(bs, 512, 25000).to('cuda')

    start = time.time()
    for i in range(1000):
        torch.sum(tensor)  # loop body was lost from the post; a full-tensor torch.sum is assumed
    print(time.time() - start)


| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:01:00.0 Off |                    0 |
| N/A   46C    P0             302W / 500W |  12816MiB / 81920MiB |    100%      Default |
|                                         |                      |             Disabled |
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |

Am I using too much memory, and is that causing things to run slowly?

Torch internally uses the numpy sum operations, so here I do the same with a numpy array:

import torch
import time
import numpy as np

for bs in [4, 8, 16, 32, 64, 128, 256]:
    #tensor = np.random.rand(bs, 512, 25000)
    tensor = np.random.rand(bs, 256, 256)
    start = time.time()
    for i in range(1000):
        np.sum(tensor)  # loop body was lost from the post; a full-array np.sum is assumed
    print(time.time() - start)


The array's data has to be read from memory, and a certain number of CPU clock cycles are spent calculating the sum, so a larger array takes longer.

For tensor level parallelization, you might want to try this

Your profiling is wrong since you are not synchronizing before starting and stopping the host timers while trying to profile asynchronous CUDA kernels.
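One way to time the loop fairly is to call `torch.cuda.synchronize()` before reading the clock on either side of it, so that all queued kernels have finished. A minimal sketch, assuming a full-tensor `torch.sum` is the operation being timed; the helper name `time_sums` and the small shapes in the usage loop are illustrative and not from the original post, and the code falls back to the CPU when no GPU is present:

```python
import time

import torch


def time_sums(shape, iters=1000, device=None):
    """Return the wall-clock seconds for `iters` full-tensor sums (hypothetical helper)."""
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    tensor = torch.randn(*shape, device=device)
    use_cuda = tensor.is_cuda

    # Warm up so one-time initialization is not included in the measurement.
    torch.sum(tensor)
    if use_cuda:
        torch.cuda.synchronize()  # wait for pending work before starting the clock

    start = time.perf_counter()
    for _ in range(iters):
        torch.sum(tensor)  # on CUDA this enqueues the kernel and returns immediately
    if use_cuda:
        torch.cuda.synchronize()  # wait for all queued kernels before stopping the clock
    return time.perf_counter() - start


if __name__ == "__main__":
    # Smaller than the shapes in the question so it also runs on modest hardware.
    for bs in [4, 8, 16, 32]:
        print(bs, time_sums((bs, 512, 1024), iters=100))
```

`torch.cuda.Event` objects with `enable_timing=True` are another common way to measure GPU time; the sketch above keeps the host timer from the question and only adds the synchronization.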
That being said, these kernels are memory-bandwidth bound, so an increase in kernel duration proportional to the tensor size would be expected.
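As a rough back-of-the-envelope check of the bandwidth argument: a full sum has to read every element once, so it cannot finish faster than the tensor's size in bytes divided by the memory bandwidth. The sketch below assumes float32 elements and a peak bandwidth of about 2039 GB/s, the published figure for the A100 80GB SXM; achieved bandwidth is lower in practice, so the results are lower bounds only. `min_sum_seconds` is a hypothetical helper name.

```python
A100_PEAK_BYTES_PER_S = 2.039e12  # assumed peak bandwidth (~2039 GB/s), not a measured value
BYTES_PER_FLOAT32 = 4


def min_sum_seconds(shape, bandwidth=A100_PEAK_BYTES_PER_S):
    """Lower bound on the time of one full sum: bytes to read / bandwidth."""
    numel = 1
    for dim in shape:
        numel *= dim
    return numel * BYTES_PER_FLOAT32 / bandwidth


if __name__ == "__main__":
    for bs in [4, 8, 16, 32, 64, 128, 256]:
        per_sum = min_sum_seconds((bs, 512, 25000))
        # 1000 sums, matching the loop in the question.
        print(f"bs={bs:3d}  >= {per_sum * 1e3:.3f} ms per sum, "
              f">= {per_sum * 1000:.2f} s per 1000 sums")
```

For `bs=256` the tensor holds 13,107,200,000 bytes (12,500 MiB, in line with the memory usage reported by `nvidia-smi`), which gives a bound of roughly 6.4 ms per sum. Doubling the batch size doubles the bytes to read and therefore doubles the bound, regardless of how many cores are available.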


@ptrblck Can you share some links etc. that would explain the CUDA parallelization that you mentioned?

Imagine you had a person you go to who does your sums for you. You pass them a box full of numbers and ask them to sum them up. This person has to sum them one number at a time. If you had two people, you could roughly halve the time by splitting the numbers from the box equally between them; the two people then come together at the end and add up their own results. If you have three people, it takes roughly one third of the time. But no matter how many people you have, if you double the quantity of numbers in the box then the time taken to complete the summation will double. Each person is a CUDA core, and the job of summing the numbers can be called a CUDA kernel.
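The analogy can be sketched in plain Python: split the box into equal piles, let each worker sum its own pile, then add the partial results. This is only an illustration of the idea using threads from the standard library. It is not how CUDA or PyTorch implement reductions, and Python threads will not actually make this particular loop faster. `parallel_sum` and `num_workers` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum(numbers, num_workers=4):
    """Sum `numbers` by giving each worker one contiguous pile, then combining."""
    if not numbers:
        return 0
    # Ceiling division so every number lands in exactly one pile.
    pile_size = -(-len(numbers) // num_workers)
    piles = [numbers[i:i + pile_size] for i in range(0, len(numbers), pile_size)]

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(sum, piles))  # each "person" sums their own pile

    return sum(partial_sums)  # the workers come together and add their results


if __name__ == "__main__":
    box = list(range(1, 101))
    print(parallel_sum(box, num_workers=4))  # 5050, same as sum(box)
```

Each worker still handles about `len(numbers) / num_workers` numbers, so with a fixed number of workers, doubling the box doubles every pile and with it the total time.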


@Brock_Brown Thanks for the explanation. So it is more like MapReduce. This means that CUDA must have some kind of orchestrator module.

Time to do some reading on CUDA 🙂