CUDA streams not running in parallel for element-wise ops / matrix extraction

Dear all,

I’m trying to perform element-wise operations (torch.multiply) or extract weight matrices (via indexing) on a different stream while a torch.matmul is running. However, after analyzing the CUDA kernels with Nsight Systems, I realized these operations are not parallelized; for example, I can’t run torch.multiply asynchronously alongside torch.matmul, nor run two torch.multiply calls asynchronously on two streams.

My guess is that an operation needs to be large enough for CUDA to run it concurrently with other work, but then I don’t have enough GPU DRAM or threads left over for a second kernel; conversely, when I do have adequate GPU resources, the small operations still don’t run concurrently. How can this issue be resolved? Thank you!

Here is my system info:

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA version:", torch.version.cuda)
print(torch.__version__)
print(torch.version.cuda)
print(torch.backends.cudnn.version())
'''
PyTorch version: 2.0.1+cu117
CUDA available: True
CUDA version: 11.7
2.0.1+cu117
11.7
8906
System: Ubuntu 20.04
GPU: NVIDIA GeForce RTX 4090
'''

Here is my code:

import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = torch.rand(1000, 1000, device=device, dtype=torch.float16)
b = torch.rand(1000, 1000, device=device, dtype=torch.float16)
a_large = torch.rand(10000, 10000, device=device, dtype=torch.float16)
b_large = torch.rand(10000, 10000, device=device, dtype=torch.float16)

# Create two CUDA streams
stream1 = torch.cuda.Stream()
stream2 = torch.cuda.Stream()

# ----test multiply async
with torch.cuda.stream(stream1):
    for i in range(100000):
        result1 = torch.multiply(a, a)  # a * a
       
with torch.cuda.stream(stream2):
    for i in range(100000):
        result2 = torch.multiply(b, b)  # b * b

torch.cuda.synchronize()

# ----test matmul, multiply async
layer1 = torch.nn.Linear(80, 80, device=device)
layer2 = torch.nn.Linear(80, 80, device=device)
input_tensor = torch.randn(128, 80, device=device)

with torch.cuda.stream(stream1):
    for i in range(100000):
        result1 = layer2(layer1(input_tensor))  # two stacked linear layers (matmuls)

with torch.cuda.stream(stream2):
    for i in range(100000):
        result2 = torch.multiply(b, b)  # b * b

torch.cuda.synchronize()

# ----test matmul, select indices async
layer1 = torch.nn.Linear(80, 80, device=device)
layer2 = torch.nn.Linear(80, 80, device=device)
input_tensor = torch.randn(128, 80, device=device)
select_indices = torch.arange(500, device=device)

with torch.cuda.stream(stream1):
    for i in range(100000):
        result1 = layer2(layer1(input_tensor))  # two stacked linear layers (matmuls)

with torch.cuda.stream(stream2):
    for i in range(100000):
        _ = b[:, select_indices]

torch.cuda.synchronize()

# ----test matmul, select indices async (large matrix)
layer1 = torch.nn.Linear(4000, 4000, device=device)
layer2 = torch.nn.Linear(4000, 4000, device=device)
input_tensor = torch.randn(128, 4000, device=device)
select_indices = torch.arange(5000, device=device)

with torch.cuda.stream(stream1):
    for i in range(100000):
        result1 = layer2(layer1(input_tensor))  # two stacked linear layers (matmuls)

with torch.cuda.stream(stream2):
    for i in range(100000):
        _ = b_large[:, select_indices]

torch.cuda.synchronize()
exit()

I ran it with nsys profile -w true --gpu-metrics-device=0 -x true --force-overwrite=true -o my_profile_simpletest python simpletest.py
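As a side note, labeling each stream’s loop with NVTX ranges can make the two regions easier to locate on the Nsight Systems timeline. A minimal sketch (the range names are just illustrative and not part of the script above):

import torch

device = torch.device("cuda")
a = torch.rand(1000, 1000, device=device, dtype=torch.float16)
b = torch.rand(1000, 1000, device=device, dtype=torch.float16)
stream1 = torch.cuda.Stream()
stream2 = torch.cuda.Stream()

# Each push/pop pair shows up as a named span in the nsys timeline.
torch.cuda.nvtx.range_push("stream1_multiply")
with torch.cuda.stream(stream1):
    for _ in range(1000):
        _ = torch.multiply(a, a)
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("stream2_multiply")
with torch.cuda.stream(stream2):
    for _ in range(1000):
        _ = torch.multiply(b, b)
torch.cuda.nvtx.range_pop()

torch.cuda.synchronize()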

Is there a way to make CUDA directly parallelize small operations in PyTorch?

This post might be helpful in explaining the compute resources, and it links to a great GTC talk explaining it in more detail.

Thanks, I will take a look. I’ve noticed that I cannot limit CUDA resource usage in PyTorch, for example, cap the maximum number of GPU threads a stream may use. I want to double-check whether this is correct.

I assume you mean you cannot limit the SM usage? If so, then yes, you cannot limit it in PyTorch, and all kernels will be able to use all available compute resources.
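One related knob that does exist (just as an illustration, not a way to cap resources) is the stream priority hint, which can influence which stream’s blocks the scheduler prefers when both have work queued, but it does not reserve or limit SMs/threads. A minimal sketch:

import torch

# priority=-1 is higher priority than the default 0; this is only a scheduling
# hint and does not restrict how many SMs or threads a kernel may occupy.
high_prio = torch.cuda.Stream(priority=-1)
low_prio = torch.cuda.Stream(priority=0)

x = torch.rand(4000, 4000, device="cuda")

with torch.cuda.stream(high_prio):
    y = x @ x  # preferred by the scheduler when both streams have work queued

with torch.cuda.stream(low_prio):
    z = x * x

torch.cuda.synchronize()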