Multi-thread and cudaMalloc

I conducted a simple experiment on DL inference as follows:

```python
import threading

import torch

# warm up
model.cuda()
img = img.cuda()
with torch.no_grad():
    for _ in range(100):
        y = model(img)
torch.cuda.synchronize()

@torch.no_grad()
def test():
    for _ in range(10):
        out = model(img)
    torch.cuda.synchronize()

# case 1
test()

# case 2
child_thread = threading.Thread(target=test)  # …
child_thread.start()
child_thread.join()
```
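
A minimal sketch of how the two cases could be timed (wall-clock timing is valid here because `test()` ends with `torch.cuda.synchronize()`; the labels are just illustrative):

```python
import time

t0 = time.perf_counter()
test()  # case 1: run directly on the main thread
print(f"case 1: {time.perf_counter() - t0:.3f} s")

t0 = time.perf_counter()
child_thread = threading.Thread(target=test)  # case 2: run on a child thread
child_thread.start()
child_thread.join()
print(f"case 2: {time.perf_counter() - t0:.3f} s")
```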

After timing both runs, I found that the inference time for Case 2 is significantly longer than for Case 1. Profiling the program with Nsight Systems revealed that there are no cudaMalloc calls in Case 1 during inference, whereas a substantial number of cudaMalloc calls appear in Case 2. This raises the question of how multi-threading interacts with memory management in PyTorch. Could you give some suggestions? Thanks so much!

Just a hunch, but I would check if this is because the CUDACachingAllocator doesn’t mix allocations across different threads/streams. Do you observe the same behavior when also doing the warmup on the child thread?


You are right! When I do the warmup on the child thread as well, there are no cudaMalloc calls during inference.
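
For concreteness, a minimal sketch of what that looks like (`warmup_and_test` is a hypothetical helper; `model` and `img` are the same objects as in the snippet above):

```python
import threading

import torch

@torch.no_grad()
def warmup_and_test():
    # warm up on the same thread that will later run the timed inference,
    # so the cached allocator blocks are created by this thread/stream
    for _ in range(100):
        y = model(img)
    torch.cuda.synchronize()

    # timed inference: in my run, no cudaMalloc shows up here because
    # the cached blocks from the warmup above are reused
    for _ in range(10):
        out = model(img)
    torch.cuda.synchronize()

child_thread = threading.Thread(target=warmup_and_test)
child_thread.start()
child_thread.join()
```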

I have already mixed allocations across different streams in my development. I wonder how I can mix allocations across different threads?

In general I think this is a tricky use case. I would take a look at the record_stream and wait_stream functions for managing tensor lifetimes and ordering, respectively:
https://pytorch.org/docs/stable/generated/torch.Tensor.record_stream.html
https://pytorch.org/docs/stable/generated/torch.cuda.Stream.html#torch.cuda.Stream.wait_stream
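
A minimal sketch of how they are typically combined when a tensor is produced on one stream and consumed on another (the stream names are just illustrative):

```python
import torch

producer = torch.cuda.Stream()
consumer = torch.cuda.Stream()

with torch.cuda.stream(producer):
    x = torch.randn(1024, 1024, device="cuda")
    y = x.matmul(x)  # allocated and computed on the producer stream

# ordering: make the consumer stream wait for the work queued on the producer
consumer.wait_stream(producer)

with torch.cuda.stream(consumer):
    z = y.sum()
    # lifetime: tell the caching allocator that y is also used on the consumer
    # stream, so its memory is not reused until that work has finished
    y.record_stream(consumer)
```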

Thanks, I have modified some code in CUDACachingAllocator to mix allocations across different streams successfully. Now I want to achieve the same across different threads. How can I do that?