Multi-thread and cudaMalloc

I conducted a simple experiment on DL inference as follows:

```python
# warm up
img = img.cuda()
with torch.no_grad():
    for _ in range(100):
        y = model(img)

def test():
    for _ in range(10):
        out = model(img)
```

```python
# case 1: call test() directly on the main thread
test()

# case 2: run test() on a child thread
child_thread = threading.Thread(target=test, …)
child_thread.start()
child_thread.join()
```
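For completeness, the per-case timing can be taken with a host timer plus explicit synchronization so the GPU work is fully captured; a sketch, assuming the definitions above (the `timed` helper is mine, not part of the original setup):

```python
import threading
import time

import torch

def timed(fn):
    # Synchronize before and after so the host-side timer measures the
    # GPU work launched by fn, not just the kernel-launch overhead.
    torch.cuda.synchronize()
    start = time.perf_counter()
    fn()
    torch.cuda.synchronize()
    return time.perf_counter() - start

def run_in_child_thread():
    t = threading.Thread(target=test)
    t.start()
    t.join()

t1 = timed(test)                 # case 1: main thread
t2 = timed(run_in_child_thread)  # case 2: child thread
print(f"case 1: {t1:.4f}s, case 2: {t2:.4f}s")
```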

After running the experiment, I found that the inference time for Case 2 is significantly longer than for Case 1. Profiling the program with Nsight Systems revealed that there are no cudaMalloc calls in Case 1 during inference, whereas a substantial number of cudaMalloc calls are present in Case 2. This raises the question of how multi-threading interacts with memory management in PyTorch. Could you give some suggestions? Thanks so much!

Just a hunch, but I would check if this is because the CUDACachingAllocator doesn’t mix allocations across different threads/streams. Do you observe the same behavior when also doing the warmup on the child thread?

You are right! When I do the warmup on the child thread, there are no cudaMalloc calls during inference.
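Concretely, moving the warmup into the thread's target function is enough; a minimal sketch, assuming `model` and `img` are set up as in the original post:

```python
import threading

import torch

def warmup_and_test():
    # Warmup and timed inference run on the same thread, so the blocks the
    # CUDACachingAllocator caches during warmup are reused during inference
    # instead of triggering fresh cudaMalloc calls.
    with torch.no_grad():
        for _ in range(100):   # warmup
            y = model(img)
        for _ in range(10):    # inference, now served from the cache
            out = model(img)

child_thread = threading.Thread(target=warmup_and_test)
child_thread.start()
child_thread.join()
```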

I need to mix allocations across different streams for my development. I wonder how I can mix allocations across different threads?

In general I think this is a tricky use-case. I would take a look at the record_stream and wait_stream functions for managing tensor lifetimes and stream ordering, respectively:
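A minimal sketch of how the two are typically combined (the tensors and shapes are just for illustration): `wait_stream` handles ordering between streams, while `record_stream` extends a tensor's lifetime in the eyes of the caching allocator:

```python
import torch

s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

with torch.cuda.stream(s1):
    # y's memory is allocated on, and associated with, stream s1
    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x

# Ordering: make s2 wait for all work queued on s1 so far,
# so y is fully computed before s2 reads it.
s2.wait_stream(s1)

with torch.cuda.stream(s2):
    z = y.relu()

# Lifetime: tell the caching allocator that y is also used on s2, so its
# block is not returned to s1's pool (and reused) before s2 is done with it.
y.record_stream(s2)
```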

Thanks, I have modified some code in CUDACachingAllocator to mix allocations across different streams successfully. Now I want to achieve that across different threads. How can I do that?