I conducted a simple experiment on DL inference as follows:
```python
import threading

import torch

# warm up
model.cuda()
img = img.cuda()
with torch.no_grad():
    for _ in range(100):
        y = model(img)
torch.cuda.synchronize()

@torch.no_grad()
def test():
    for _ in range(10):
        out = model(img)
    torch.cuda.synchronize()

# case 1: run inference directly on the main thread
test()

# case 2: run the same inference on a child thread
child_thread = threading.Thread(target=test)  # … other args elided
child_thread.start()
child_thread.join()
```
After running the experiment, I found that the inference time for Case 2 is significantly longer than for Case 1. Profiling the program with Nsight Systems revealed that there are no cudaMalloc calls in Case 1 during inference, whereas a substantial number of cudaMalloc calls are present in Case 2. This raises the question of how multi-threading interacts with memory management in PyTorch. Could you give some suggestions? Thanks so much!
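For reference, here is a rough Python-level cross-check of the same effect (my own sketch, reusing `model`, `img`, and `test()` from above): `torch.cuda.memory_reserved()` reports the bytes the caching allocator currently holds, which should grow only when it has to call cudaMalloc for a new segment.

```python
import threading

import torch

def reserved_mb() -> float:
    # Bytes currently held by the CUDACachingAllocator for this device;
    # growth here corresponds to new cudaMalloc'd segments.
    return torch.cuda.memory_reserved() / 1e6

before = reserved_mb()
test()  # case 1: main thread
print(f"case 1 grew the pool by {reserved_mb() - before:.1f} MB")

before = reserved_mb()
child_thread = threading.Thread(target=test)  # case 2: child thread
child_thread.start()
child_thread.join()
print(f"case 2 grew the pool by {reserved_mb() - before:.1f} MB")
```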
Just a hunch, but I would check if this is because the CUDACachingAllocator doesn’t mix allocations across different threads/streams. Do you observe the same behavior when also doing the warmup on the child thread?
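Something like this untested sketch, where both the warmup and the timed loop run entirely on the child thread (`warmup_and_test` is just an illustrative name, reusing your `model`/`img`):

```python
import threading

import torch

@torch.no_grad()
def warmup_and_test():
    # Warm up on this thread, so any allocator/stream state tied to the
    # executing thread is populated here rather than on the main thread.
    for _ in range(100):
        model(img)
    torch.cuda.synchronize()
    # Timed loop, same as your test()
    for _ in range(10):
        out = model(img)
    torch.cuda.synchronize()

child_thread = threading.Thread(target=warmup_and_test)
child_thread.start()
child_thread.join()
```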
Thanks, I have modified some code in CUDACachingAllocator to successfully mix allocations across different streams. Now I want to achieve the same across different threads. How can I do that?
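For reference, one Python-level workaround I'm aware of is pinning every thread to a single shared explicit stream, since the caching allocator pools free blocks per stream. This is my own sketch (`test_on_shared_stream` is an illustrative name), not the allocator change itself:

```python
import threading

import torch

shared_stream = torch.cuda.Stream()

@torch.no_grad()
def test_on_shared_stream():
    # While this stream is current, allocations should land in the same
    # per-stream pool regardless of which thread is executing.
    with torch.cuda.stream(shared_stream):
        for _ in range(10):
            out = model(img)
    torch.cuda.synchronize()

for _ in range(2):
    child_thread = threading.Thread(target=test_on_shared_stream)
    child_thread.start()
    child_thread.join()
```

But I'd prefer the allocator itself to handle reuse across threads without forcing a shared stream.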