After inspecting the generated `xxx.wrapper.cpp`, I see something like:

```cpp
static constexpr int64_t int_array_1[] = {1L, };
AtenTensorHandle pool2_handle;
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(
    1, int_array_0, int_array_1,
    cached_torch_dtype_uint8,
    cached_torch_device_type_cuda,
    this->device_idx_,
    &pool2_handle));
RAIIAtenTensorHandle pool2(pool2_handle);
// … later alloc_from_pool calls …
pool2.reset();
```
After loader1 finishes capturing its CUDA graph and releases its pool (the `pool2.reset()` above), loader2 may be handed the same virtual address for its own pool allocation. If I then replay both loaders concurrently on two different streams, could this cause memory corruption or undefined behavior?
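To make the concern concrete, here is a minimal stand-alone sketch of the pattern I am worried about, using raw CUDA runtime calls in place of the AOTInductor wrapper (the `captureLoader` helper and the `fill` kernel are hypothetical; `cudaMalloc`/`cudaFree` stand in for `aoti_torch_empty_strided` and `pool2.reset()`). The key point is that the pool's raw pointer value is baked into the instantiated graph at capture time, so if the allocator reuses that address for the second loader, concurrent replays would write to the same memory:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(unsigned char* p, unsigned char v, size_t n) {
    for (size_t i = threadIdx.x; i < n; i += blockDim.x) p[i] = v;
}

// Capture one "loader": allocate a pool, record a kernel writing into it,
// then free the pool after capture (mirroring pool2.reset() in the wrapper).
// The pointer value is captured into the graph and fixed at instantiation.
static cudaGraphExec_t captureLoader(unsigned char v, size_t n, void** bakedAddr) {
    unsigned char* pool = nullptr;
    cudaMalloc(&pool, n);                        // stands in for aoti_torch_empty_strided
    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaGraph_t graph;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    fill<<<1, 128, 0, s>>>(pool, v, n);          // pool pointer baked into the graph
    cudaStreamEndCapture(s, &graph);
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    cudaFree(pool);                              // mirrors pool2.reset(); address may be reused
    cudaStreamDestroy(s);
    *bakedAddr = pool;
    return exec;
}

int main() {
    void *a1 = nullptr, *a2 = nullptr;
    const size_t n = 1 << 20;
    cudaGraphExec_t g1 = captureLoader(0x11, n, &a1);
    cudaGraphExec_t g2 = captureLoader(0x22, n, &a2); // may receive the same address
    printf("baked pool addresses: %p vs %p (%s)\n", a1, a2,
           a1 == a2 ? "ALIASED" : "distinct");
    if (a1 == a2) {
        // Deliberately unsafe: concurrent replays now target the same
        // (already freed) device memory -- the suspected race/UB scenario.
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);
        cudaGraphLaunch(g1, s1);
        cudaGraphLaunch(g2, s2);
        cudaDeviceSynchronize();
    }
    return 0;
}
```

With plain `cudaMalloc`/`cudaFree` the addresses usually do coincide across back-to-back allocations of the same size, and with PyTorch's caching allocator reuse seems even more likely, which is what prompts the question above.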