cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [381,0,0], thread: [66,0,0] Assertion `srcIndex < srcSelectDimSize` failed. RuntimeError: CUDA error: device-side assert triggered Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. Cou

I run several model instances in a multiprocessing pool to perform inference on concurrent requests in a multithreaded environment. At first everything works, but after the program has been running for about 10 to 20 minutes it throws the exception below. The time of failure varies each time the program is restarted after the exception.

/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [180,0,0], thread: [92,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [180,0,0], thread: [93,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [210,0,0], thread: [92,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [210,0,0], thread: [93,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [210,0,0], thread: [94,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [210,0,0], thread: [95,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [381,0,0], thread: [64,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [381,0,0], thread: [65,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [381,0,0], thread: [66,0,0] Assertion srcIndex < srcSelectDimSize failed.

RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Could you please tell me how to solve this problem?
Thanks!

Based on the error message, you are running into an indexing error (an index is greater than or equal to the size of the dimension being selected from), so you would need to narrow down which operation fails. Setting CUDA_LAUNCH_BLOCKING=1 usually helps to isolate which call triggers this error, but I’m not sure how it would behave in your custom multi-processing/multi-threading environment.
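
As a rough sketch of that debugging approach, the snippet below sets CUDA_LAUNCH_BLOCKING in the process environment and adds a host-side bounds check, so that a bad index raises a readable Python exception instead of a device-side assert. The helper names `find_out_of_range` and `check_indices` are made up for illustration, and the assumption that the failing indices are inputs to an embedding-style lookup is only a guess based on the `indexSelectLargeIndex` kernel name. The variable has to be set before CUDA is initialized, which in a multiprocessing setup means in every worker process.

```python
import os

# Assumption: this runs before the first CUDA call in the process (and in each
# worker process), so kernel launches are synchronous and the Python stack
# trace points at the operation that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def find_out_of_range(ids, size):
    """Return (position, value) pairs for values in `ids` outside [0, size)."""
    return [(pos, v) for pos, v in enumerate(ids) if not 0 <= v < size]


def check_indices(ids, size, name="indices"):
    """Raise IndexError on the host if any index is out of range.

    `ids` is a flat sequence of ints. For a tensor, pass
    `tensor.flatten().tolist()`.
    """
    bad = find_out_of_range(ids, size)
    if bad:
        raise IndexError(
            f"{name}: {len(bad)} value(s) outside [0, {size}), "
            f"first few (position, value): {bad[:5]}"
        )
```

A call such as `check_indices(input_ids.flatten().tolist(), embedding.num_embeddings)` placed just before the forward pass (where `input_ids` and `embedding` stand in for your own tensor and layer) would report the offending request. The check copies the indices to the host on every call, so it is meant for debugging rather than production use.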