I run inference for concurrent requests in a multithreaded environment, using several model instances managed by a multiprocessing pool. Everything works at first, but after the program has been running for about 10 to 20 minutes an exception is thrown (the time differs on each run after the program is restarted):
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [180,0,0], thread: [92,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [180,0,0], thread: [93,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [210,0,0], thread: [92,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [210,0,0], thread: [93,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [210,0,0], thread: [94,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [210,0,0], thread: [95,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [381,0,0], thread: [64,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [381,0,0], thread: [65,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [381,0,0], thread: [66,0,0] Assertion srcIndex < srcSelectDimSize failed.
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
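As far as I understand, the assertion means that an index passed to an index-select style lookup (for example an embedding layer) is not smaller than the size of the dimension being indexed. Below is a minimal sketch of a check that could run on the CPU before the forward pass to find such values. The function name, the sample batch, and the table size of 10 are hypothetical and only for illustration; they are not taken from my program.

```python
from typing import Iterable, List, Tuple


def find_out_of_range(
    ids: Iterable[Iterable[int]], size: int
) -> List[Tuple[int, int, int]]:
    """Return (row, column, value) for every index outside [0, size)."""
    bad = []
    for r, row in enumerate(ids):
        for c, value in enumerate(row):
            # A lookup into a table with `size` rows expects 0 <= value < size.
            if value < 0 or value >= size:
                bad.append((r, c, value))
    return bad


# Hypothetical batch of indices checked against a hypothetical table of 10 rows.
batch = [[1, 5, 9], [3, 10, -1]]
for row, col, value in find_out_of_range(batch, 10):
    print(f"row {row}, column {col}: index {value} is out of range")
```

With PyTorch tensors, the same check could presumably be fed with the input converted via tolist() and the num_embeddings attribute of the embedding layer. Running the program with the environment variable CUDA_LAUNCH_BLOCKING=1 should also make the Python stack trace point at the call that actually triggers the assert.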
Could you please tell me how to solve this problem?
Thanks!