A series of questions about torch.load and tensor.to(device) during grouped_gemm on multiple GPUs

I built several tensor groups to test grouped_gemm. The program runs fine on a single GPU, but when I switch to a multi-GPU environment and move the tensors to different devices at different times, the program fails, and the error differs each time.

The 1st case:

def randn(bs, x, y, device_id):
    out = (torch.rand(bs, x, y) - 0.5 * 2) / (y * x)
    device = torch.device("cuda:%d" % device_id if torch.cuda.is_available() else "cpu")
    return out.to(device).to(torch.bfloat16)

a = randn(bs, x, y, device_id)
b = randn(bs, x, y, device_id)

Yes, I move each tensor to its specific device when I build it, and the errors are:
group_1: RuntimeError: Failed to run CUTLASS Grouped GEMM
group_2…n:
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

The 2nd case:

def randn(bs, x, y):
    out = (torch.rand(bs, x, y) - 0.5 * 2) / (y * x)
    return out.cuda().to(torch.bfloat16)

a = a.to(torch.device('cuda:%d' % device_id))
b = b.to(torch.device('cuda:%d' % device_id))

Here I build the tensors on the default GPU and then move them to specific GPUs, and the program runs fine!
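For clarity, here is a tiny sketch (assuming nothing has changed the current device) of where the intermediate tensor lives in this pattern before the explicit copy:

import torch

if torch.cuda.is_available():
    t = torch.rand(2, 2).cuda()    # .cuda() with no argument uses the current (default) device
    print(t.device)                # cuda:0 unless torch.cuda.set_device was called earlier
    if torch.cuda.device_count() > 1:
        t = t.to(torch.device("cuda:1"))
        print(t.device)            # cuda:1 after the explicit copy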

The 3rd case:

a = torch.load(a_path, map_location=torch.device('cuda:%d' % device_id))
b = torch.load(b_path, map_location=torch.device('cuda:%d' % device_id))

When device_id is 0 it runs fine, but for any other device it fails in the subsequent computation at backend.gmm: RuntimeError: Failed to run CUTLASS Grouped GEMM

*And even if I fix device_id at 0 permanently, the next tensor group hits the same error.

Why would this happen? It seems so weird.
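For reference, here is a self-contained sketch of how the first case is driven across GPUs; the shapes are made up and the grouped-GEMM call is replaced by a stand-in torch.bmm, since the real backend.gmm call and its arguments are not shown here:

import torch

def randn(bs, x, y, device_id):
    out = (torch.rand(bs, x, y) - 0.5 * 2) / (y * x)
    device = torch.device("cuda:%d" % device_id if torch.cuda.is_available() else "cpu")
    return out.to(device).to(torch.bfloat16)

bs, x, y = 4, 128, 128  # placeholder shapes, not the real test sizes

for device_id in range(torch.cuda.device_count()):
    a = randn(bs, x, y, device_id)
    b = randn(bs, x, y, device_id)
    # Sanity check: both operands should live on the intended device.
    assert a.device == b.device == torch.device("cuda:%d" % device_id)
    # The real test calls the grouped_gemm backend here (backend.gmm);
    # torch.bmm is used only as a stand-in so the sketch runs on its own.
    out = torch.bmm(a, b)
    torch.cuda.synchronize(device_id)
    print("group on cuda:%d finished" % device_id)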

Could you run your first code snippet via cuda-gdb and post the backtrace of the illegal memory access here?

Thank you for your reply! Below is the output of running the 1st test under cuda-gdb:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

[New Thread 0x7fff60ac4700 (LWP 1001)]
[New Thread 0x7fff602c3700 (LWP 1002)]
[New Thread 0x7fff5bac2700 (LWP 1003)]
[New Thread 0x7fff55375700 (LWP 1004)]
[Thread 0x7fff5bac2700 (LWP 1003) exited]
[Thread 0x7fff602c3700 (LWP 1002) exited]
[Thread 0x7fff60ac4700 (LWP 1001) exited]
[Detaching after fork from child process 1005]
[New Thread 0x7fff5bac2700 (LWP 1012)]
[New Thread 0x7fff602c3700 (LWP 1013)]
Tensor created successfully on GPU 0
[New Thread 0x7fff60ac4700 (LWP 1015)]
Tensor created successfully on GPU 1
[New Thread 0x7fff48af1700 (LWP 1016)]
Tensor created successfully on GPU 2
[New Thread 0x7fff41dff700 (LWP 1017)]
Tensor created successfully on GPU 3
[New Thread 0x7fff4115d700 (LWP 1018)]
Tensor created successfully on GPU 4
[New Thread 0x7fff2dfff700 (LWP 1019)]
Tensor created successfully on GPU 5
[New Thread 0x7fff2d7fe700 (LWP 1020)]
Tensor created successfully on GPU 6
[New Thread 0x7fff2cb5c700 (LWP 1021)]
Tensor created successfully on GPU 7
GPU 3 memory summary:
[PyTorch CUDA memory summary, device ID 3 (CUDA OOMs: 0). The summary tabulates allocated, active, requested, GPU reserved, and non-releasable memory, allocation and active-allocation counts, reserved segments, non-releasable allocs, and oversize allocations/segments, each split into large and small pools; the numeric columns are not preserved in this paste.]

a.device: cuda:3

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x7fff27455280

Thread 1 "python" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 2, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 1, lane 0]
0x00007fff27455310 in void cutlass::Kernel<cutlass::gemm::kernel::GemmGrouped<cutlass::gemm::threadblock::MmaMultistage<cutlass::gemm::GemmShape<128, 128, 32>, cutlass::transform::threadblock::PredicatedTileAccessIterator<cutlass::MatrixShape<128, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajor, 1, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<32, 128>, 128, cutlass::PitchLinearShape<4, 8>, 8>, cutlass::Array<cutlass::bfloat16_t, 8, false>, false, cutlass::layout::NoPermute>, cutlass::transform::threadblock::RegularTileAccessIterator<cutlass::MatrixShape<128, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<16, 32>, 0, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<32, 128>, 128, cutlass::PitchLinearShape<4, 8>, 8>, 16>, (cutlass::arch::CacheOperation::Kind)1, cutlass::transform::threadblock::PredicatedTileAccessIterator<cutlass::MatrixShape<32, 128>, cutlass::bfloat16_t, cutlass::layout::RowMajor, 0, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<128, 32>, 128, cutlass::PitchLinearShape<8, 4>, 8>, cutlass::Array<cutlass::bfloat16_t, 8, false>, false, cutlass::layout::NoPermute>, cutlass::transform::threadblock::RegularTileAccessIterator<cutlass::MatrixShape<32, 128>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCongruous<16, 64>, 0, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<128, 32>, 128, cutlass::PitchLinearShape<8, 4>, 8>, 16>, (cutlass::arch::CacheOperation::Kind)1, float, cutlass::layout::RowMajor, cutlass::gemm::threadblock::MmaPolicy<cutlass::gemm::warp::MmaTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<16, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCongruous<16, 64>, float, cutlass::layout::RowMajor, cutlass::gemm::warp::MmaTensorOpPolicy<cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 16>, 32, cutlass::bfloat16_t, cutlass::layout::RowMajor, cutlass::bfloat16_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>, cutlass::MatrixShape<1, 1> >, 1, false, bool>, cutlass::MatrixShape<0, 0>, cutlass::MatrixShape<0, 0>, 1>, 4, (cutlass::gemm::SharedMemoryClearOption)0, bool>, cutlass::epilogue::threadblock::Epilogue<cutlass::gemm::GemmShape<128, 128, 32>, cutlass::gemm::warp::MmaTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<16, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCongruous<16, 64>, float, cutlass::layout::RowMajor, cutlass::gemm::warp::MmaTensorOpPolicy<cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 16>, 32, cutlass::bfloat16_t, cutlass::layout::RowMajor, cutlass::bfloat16_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>, cutlass::MatrixShape<1, 1> >, 1, false, bool>, 1, cutlass::epilogue::threadblock::PredicatedTileIterator<cutlass::epilogue::threadblock::OutputTileOptimalThreadMap<cutlass::epilogue::threadblock::OutputTileShape<128, 8, 2, 1, 1>, cutlass::epilogue::threadblock::OutputTileShape<1, 8, 1, 1, 8>, 128, 8, 16>, cutlass::bfloat16_t, false, cutlass::layout::NoPermute, false>, cutlass::epilogue::warp::FragmentIteratorTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, cutlass::gemm::GemmShape<16, 8, 16>, float, cutlass::Array<float, 4, true>, cutlass::layout::RowMajor>, cutlass::epilogue::warp::TileIteratorTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, cutlass::gemm::GemmShape<16, 8, 16>, float, cutlass::layout::RowMajor>, cutlass::epilogue::threadblock::SharedLoadIterator<cutlass::epilogue::threadblock::OutputTileOptimalThreadMap<cutlass::epilogue::threadblock::OutputTileShape<128, 8, 2, 1, 1>, cutlass::epilogue::threadblock::OutputTileShape<1, 8, 1, 1, 8>, 128, 8, 16>::CompactedThreadMap, float, 32>, cutlass::epilogue::thread::LinearCombination<cutlass::bfloat16_t, 8, float, float, (cutlass::epilogue::thread::ScaleType::Kind)0, (cutlass::FloatRoundStyle)2, cutlass::bfloat16_t>, cutlass::MatrixShape<0, 8>, 1, 1>, cutlass::gemm::threadblock::GemmBatchedIdentityThreadblockSwizzle, (cutlass::gemm::kernel::GroupScheduleMode)0, false> >(cutlass::gemm::kernel::GemmGrouped<cutlass::gemm::threadblock::MmaMultistage<cutlass::gemm::GemmShape<128, 128, 32>, cutlass::transform::threadblock::PredicatedTileAccessIterator<cutlass::MatrixShape<128, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajor, 1, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<32, 128>, 128, cutlass::PitchLinearShape<4, 8>, 8>, cutlass::Array<cutlass::bfloat16_t, 8, false>, false, cutlass::layout::NoPermute>, cutlass::transform::threadblock::RegularTileAccessIterator<cutlass::MatrixShape<128, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<16, 32>, 0, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<32, 128>, 128, cutlass::PitchLinearShape<4, 8>, 8>, 16>, (cutlass::arch::CacheOperation::Kind)1, cutlass::transform::threadblock::PredicatedTileAccessIterator<cutlass::MatrixShape<32, 128>, cutlass::bfloat16_t, cutlass::layout::RowMajor, 0, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<128, 32>, 128, cutlass::PitchLinearShape<8, 4>, 8>, cutlass::Array<cutlass::bfloat16_t, 8, false>, false, cutlass::layout::NoPermute>, cutlass::transform::threadblock::RegularTileAccessIterator<cutlass::MatrixShape<32, 128>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCongruous<16, 64>, 0, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<128, 32>, 128, cutlass::PitchLinearShape<8, 4>, 8>, 16>, (cutlass::arch::CacheOperation::Kind)1, float, cutlass::layout::RowMajor, cutlass::gemm::threadblock::MmaPolicy<cutlass::gemm::warp::MmaTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<16, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCongruous<16, 64>, float, cutlass::layout::RowMajor, cutlass::gemm::warp::MmaTensorOpPolicy<cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 16>, 32, cutlass::bfloat16_t, cutlass::layout::RowMajor, cutlass::bfloat16_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>, cutlass::MatrixShape<1, 1> >, 1, false, bool>, cutlass::MatrixShape<0, 0>, cutlass::MatrixShape<0, 0>, 1>, 4, (cutlass::gemm::SharedMemoryClearOption)0, bool>, cutlass::epilogue::threadblock::Epilogue<cutlass::gemm::GemmShape<128, 128, 32>, cutlass::gemm::warp::MmaTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<16, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCongruous<16, 64>, float, cutlass::layout::RowMajor, cutlass::gemm::warp::MmaTensorOpPolicy<cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 16>, 32, cutlass::bfloat16_t, cutlass::layout::RowMajor, cutlass::bfloat16_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>, cutlass::MatrixShape<1, 1> >, 1, false, bool>, 1, cutlass::epilogue::threadblock::PredicatedTileIterator<cutlass::epilogue::threadblock::OutputTileOptimalThreadMap<cutlass::epilogue::threadblock::OutputTileShape<128, 8, 2, 1, 1>, cutlass::epilogue::threadblock::OutputTileShape<1, 8, 1, 1, 8>, 128, 8, 16>, cutlass::bfloat16_t, false, cutlass::layout::NoPermute, false>, cutlass::epilogue::warp::FragmentIteratorTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, cutlass::gemm::GemmShape<16, 8, 16>, float, cutlass::Array<float, 4, true>, cutlass::layout::RowMajor>, cutlass::epilogue::warp::TileIteratorTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, cutlass::gemm::GemmShape<16, 8, 16>, float, cutlass::layout::RowMajor>, cutlass::epilogue::threadblock::SharedLoadIterator<cutlass::epilogue::threadblock::OutputTileOptimalThreadMap<cutlass::epilogue::threadblock::OutputTileShape<128, 8, 2, 1, 1>, cutlass::epilogue::threadblock::OutputTileShape<1, 8, 1, 1, 8>, 128, 8, 16>::CompactedThreadMap, float, 32>, cutlass::epilogue::thread::LinearCombination<cutlass::bfloat16_t, 8, float, float, (cutlass::epilogue::thread::ScaleType::Kind)0, (cutlass::FloatRoundStyle)2, cutlass::bfloat16_t>, cutlass::MatrixShape<0, 8>, 1, 1>, cutlass::gemm::threadblock::GemmBatchedIdentityThreadblockSwizzle, (cutlass::gemm::kernel::GroupScheduleMode)0, false>::Params)<<<(1,1,1),(128,1,1)>>> ()

Thank you! Would you mind creating an issue on GitHub, ideally with a minimal code snippet to reproduce the issue?
I don’t know if we are failing to set the current device properly inside these CUTLASS kernels or what else might be causing the memory violation.
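In the meantime, a possible workaround sketch, assuming the kernels are indeed launched on the current device rather than on the tensors' device: make the tensors' device the current one around the call. gmm_fn below stands for whatever grouped-GEMM entry point your test uses (e.g. backend.gmm); torch.bmm is only a stand-in in the usage example.

import torch

def run_group(gmm_fn, a, b, device_id):
    # Hypothetical helper: run one grouped-GEMM call with device_id as the
    # current CUDA device; gmm_fn stands for the real grouped-GEMM entry point.
    assert a.device == b.device == torch.device("cuda:%d" % device_id)
    with torch.cuda.device(device_id):   # make the tensors' device current
        out = gmm_fn(a, b)
    torch.cuda.synchronize(device_id)    # surface asynchronous errors right away
    return out

# Usage sketch with torch.bmm as a stand-in for the grouped GEMM:
if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    dev = 1
    a = torch.randn(4, 64, 64, device="cuda:%d" % dev, dtype=torch.bfloat16)
    b = torch.randn(4, 64, 64, device="cuda:%d" % dev, dtype=torch.bfloat16)
    out = run_group(torch.bmm, a, b, dev)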

Thank you! I put the code snippet in the issue: Cuda memory error when testing GroupedGemm on a multi-Gpu environment · Issue #134441 · pytorch/pytorch · GitHub. All three snippets and their original test (in a single-GPU environment) are available at GitHub - NiuMa-1234/Test_of_GroupedGemm_on_multiGpus: This is a test to GroupedGemm on multi-gpu environment which has encountered Cuda error

Hello, I'm using an NVIDIA Jetson Orin Nano and I'm facing compatibility issues between CUDA and PyTorch. I have Ubuntu 22.04 and JetPack 5.1, but when I check the drivers I get:

| NVIDIA-SMI 540.3.0 Driver Version: N/A CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Orin (nvgpu) N/A | N/A N/A | N/A |
| N/A N/A N/A N/A / N/A | Not Supported | N/A N/A |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found                                                             |
+---------------------------------------------------------------------------------------+

Could this be the problem? How do I solve it?

Yes, it seems your system is unable to detect the GPU, so you might want to fix that first before trying to install and use PyTorch.
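As a quick check from the Python side, here is a minimal sketch; it only reports what the installed PyTorch build can see and does not fix the driver:

import torch

# Report what this PyTorch build was compiled against and what it can see.
print("torch version:", torch.__version__)
print("built with CUDA:", torch.version.cuda)      # None for a CPU-only build
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device count:", torch.cuda.device_count())
    print("device 0:", torch.cuda.get_device_name(0))

If cuda available prints False on a Jetson, a common cause is a PyTorch wheel that was not built for JetPack, so verifying that the JetPack-specific wheel from NVIDIA is installed is a reasonable first step.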

Hello, thank you for your message. But how do I fix that? I have already flashed it twice; do I have to flash it again? If yes, what details should I pay attention to? If there is some other way, could you please let me know? I've been struggling for weeks now. Thanks in advance.

Hello, is there a way to fix that?