A series of questions about torch.load and tensor.to(device) during grouped_gemm on multiple GPUs

I built several tensor groups to test grouped_gemm. The program runs fine on a single GPU, but when I switch to a multi-GPU environment and move the tensors to different devices at different times, the program fails, and the error differs each time.

The 1st case:

def randn(bs, x, y, device_id):
    out = (torch.rand(bs, x, y) - 0.5 * 2) / (y * x)
    device = torch.device("cuda:%d" % device_id if torch.cuda.is_available() else "cpu")
    return out.to(device).to(torch.bfloat16)

a = randn(bs, x, y, device_id)
b = randn(bs, x, y, device_id)

Yes, I move each tensor to its specific device when I build it, and the errors are:
group_1: RuntimeError: Failed to run CUTLASS Grouped GEMM
group_2…n:
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

The 2nd case:

def randn(bs, x, y):
    out = (torch.rand(bs, x, y) - 0.5 * 2) / (y * x)
    return out.cuda().to(torch.bfloat16)

a = a.to(torch.device('cuda:%d' % device_id))
b = b.to(torch.device('cuda:%d' % device_id))

Here I build the tensors on the default GPU and then move them to specific GPUs, and the program runs fine!
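For clarity, here is a tiny sketch (assuming nothing has changed the current device) of where the intermediate tensor lives in this pattern before the explicit copy:

import torch

if torch.cuda.is_available():
    t = torch.rand(2, 2).cuda()    # .cuda() with no argument uses the current (default) device
    print(t.device)                # cuda:0 unless torch.cuda.set_device was called earlier
    if torch.cuda.device_count() > 1:
        t = t.to(torch.device("cuda:1"))
        print(t.device)            # cuda:1 after the explicit copy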

The 3rd case:

a = torch.load(a_path, map_location=torch.device('cuda:%d' % device_id))
b = torch.load(b_path, map_location=torch.device('cuda:%d' % device_id))

When device_id is 0 it runs fine, but for any other device it fails in the subsequent computation at backend.gmm: RuntimeError: Failed to run CUTLASS Grouped GEMM

*And even if I fix device_id at 0 permanently, the next tensor group hits the same error.

Why would this happen? It seems so weird.
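For reference, here is a self-contained sketch of how the first case is driven across GPUs; the shapes are made up and the grouped-GEMM call is replaced by a stand-in torch.bmm, since the real backend.gmm call and its arguments are not shown here:

import torch

def randn(bs, x, y, device_id):
    out = (torch.rand(bs, x, y) - 0.5 * 2) / (y * x)
    device = torch.device("cuda:%d" % device_id if torch.cuda.is_available() else "cpu")
    return out.to(device).to(torch.bfloat16)

bs, x, y = 4, 128, 128  # placeholder shapes, not the real test sizes

for device_id in range(torch.cuda.device_count()):
    a = randn(bs, x, y, device_id)
    b = randn(bs, x, y, device_id)
    # Sanity check: both operands should live on the intended device.
    assert a.device == b.device == torch.device("cuda:%d" % device_id)
    # The real test calls the grouped_gemm backend here (backend.gmm);
    # torch.bmm is used only as a stand-in so the sketch runs on its own.
    out = torch.bmm(a, b)
    torch.cuda.synchronize(device_id)
    print("group on cuda:%d finished" % device_id)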

Could you run your first code snippet via cuda-gdb and post the backtrace of the illegal memory access here?

Thank you for your reply! Below is the output of running the 1st test under cuda-gdb:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

[New Thread 0x7fff60ac4700 (LWP 1001)]
[New Thread 0x7fff602c3700 (LWP 1002)]
[New Thread 0x7fff5bac2700 (LWP 1003)]
[New Thread 0x7fff55375700 (LWP 1004)]
[Thread 0x7fff5bac2700 (LWP 1003) exited]
[Thread 0x7fff602c3700 (LWP 1002) exited]
[Thread 0x7fff60ac4700 (LWP 1001) exited]
[Detaching after fork from child process 1005]
[New Thread 0x7fff5bac2700 (LWP 1012)]
[New Thread 0x7fff602c3700 (LWP 1013)]
Tensor created successfully on GPU 0
[New Thread 0x7fff60ac4700 (LWP 1015)]
Tensor created successfully on GPU 1
[New Thread 0x7fff48af1700 (LWP 1016)]
Tensor created successfully on GPU 2
[New Thread 0x7fff41dff700 (LWP 1017)]
Tensor created successfully on GPU 3
[New Thread 0x7fff4115d700 (LWP 1018)]
Tensor created successfully on GPU 4
[New Thread 0x7fff2dfff700 (LWP 1019)]
Tensor created successfully on GPU 5
[New Thread 0x7fff2d7fe700 (LWP 1020)]
Tensor created successfully on GPU 6
[New Thread 0x7fff2cb5c700 (LWP 1021)]
Tensor created successfully on GPU 7
GPU 3 memory summary:
[PyTorch CUDA memory summary, device ID 3 (CUDA OOMs: 0). The summary tabulates allocated, active, requested, GPU reserved, and non-releasable memory, allocation and active-allocation counts, reserved segments, non-releasable allocs, and oversize allocations/segments, each split into large and small pools; the numeric columns are not preserved in this paste.]

a.device: cuda:3

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x7fff27455280

Thread 1 "python" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 2, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 1, lane 0]
0x00007fff27455310 in void cutlass::Kernel<cutlass::gemm::kernel::GemmGrouped<cutlass::gemm::threadblock::MmaMultistage<cutlass::gemm::GemmShape<128, 128, 32>, cutlass::transform::threadblock::PredicatedTileAccessIterator<cutlass::MatrixShape<128, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajor, 1, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<32, 128>, 128, cutlass::PitchLinearShape<4, 8>, 8>, cutlass::Array<cutlass::bfloat16_t, 8, false>, false, cutlass::layout::NoPermute>, cutlass::transform::threadblock::RegularTileAccessIterator<cutlass::MatrixShape<128, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<16, 32>, 0, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<32, 128>, 128, cutlass::PitchLinearShape<4, 8>, 8>, 16>, (cutlass::arch::CacheOperation::Kind)1, cutlass::transform::threadblock::PredicatedTileAccessIterator<cutlass::MatrixShape<32, 128>, cutlass::bfloat16_t, cutlass::layout::RowMajor, 0, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<128, 32>, 128, cutlass::PitchLinearShape<8, 4>, 8>, cutlass::Array<cutlass::bfloat16_t, 8, false>, false, cutlass::layout::NoPermute>, cutlass::transform::threadblock::RegularTileAccessIterator<cutlass::MatrixShape<32, 128>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCongruous<16, 64>, 0, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<128, 32>, 128, cutlass::PitchLinearShape<8, 4>, 8>, 16>, (cutlass::arch::CacheOperation::Kind)1, float, cutlass::layout::RowMajor, cutlass::gemm::threadblock::MmaPolicy<cutlass::gemm::warp::MmaTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<16, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCongruous<16, 64>, float, cutlass::layout::RowMajor, cutlass::gemm::warp::MmaTensorOpPolicy<cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 16>, 32, cutlass::bfloat16_t, cutlass::layout::RowMajor, cutlass::bfloat16_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>, cutlass::MatrixShape<1, 1> >, 1, false, bool>, cutlass::MatrixShape<0, 0>, cutlass::MatrixShape<0, 0>, 1>, 4, (cutlass::gemm::SharedMemoryClearOption)0, bool>, cutlass::epilogue::threadblock::Epilogue<cutlass::gemm::GemmShape<128, 128, 32>, cutlass::gemm::warp::MmaTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<16, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCongruous<16, 64>, float, cutlass::layout::RowMajor, cutlass::gemm::warp::MmaTensorOpPolicy<cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 16>, 32, cutlass::bfloat16_t, cutlass::layout::RowMajor, cutlass::bfloat16_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>, cutlass::MatrixShape<1, 1> >, 1, false, bool>, 1, cutlass::epilogue::threadblock::PredicatedTileIterator<cutlass::epilogue::threadblock::OutputTileOptimalThreadMap<cutlass::epilogue::threadblock::OutputTileShape<128, 8, 2, 1, 1>, cutlass::epilogue::threadblock::OutputTileShape<1, 8, 1, 1, 8>, 128, 8, 16>, cutlass::bfloat16_t, false, cutlass::layout::NoPermute, false>, cutlass::epilogue::warp::FragmentIteratorTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, cutlass::gemm::GemmShape<16, 8, 16>, float, cutlass::Array<float, 4, true>, cutlass::layout::RowMajor>, cutlass::epilogue::warp::TileIteratorTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, cutlass::gemm::GemmShape<16, 8, 16>, float, cutlass::layout::RowMajor>, cutlass::epilogue::threadblock::SharedLoadIterator<cutlass::epilogue::threadblock::OutputTileOptimalThreadMap<cutlass::epilogue::threadblock::OutputTileShape<128, 8, 2, 1, 1>, cutlass::epilogue::threadblock::OutputTileShape<1, 8, 1, 1, 8>, 128, 8, 16>::CompactedThreadMap, float, 32>, cutlass::epilogue::thread::LinearCombination<cutlass::bfloat16_t, 8, float, float, (cutlass::epilogue::thread::ScaleType::Kind)0, (cutlass::FloatRoundStyle)2, cutlass::bfloat16_t>, cutlass::MatrixShape<0, 8>, 1, 1>, cutlass::gemm::threadblock::GemmBatchedIdentityThreadblockSwizzle, (cutlass::gemm::kernel::GroupScheduleMode)0, false> >(cutlass::gemm::kernel::GemmGrouped<cutlass::gemm::threadblock::MmaMultistage<cutlass::gemm::GemmShape<128, 128, 32>, cutlass::transform::threadblock::PredicatedTileAccessIterator<cutlass::MatrixShape<128, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajor, 1, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<32, 128>, 128, cutlass::PitchLinearShape<4, 8>, 8>, cutlass::Array<cutlass::bfloat16_t, 8, false>, false, cutlass::layout::NoPermute>, cutlass::transform::threadblock::RegularTileAccessIterator<cutlass::MatrixShape<128, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<16, 32>, 0, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<32, 128>, 128, cutlass::PitchLinearShape<4, 8>, 8>, 16>, (cutlass::arch::CacheOperation::Kind)1, cutlass::transform::threadblock::PredicatedTileAccessIterator<cutlass::MatrixShape<32, 128>, cutlass::bfloat16_t, cutlass::layout::RowMajor, 0, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<128, 32>, 128, cutlass::PitchLinearShape<8, 4>, 8>, cutlass::Array<cutlass::bfloat16_t, 8, false>, false, cutlass::layout::NoPermute>, cutlass::transform::threadblock::RegularTileAccessIterator<cutlass::MatrixShape<32, 128>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCongruous<16, 64>, 0, cutlass::transform::PitchLinearWarpRakedThreadMap<cutlass::PitchLinearShape<128, 32>, 128, cutlass::PitchLinearShape<8, 4>, 8>, 16>, (cutlass::arch::CacheOperation::Kind)1, float, cutlass::layout::RowMajor, cutlass::gemm::threadblock::MmaPolicy<cutlass::gemm::warp::MmaTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<16, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCongruous<16, 64>, float, cutlass::layout::RowMajor, cutlass::gemm::warp::MmaTensorOpPolicy<cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 16>, 32, cutlass::bfloat16_t, cutlass::layout::RowMajor, cutlass::bfloat16_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>, cutlass::MatrixShape<1, 1> >, 1, false, bool>, cutlass::MatrixShape<0, 0>, cutlass::MatrixShape<0, 0>, 1>, 4, (cutlass::gemm::SharedMemoryClearOption)0, bool>, cutlass::epilogue::threadblock::Epilogue<cutlass::gemm::GemmShape<128, 128, 32>, cutlass::gemm::warp::MmaTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise<16, 32>, cutlass::bfloat16_t, cutlass::layout::RowMajorTensorOpMultiplicandCongruous<16, 64>, float, cutlass::layout::RowMajor, cutlass::gemm::warp::MmaTensorOpPolicy<cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 16>, 32, cutlass::bfloat16_t, cutlass::layout::RowMajor, cutlass::bfloat16_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>, cutlass::MatrixShape<1, 1> >, 1, false, bool>, 1, cutlass::epilogue::threadblock::PredicatedTileIterator<cutlass::epilogue::threadblock::OutputTileOptimalThreadMap<cutlass::epilogue::threadblock::OutputTileShape<128, 8, 2, 1, 1>, cutlass::epilogue::threadblock::OutputTileShape<1, 8, 1, 1, 8>, 128, 8, 16>, cutlass::bfloat16_t, false, cutlass::layout::NoPermute, false>, cutlass::epilogue::warp::FragmentIteratorTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, cutlass::gemm::GemmShape<16, 8, 16>, float, cutlass::Array<float, 4, true>, cutlass::layout::RowMajor>, cutlass::epilogue::warp::TileIteratorTensorOp<cutlass::gemm::GemmShape<64, 64, 32>, cutlass::gemm::GemmShape<16, 8, 16>, float, cutlass::layout::RowMajor>, cutlass::epilogue::threadblock::SharedLoadIterator<cutlass::epilogue::threadblock::OutputTileOptimalThreadMap<cutlass::epilogue::threadblock::OutputTileShape<128, 8, 2, 1, 1>, cutlass::epilogue::threadblock::OutputTileShape<1, 8, 1, 1, 8>, 128, 8, 16>::CompactedThreadMap, float, 32>, cutlass::epilogue::thread::LinearCombination<cutlass::bfloat16_t, 8, float, float, (cutlass::epilogue::thread::ScaleType::Kind)0, (cutlass::FloatRoundStyle)2, cutlass::bfloat16_t>, cutlass::MatrixShape<0, 8>, 1, 1>, cutlass::gemm::threadblock::GemmBatchedIdentityThreadblockSwizzle, (cutlass::gemm::kernel::GroupScheduleMode)0, false>::Params)<<<(1,1,1),(128,1,1)>>> ()

Thank you! Would you mind creating an issue on GitHub, ideally with a minimal code snippet to reproduce the issue?
I don’t know if we are failing to set the current device properly inside these CUTLASS kernels or what else might be causing the memory violation.
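In the meantime, a possible workaround sketch, assuming the kernels are indeed launched on the current device rather than on the tensors' device: make the tensors' device the current one around the call. gmm_fn below stands for whatever grouped-GEMM entry point your test uses (e.g. backend.gmm); torch.bmm is only a stand-in in the usage example.

import torch

def run_group(gmm_fn, a, b, device_id):
    # Hypothetical helper: run one grouped-GEMM call with device_id as the
    # current CUDA device; gmm_fn stands for the real grouped-GEMM entry point.
    assert a.device == b.device == torch.device("cuda:%d" % device_id)
    with torch.cuda.device(device_id):   # make the tensors' device current
        out = gmm_fn(a, b)
    torch.cuda.synchronize(device_id)    # surface asynchronous errors right away
    return out

# Usage sketch with torch.bmm as a stand-in for the grouped GEMM:
if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    dev = 1
    a = torch.randn(4, 64, 64, device="cuda:%d" % dev, dtype=torch.bfloat16)
    b = torch.randn(4, 64, 64, device="cuda:%d" % dev, dtype=torch.bfloat16)
    out = run_group(torch.bmm, a, b, dev)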

Thank you! I put the code snippet in the issue: Cuda memory error when testing GroupedGemm on a multi-Gpu environment · Issue #134441 · pytorch/pytorch · GitHub. All three snippets and their original test (in a single-GPU environment) are available at GitHub - NiuMa-1234/Test_of_GroupedGemm_on_multiGpus: This is a test to GroupedGemm on multi-gpu environment which has encountered Cuda error

Hello, I'm using an NVIDIA Jetson Orin Nano and I'm facing compatibility issues between CUDA and PyTorch. I have Ubuntu 22.04 and JetPack 5.1, but when I check the drivers I get:

| NVIDIA-SMI 540.3.0 Driver Version: N/A CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Orin (nvgpu) N/A | N/A N/A | N/A |
| N/A N/A N/A N/A / N/A | Not Supported | N/A N/A |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found                                                             |
+---------------------------------------------------------------------------------------+

Could this be the problem? How do I solve it?

Yes, it seems your system is unable to detect the GPU, so you might want to fix that first before trying to install and use PyTorch.
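As a quick check from the Python side, here is a minimal sketch; it only reports what the installed PyTorch build can see and does not fix the driver:

import torch

# Report what this PyTorch build was compiled against and what it can see.
print("torch version:", torch.__version__)
print("built with CUDA:", torch.version.cuda)      # None for a CPU-only build
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device count:", torch.cuda.device_count())
    print("device 0:", torch.cuda.get_device_name(0))

If cuda available prints False on a Jetson, a common cause is a PyTorch wheel that was not built for JetPack, so verifying that the JetPack-specific wheel from NVIDIA is installed is a reasonable first step.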

Hello, thank you for your message. But how do I fix that? I have already flashed it twice; do I have to flash it again? If yes, what details should I pay attention to? If there is some other way, could you please let me know? I've been struggling for weeks now. Thanks in advance.

Hello, is there a way to fix that?