Coordinate_descent_tuning errors out with torch.AcceleratorError: CUDA error: invalid argument

Thanks for your response @ptrblck. I tried debugging with cuda-gdb. Interestingly, it breaks on this error, which doesn’t look like a kernel error but rather a memory allocation error:

warning: Cuda Driver error detected: Failed to allocate physical memory
warning: Cuda Driver error detected: Returning 1 (CUDA_ERROR_INVALID_VALUE) from cuMemHostAlloc
[Switching to Thread 0x7fff323ff6c0 (LWP 27593)]
Cuda Runtime API error detected: cudaHostAlloc returned cudaErrorInvalidValue(CUresult=1): invalid argument

At first blush I thought this might be due to pinned memory of length 0, but even after disabling pinning the error still occurred.

Below is part of the debug output in case it helps (whole output). I have saved the files as well and can post them somewhere if needed.

[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code] def benchmark_compiled_module(times=10, repeat=10):
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     from torch._dynamo.testing import rand_strided
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     from torch._inductor.utils import print_performance
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     primals_3 = rand_strided((60, 1024), (1024, 1), device='cuda:0', dtype=torch.int64)
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     view = rand_strided((61440, 768), (768, 1), device='cuda:0', dtype=torch.bfloat16)
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     mm_default_2 = rand_strided((61440, 50264), (50304, 1), device='cuda:0', dtype=torch.bfloat16)
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     amax = rand_strided((61440, 1), (1, 1), device='cuda:0', dtype=torch.float32)
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     log = rand_strided((61440, 1), (1, 1), device='cuda:0', dtype=torch.float32)
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     convert_element_type_7 = rand_strided((), (), device='cuda:0', dtype=torch.float32)
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     permute_3 = rand_strided((50257, 768), (768, 1), device='cuda:0', dtype=torch.bfloat16)
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     tangents_1 = rand_strided((), (), device='cuda:0', dtype=torch.float32)
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     fn = lambda: call([primals_3, view, mm_default_2, amax, log, convert_element_type_7, permute_3, tangents_1])
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     return print_performance(fn, times=times, repeat=repeat)
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code] if __name__ == "__main__":
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     from torch._inductor.wrapper_benchmark import compiled_module_main
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]     compiled_module_main('None', benchmark_compiled_module)
[rank0]:V0911 10:18:41.998000 27347 torch/_inductor/codecache.py:1188] [0/0] [__output_code]
[rank0]:V0911 10:18:42.005000 27347 torch/_inductor/codecache.py:1189] [0/0] [__output_code] Output code written to: /tmp/torchinductor_root/fl/cflma6p5e72qss2pb4zms4ed2cfcttdlkmklhjhb7yj5gouhmcrl.py
[rank0]:W0911 10:18:42.013000 27347 torch/_inductor/debug.py:449] [0/0] model__13_backward_42 debug trace: /data/shitals/devbox/GitHubSrc/nanugpt/torch_compile_debug/run_2025_09_11_10_18_04_782379-pid_27347/torchinductor/model__13_backward_42.14
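
To put the shapes in that log in perspective, here is a small back-of-the-envelope sketch (plain Python, no torch needed) that computes how many elements and bytes each strided input needs. The shapes, strides and dtypes are copied from the `rand_strided` calls above; the helper names (`storage_elems`, `report`) and the int32 check are my own additions for illustration. By this arithmetic `mm_default_2` comes out at roughly 5.8 GiB and more than 2^31 elements. Whether that has anything to do with the `cudaHostAlloc` failure is only a guess on my part.

```python
# Bytes per element for the dtypes that appear in the log.
BYTES_PER_ELEM = {"bfloat16": 2, "float32": 4, "int64": 8}
INT32_MAX = 2**31 - 1


def storage_elems(shape, stride):
    """Elements needed to back a strided tensor: 1 + sum((size - 1) * stride)."""
    if any(size == 0 for size in shape):
        return 0
    return 1 + sum((size - 1) * st for size, st in zip(shape, stride))


# (shape, stride, dtype) copied from the rand_strided calls in the log.
inputs = {
    "primals_3": ((60, 1024), (1024, 1), "int64"),
    "view": ((61440, 768), (768, 1), "bfloat16"),
    "mm_default_2": ((61440, 50264), (50304, 1), "bfloat16"),
    "permute_3": ((50257, 768), (768, 1), "bfloat16"),
}


def report(tensors):
    """Map name -> (elements, bytes, exceeds_int32_indexing)."""
    rows = {}
    for name, (shape, stride, dtype) in tensors.items():
        n = storage_elems(shape, stride)
        rows[name] = (n, n * BYTES_PER_ELEM[dtype], n > INT32_MAX)
    return rows


if __name__ == "__main__":
    for name, (n, nbytes, big) in report(inputs).items():
        print(f"{name:14s} elems={n:>13,d} GiB={nbytes / 2**30:6.2f} over_int32={big}")
```

The scalar and `(61440, 1)` float32 inputs are left out since they are tiny by comparison.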

One other thing I tried was to set this:

export TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1

The hope was that a bad autotune config would die in a separate process and not crash the main process, but this didn’t work. Any workaround you can think of would be great!
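
For anyone reproducing this from Python rather than the shell, below is a minimal sketch of how the same flag can be set. It assumes, as far as I understand, that Inductor reads these environment variables when its config module is first imported, so they must be set before `import torch`. The helper name `inductor_env` is hypothetical, and the second variable is only there to A/B test with coordinate descent tuning switched off; it will not override a value set explicitly through `torch._inductor.config` or `torch.compile` options.

```python
import os


def inductor_env(autotune_in_subproc=True, coordinate_descent=False):
    """Build the Inductor env vars as strings ("1" enables, anything else disables)."""
    return {
        "TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC": "1" if autotune_in_subproc else "0",
        "TORCHINDUCTOR_COORDINATE_DESCENT_TUNING": "1" if coordinate_descent else "0",
    }


# Apply before importing torch so the values are seen at config import time.
os.environ.update(inductor_env())

# import torch  # import only after the environment is set
```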