torch.compile when /home is a read-only filesystem

I’m trying to compile a model with torch.compile on a system where the /home filesystem is read-only. I get errors like this:

  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 2471, in wait
    scope[key] = result.result()
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 2314, in result
    self.future.result()
  File "/work/y07/shared/cirrus-software/miniconda3/22.11.1-1-py310-gpu/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/work/y07/shared/cirrus-software/miniconda3/22.11.1-1-py310-gpu/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
OSError: [Errno 30] Read-only file system: '/home/z04'

Is there a way to point the compilation at an alternative filesystem, e.g. by setting a TMPDIR environment variable or similar?

You could try setting TRITON_CACHE_DIR to another path to check whether it’s the OpenAI/Triton cache that is trying to write to your default cache directory in /home.
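For example, the override can be applied at the very top of the training script, before torch is imported (the scratch path below is hypothetical; substitute a directory you can write to):

```python
import os

# Hypothetical writable scratch path -- substitute a directory you own.
cache_dir = "/tmp/scratch/triton-cache"
os.makedirs(cache_dir, exist_ok=True)

# Must be in place before torch/triton first build their caches,
# so set it before `import torch` (or export it in the job script).
os.environ["TRITON_CACHE_DIR"] = cache_dir
```

Exporting `TRITON_CACHE_DIR` in the batch script before launching Python has the same effect.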


Good shout; unfortunately it doesn’t fix the issue. It seems to be a problem somewhere in the Inductor code cache, i.e.:

  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 2471, in wait
    scope[key] = result.result()
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 2314, in result
    self.future.result()
  File "/work/y07/shared/cirrus-software/miniconda3/22.11.1-1-py310-gpu/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/work/y07/shared/cirrus-software/miniconda3/22.11.1-1-py310-gpu/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
OSError: [Errno 30] Read-only file system: '/home/z04'

OK, try TORCHINDUCTOR_CACHE_DIR next 🙂


Unfortunately, still the same behaviour.

I don’t know which part of Inductor might still want to write to your /home folder, but @marksaroufim might know.

Which version of PyTorch are you using? Does this error persist on nightlies?

That stack trace is also unhelpful because Inductor runs compilation asynchronously.

You can run your repro with TORCHINDUCTOR_COMPILE_THREADS=1 to force compilation to be synchronous, which should give us a better stack trace so we can see where Inductor is touching the home dir.
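As a sketch, the override can be set from Python as well as from the shell, as long as it happens before torch is imported:

```python
import os

# Disable Inductor's async compile pool so the failing filesystem call
# raises in the main thread with a full, readable stack trace.
os.environ["TORCHINDUCTOR_COMPILE_THREADS"] = "1"

# ...then `import torch` and run the repro as usual.
```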

The version I’m seeing this error on is 2.2.0+cu118; I haven’t tried anything newer yet, but I will.

Running with a single thread I get this error:

Traceback (most recent call last):
  File "/mnt/lustre/e1000/home/z04/z04/adrianj/MLatScale/Practicals/Practical7-OptimisingPipeline/model_working.py", line 225, in <module>
    aun_lh, aun_th, aun_vh = train_model("UNet", model, train_dataloader, val_dataloader, DiceLoss(), opt, False, num_epochs)
  File "/mnt/lustre/e1000/home/z04/z04/adrianj/MLatScale/Practicals/Practical7-OptimisingPipeline/model_working.py", line 189, in train_model
    outputs = model(data)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/lustre/e1000/home/z04/z04/adrianj/MLatScale/Practicals/Practical7-OptimisingPipeline/model_working.py", line 72, in forward
    enc_ftrs = self.encoder(x)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/lustre/e1000/home/z04/z04/adrianj/MLatScale/Practicals/Practical7-OptimisingPipeline/model_working.py", line 35, in forward
    x = block.to(self.device)(x)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 655, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 727, in _convert_frame
    result = inner_convert(frame, cache_entry, hooks, frame_state)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 383, in _convert_frame_assert
    compiled_product = _compile(
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 646, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 244, in time_wrapper
    r = func(*args, **kwargs)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 562, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
    transformations(instructions, code_options)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 151, in _fn
    return fn(*args, **kwargs)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 527, in transform
    tracer.run()
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2128, in run
    super().run()
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 818, in run
    and self.step()
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 781, in step
    getattr(self, inst.opname)(inst)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2243, in RETURN_VALUE
    self.output.compile_subgraph(
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 919, in compile_subgraph
    self.compile_and_call_fx_graph(tx, list(reversed(stack_values)), root)
  File "/work/y07/shared/cirrus-software/miniconda3/22.11.1-1-py310-gpu/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1087, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 244, in time_wrapper
    r = func(*args, **kwargs)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1159, in call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1140, in call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/__init__.py", line 1662, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1168, in compile_fx
    return aot_autograd(
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/backends/common.py", line 55, in compiler_fn
    cg = aot_module_simplified(gm, example_inputs, **kwargs)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 887, in aot_module_simplified
    compiled_fn = create_aot_dispatcher_function(
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 244, in time_wrapper
    r = func(*args, **kwargs)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 600, in create_aot_dispatcher_function
    compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 425, in aot_wrapper_dedupe
    return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 630, in aot_wrapper_synthetic_base
    return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 295, in aot_dispatch_autograd
    compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 244, in time_wrapper
    r = func(*args, **kwargs)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1100, in fw_compiler_base
    return inner_compile(
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/repro/after_aot.py", line 83, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_inductor/debug.py", line 305, in inner
    return fn(*args, **kwargs)
  File "/work/y07/shared/cirrus-software/miniconda3/22.11.1-1-py310-gpu/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 320, in compile_fx_inner
    compiled_graph = fx_codegen_and_compile(
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 550, in fx_codegen_and_compile
    compiled_fn = graph.compile_to_fn()
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_inductor/graph.py", line 1116, in compile_to_fn
    return self.compile_to_module().call
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 244, in time_wrapper
    r = func(*args, **kwargs)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_inductor/graph.py", line 1070, in compile_to_module
    mod = PyCodeCache.load_by_key_path(
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 1892, in load_by_key_path
    exec(code, mod.__dict__, mod.__dict__)
  File "/work/z04/z04/adrianj/cache/aj/cajbxwgdt5jqsxcbzeyzsrdj3bxafh7l2bc7wtmp4sknst7jqzyq.py", line 29, in <module>
    triton_poi_fused_convolution_relu_0 = async_compile.triton('triton_', '''
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 2438, in triton
    return _load_kernel(kernel_name, source_code)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 2291, in _load_kernel
    kernel.precompile()
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_inductor/triton_heuristics.py", line 188, in precompile
    compiled_binary, launcher = self._precompile_config(
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/torch/_inductor/triton_heuristics.py", line 304, in _precompile_config
    binary = triton.compile(
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/triton/compiler/compiler.py", line 489, in compile
    fn_dump_manager = get_dump_manager(
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/triton/runtime/cache.py", line 160, in get_dump_manager
    return __cache_cls(key, dump=True)
  File "/work/z04/z04/adrianj/python-new/lib/python3.10/site-packages/triton/runtime/cache.py", line 56, in __init__
    os.makedirs(self.cache_dir, exist_ok=True)
  File "/work/y07/shared/cirrus-software/miniconda3/22.11.1-1-py310-gpu/lib/python3.10/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/work/y07/shared/cirrus-software/miniconda3/22.11.1-1-py310-gpu/lib/python3.10/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/work/y07/shared/cirrus-software/miniconda3/22.11.1-1-py310-gpu/lib/python3.10/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  [Previous line repeated 2 more times]
  File "/work/y07/shared/cirrus-software/miniconda3/22.11.1-1-py310-gpu/lib/python3.10/os.py", line 225, in makedirs
    mkdir(name, mode)
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
OSError: [Errno 30] Read-only file system: '/home/z04'

This does indeed seem like it would be fixed by setting TRITON_CACHE_DIR; the failure is happening inside Triton, here: triton/python/triton/runtime/cache.py at main · openai/triton · GitHub

In Triton, it seems to be coming from this line (dump=True), and it happens because TRITON_CACHE_DIR is not used to override the dump_dir, whose default value is defined in

I guess a simple workaround is to manually edit Triton’s source files in your environment until the Triton team fixes this.
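If editing the installed files isn’t an option, another blunt workaround (a sketch, not verified against every Triton version) is to repoint $HOME at a writable scratch area before torch is imported: on Linux, Path.home() and os.path.expanduser("~") resolve via $HOME, and Triton builds its default ~/.triton cache and dump directories from that.

```python
import os
import pathlib

# Hypothetical writable scratch path -- substitute your own.
scratch_home = "/tmp/scratch-home"
os.makedirs(scratch_home, exist_ok=True)

# On Linux, Path.home() / os.path.expanduser("~") resolve via $HOME,
# which is what Triton uses to build its default ~/.triton directories.
# This must run before torch/triton are imported.
os.environ["HOME"] = scratch_home

print(pathlib.Path.home())  # -> /tmp/scratch-home
```

Note this redirects anything else that resolves the home directory in the same process, so it is best confined to the batch job that runs the training script.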