How to use `torch.compile` with CUDA graphs when using gradient activation checkpointing

eqy · May 9, 2023, 5:10am

If you want to use `torch.compile` with CUDA graphs, the preferred way is probably the option `mode="reduce-overhead"`, which should enable CUDA graphs according to the argument documentation:

        mode (str): Can be either "default", "reduce-overhead" or "max-autotune"
        - "default" is the default mode, which is a good balance between performance and overhead
        - "reduce-overhead" is a mode that reduces the overhead of python with CUDA graphs, useful for small batches
        - "max-autotune" is a mode that leverages Triton based matrix multiplications and convolutions
        - To see the exact configs that each mode sets you can call `torch._inductor.list_mode_options()`

I believe `torch.compile` should be compatible with DDP, and also with FSDP now that some issues were recently addressed, e.g., pytorch/pytorch issue #98494, "[FSDP] `use_orig_params=True` with CPU offloading and Gradient Accumulation: RuntimeError".