If you want to use torch.compile with CUDA graphs, the preferred way is probably to pass mode="reduce-overhead", which should enable CUDA graphs according to the argument documentation:
mode (str): Can be either "default", "reduce-overhead" or "max-autotune"
- "default" is the default mode, which is a good balance between performance and overhead
- "reduce-overhead" is a mode that reduces the overhead of python with CUDA graphs, useful for small batches
- "max-autotune" is a mode that leverages Triton based matrix multiplications and convolutions
- To see the exact configs that each mode sets you can call `torch._inductor.list_mode_options()`
I believe torch.compile should be compatible with DDP, and also with FSDP, since some issues were recently addressed, e.g. pytorch/pytorch issue #98494 ("[FSDP] `use_orig_params=True` with CPU offloading and Gradient Accumulation: RuntimeError").