PyTorch NCCL Flight Recorder

The recently published LLaMa-3 paper (https://arxiv.org/pdf/2407.21783) mentions the use of a PyTorch built-in feature called NCCL Flight Recorder. On exploring PyTorch further, I could not find any feature that records NCCL events.

Questions:

  1. Is this feature open sourced?
  2. If yes, is there any documentation or summary of it?

Thanks for reaching out.
Flight Recorder will be released as a prototype API in the 2.5 release, with documentation.
cc: @c-p-i-o

Hey Jit! Yes, flight recorder is open sourced and available in the PyTorch branch. We just haven’t broadly made any announcements as we’ve been testing and changing some of the underlying implementation.

We’re writing up some documentation over the next few days on how to enable it and how to use the generated data to detect stuck jobs. Please stand by. As @gnadathur mentioned, FR will be marked Prototype in the upcoming 2.5 release.

Hi Chiraj and Gokul,

Thanks for your responses! Looking forward to being able to use the NCCL Flight Recorder.


Could you please tell me which branch the flight recorder is in? Thanks.

The collection portion of Flight Recorder is available in 2.4 and in the main branch.
The tool to analyze FR traces is under /tools/flight_recorder/fr_trace.py in the main branch.
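
For anyone who wants to try the collection side, here is a minimal sketch of switching it on through environment variables. The variable names are the ones I understand the Flight Recorder documentation to use; they are not stated in this thread, so treat them as assumptions and check them against your PyTorch version. The helper name and the default values are purely illustrative.

```python
import os


def flight_recorder_env(dump_prefix="/tmp/nccl_trace_rank_", buffer_size=2000):
    """Return env vars that (as assumed here) enable Flight Recorder collection.

    The variable names are assumptions based on the Flight Recorder docs,
    not on this thread; verify them for your PyTorch version.
    """
    return {
        # Number of collective entries kept in the in-memory ring buffer.
        "TORCH_NCCL_TRACE_BUFFER_SIZE": str(buffer_size),
        # Ask each rank to write its buffer to disk when a collective times out.
        "TORCH_NCCL_DUMP_ON_TIMEOUT": "1",
        # File prefix for the dumps; assumed to get the rank number appended.
        "TORCH_NCCL_DEBUG_INFO_TEMP_FILE": dump_prefix,
    }


if __name__ == "__main__":
    # Set these before torch.distributed.init_process_group() runs in each rank.
    os.environ.update(flight_recorder_env())
```

Setting the same variables in the launcher's shell environment should be equivalent; the point is only that they are in place before the NCCL process group is created.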


Also, I’m finishing a tutorial here. Almost done.


Thanks. It’s very helpful. :ghost:


Hey @jinqinn, the official tutorial is published here. Let us know if you’ve had any success with Flight Recorder.

The tutorial says I can run torchfrtrace directly when using nightly, but it didn’t work for me.

❯ uv run torchfrtrace -h
Traceback (most recent call last):
  File ".venv/bin/torchfrtrace", line 5, in <module>
    from tools.flight_recorder.fr_trace import main
ModuleNotFoundError: No module named 'tools'

Yes, I have the same issue. This is probably because we are using a non-nightly build.

If you install the PyTorch nightly build or build from source with USE_DISTRIBUTED=1, you can use the following command directly:

torchfrtrace <dump dir containing trace files> [-o <output file>]

EDIT: Actually, I still get the same error even in the nightly build Docker image.
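
While the torchfrtrace entry point is not usable, the dump files can still be looked at by hand. The sketch below assumes the dumps are ordinary pickle files, which is my understanding of the tutorial but is not stated in this thread. It makes no assumption about the schema and only reports the top-level keys of each file. The function names are hypothetical.

```python
import pickle
from pathlib import Path


def load_dumps(dump_dir):
    """Load every file in dump_dir as a pickle, keyed by file name.

    Assumes the dumps are pickle files (an assumption, not confirmed in this
    thread). Only unpickle files you trust: pickle can execute code.
    """
    dumps = {}
    for path in sorted(Path(dump_dir).iterdir()):
        if path.is_file():
            with path.open("rb") as f:
                dumps[path.name] = pickle.load(f)
    return dumps


def summarize(dumps):
    """Map each file name to its sorted top-level keys, or its type name."""
    return {
        name: sorted(map(str, data)) if isinstance(data, dict) else type(data).__name__
        for name, data in dumps.items()
    }
```

Calling summarize(load_dumps("<dump dir containing trace files>")) gives a quick view of what each rank wrote, which is enough to confirm that collection worked before the analyzer is sorted out.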

Oops. It seems to be caused by PYTHONPATH not being set correctly for the module.

$ pwd
/home/cpio/local/pytorch
$ export PYTHONPATH=$(pwd):$PYTHONPATH
$ torchfrtrace
Traceback (most recent call last):
  File "/home/cpio/local/b/pytorch-env/bin/torchfrtrace", line 33, in <module>
    sys.exit(load_entry_point('torch', 'console_scripts', 'torchfrtrace')())
  File "/home/cpio/local/pytorch/tools/flight_recorder/fr_trace.py", line 44, in main
    assert args.trace_dir, "Trace directory trace_dir is required"
AssertionError: Trace directory trace_dir is required

Let me see if I can resolve this.
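
Note that the AssertionError above only complains that no trace directory was passed, so with PYTHONPATH pointing at the checkout root the import itself succeeds. Until the packaging is fixed, the same workaround can be wrapped in a small script. This is a sketch under the assumptions that a PyTorch source checkout is available locally and that torchfrtrace takes the arguments shown earlier in this thread. The function names and paths are hypothetical.

```python
import os
import subprocess


def build_fr_trace_call(checkout_root, trace_dir, output=None, base_env=None):
    """Build the command and environment for running torchfrtrace.

    checkout_root is the PyTorch source checkout (the directory that contains
    tools/), prepended to PYTHONPATH so `tools.flight_recorder` can be imported.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("PYTHONPATH")
    env["PYTHONPATH"] = checkout_root + (os.pathsep + existing if existing else "")
    cmd = ["torchfrtrace", trace_dir]
    if output is not None:
        cmd += ["-o", output]
    return cmd, env


def run_fr_trace(checkout_root, trace_dir, output=None):
    """Run torchfrtrace with the adjusted PYTHONPATH and return the result."""
    cmd, env = build_fr_trace_call(checkout_root, trace_dir, output)
    return subprocess.run(cmd, env=env, check=False)
```

Keeping the PYTHONPATH change inside the subprocess environment avoids altering the shell session, so it is easy to drop once the entry point works on its own.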

Yes, I had to point PYTHONPATH to pytorch/tools.

Thanks for the quick test! I’ll figure out the packaging issue so that you don’t have to set PYTHONPATH.
I quickly tried adding an empty __init__.py file in the tools directory but that didn’t seem to help.
