PyTorch NCCL Flight Recorder

The recently published LLaMa-3 paper (https://arxiv.org/pdf/2407.21783) mentions the use of a PyTorch built-in feature called NCCL Flight Recorder. On exploring PyTorch further, I did not find any such feature that can record NCCL events.

Questions:

  1. Is this feature open sourced?
  2. If yes, is there any documentation or summary of it?

Thanks for reaching out.
Flight Recorder will be released as a prototype API in the 2.5 release, with documentation.
cc: @c-p-i-o

Hey Jit! Yes, Flight Recorder is open sourced and available in the PyTorch branch. We just haven’t made any broad announcements yet, as we’ve been testing and changing some of the underlying implementation.

We’re writing up some documentation over the next few days on how to enable it and how to use the generated data to detect stuck jobs. Please stand by. As @gnadathur mentioned, FR will be marked Prototype in the upcoming 2.5 release.

Hi Chiraj and Gokul,

Thanks for your responses! Looking forward to being able to use the NCCL Flight Recorder.


Could you please tell me which branch Flight Recorder is in? Thanks.

The collection portion of Flight Recorder has been available in 2.4 and is in the main branch.
The tool to analyze FR traces is at /tools/flight_recorder/fr_trace.py in the main branch.
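
For anyone who wants to try the collection side before the docs land, here is a minimal sketch of turning it on from Python. It assumes collection is controlled by the environment variables TORCH_NCCL_TRACE_BUFFER_SIZE, TORCH_NCCL_DUMP_ON_TIMEOUT and TORCH_NCCL_DEBUG_INFO_TEMP_FILE, which are not named anywhere in this thread, so please verify them against your PyTorch version. The helper names and default values are made up for illustration.

```python
import os


def flight_recorder_env(buffer_size=2000, dump_on_timeout=True,
                        dump_prefix="/tmp/nccl_trace_rank_"):
    """Build the environment settings assumed to enable Flight Recorder.

    The variable names are assumptions (not confirmed in this thread);
    the helper itself is hypothetical and only assembles a dict.
    """
    if buffer_size <= 0:
        # A zero-sized buffer is assumed to leave collection disabled.
        raise ValueError("buffer_size must be positive to record events")
    return {
        # Number of collective events kept in the in-memory ring buffer.
        "TORCH_NCCL_TRACE_BUFFER_SIZE": str(buffer_size),
        # Write the buffer to disk when a collective times out.
        "TORCH_NCCL_DUMP_ON_TIMEOUT": "true" if dump_on_timeout else "false",
        # File prefix for the dumps; the rank is assumed to be appended.
        "TORCH_NCCL_DEBUG_INFO_TEMP_FILE": dump_prefix,
    }


def enable_flight_recorder(**kwargs):
    """Apply the settings to this process and return them."""
    settings = flight_recorder_env(**kwargs)
    os.environ.update(settings)
    return settings


if __name__ == "__main__":
    # Call this before creating the NCCL process group on each rank.
    for name, value in enable_flight_recorder().items():
        print(f"{name}={value}")
```

The settings have to be in place before the process group is initialized on every rank, so exporting them in the launch script works just as well. The per-rank dump files are then what the fr_trace.py script mentioned above is meant to analyze; check the script itself for the arguments it expects.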


Also, I’m finishing a tutorial here. Almost done.


Thanks. It’s very helpful. :ghost:

Thanks. It’s very helpful.


Hey @jinqinn, the official tutorial is published here. Let us know if you’ve had any success with Flight Recorder.