The recently published LLaMa-3 paper (https://arxiv.org/pdf/2407.21783) mentions the use of a PyTorch built-in feature called NCCL Flight Recorder. After exploring PyTorch further, I could not find any such feature that records NCCL events.
Question:
Is this feature open sourced?
If yes, is there any documentation or summary of it?
Hey Jit! Yes, Flight Recorder (FR) is open sourced and available in the PyTorch main branch. We just haven't made any broad announcements yet, as we've been testing and changing some of the underlying implementation.
We're writing up documentation over the next few days on how to enable it and how to use the generated data to detect stuck jobs. Please stand by. As @gnadathur mentioned, FR will be marked Prototype in the upcoming 2.5 release.
The collection portion of Flight Recorder has been available since 2.4 and is also in the main branch.
The tool to analyze FR traces is under /tools/flight_recorder/fr_trace.py in the main branch.
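Until the official write-up lands, here is a minimal sketch of how the collection side might be switched on. It assumes the environment variable names `TORCH_NCCL_TRACE_BUFFER_SIZE`, `TORCH_NCCL_DUMP_ON_TIMEOUT`, and `TORCH_NCCL_DEBUG_INFO_TEMP_FILE` as I understand them from the 2.4/2.5-era PyTorch; none of them are stated in this thread, and they may be renamed while the feature is still Prototype. The helper `flight_recorder_env` and its default values are hypothetical, for illustration only.

```python
import os


def flight_recorder_env(buffer_size=2000, dump_prefix="/tmp/nccl_trace_rank_"):
    """Build the environment settings assumed to enable Flight Recorder.

    The variable names are assumptions (see the note above), not something
    confirmed in this thread. `buffer_size` is the number of collective
    entries kept in the in-memory ring buffer; 0 would leave collection off.
    """
    return {
        # A positive value is assumed to turn on trace collection.
        "TORCH_NCCL_TRACE_BUFFER_SIZE": str(buffer_size),
        # Assumed to make each rank dump its buffer when a collective times out.
        "TORCH_NCCL_DUMP_ON_TIMEOUT": "true",
        # Assumed path prefix for the per-rank dump files.
        "TORCH_NCCL_DEBUG_INFO_TEMP_FILE": dump_prefix,
    }


env = flight_recorder_env()
# These must be set before torch.distributed.init_process_group() is called,
# for example here at the top of the training script or in the launcher.
os.environ.update(env)
```

With something like this in place, each rank should write a dump file on timeout, and those files are what the fr_trace.py tool mentioned above is meant to analyze.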