The recently published Llama 3 paper (https://arxiv.org/pdf/2407.21783) mentions using a PyTorch built-in feature called NCCL Flight Recorder. After exploring PyTorch further, I could not find any feature that records NCCL events.
Question:
Is this feature open sourced?
If yes, is there any documentation or summary of it?
Hey Jit! Yes, Flight Recorder is open sourced and available in the PyTorch branch. We haven't made any broad announcements yet because we've been testing and changing some of the underlying implementation.
We’re writing up documentation over the next few days on how to enable it and how to use the generated data to detect stuck jobs. Please stand by. As @gnadathur mentioned, FR will be marked Prototype in the upcoming 2.5 release.
The Flight Recorder collection portion is available in 2.4 and in the main branch.
The tool to analyze FR traces is under /tools/flight_recorder/fr_trace.py in the main branch.
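Until the documentation lands, here is a minimal sketch of how the collection side might be switched on. It assumes Flight Recorder is configured through the environment variables TORCH_NCCL_TRACE_BUFFER_SIZE, TORCH_NCCL_DUMP_ON_TIMEOUT and TORCH_NCCL_DEBUG_INFO_TEMP_FILE, and that they are set before the NCCL process group is created. The helper name flight_recorder_env and the default values are hypothetical, so please check the names against the PyTorch version you are running.

```python
import os


def flight_recorder_env(buffer_size=2000, dump_prefix="/tmp/nccl_trace_rank_"):
    """Build the environment variables assumed to enable Flight Recorder.

    Hypothetical helper: the variable names below are an assumption and
    may differ between PyTorch releases.
    """
    if buffer_size <= 0:
        # A non-positive buffer size is assumed to leave collection disabled.
        raise ValueError("buffer_size must be positive to record events")
    return {
        # Number of collective events kept in the in-memory ring buffer.
        "TORCH_NCCL_TRACE_BUFFER_SIZE": str(buffer_size),
        # Ask PyTorch to write the buffer to disk when a collective times out.
        "TORCH_NCCL_DUMP_ON_TIMEOUT": "true",
        # File prefix for the dumps; the rank is assumed to be appended.
        "TORCH_NCCL_DEBUG_INFO_TEMP_FILE": dump_prefix,
    }


if __name__ == "__main__":
    # Apply the settings before torch.distributed.init_process_group is called.
    os.environ.update(flight_recorder_env())
    for name, value in flight_recorder_env().items():
        print(f"{name}={value}")
```

The same variables could equally be exported in the launch script; the helper only keeps them in one place.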
The tutorial says I can run torchfrtrace directly when using nightly, but it didn’t work for me.
❯ uv run torchfrtrace -h
Traceback (most recent call last):
  File ".venv/bin/torchfrtrace", line 5, in <module>
    from tools.flight_recorder.fr_trace import main
ModuleNotFoundError: No module named 'tools'
Thanks for the quick test! I’ll figure out the packaging issue so that you don’t have to set PYTHONPATH.
I quickly tried adding an empty __init__.py file in the tools directory but that didn’t seem to help.
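For anyone hitting the same error before the packaging is fixed, here is a rough sketch of the PYTHONPATH workaround mentioned above. It assumes you have a local PyTorch source checkout and that putting its root on PYTHONPATH makes the tools package importable. The helper fr_trace_command and the checkout path in the example are hypothetical; the script location comes from the path given earlier in this thread.

```python
import os
import subprocess
import sys
from pathlib import Path


def fr_trace_command(pytorch_checkout, args, base_env=None):
    """Return (cmd, env) for running fr_trace.py from a source checkout.

    Hypothetical helper: it only assembles the command and environment,
    so nothing is executed until the caller passes them to subprocess.
    """
    root = Path(pytorch_checkout)
    script = root / "tools" / "flight_recorder" / "fr_trace.py"
    env = dict(os.environ if base_env is None else base_env)
    # Prepend the checkout root so `import tools...` resolves to it.
    existing = env.get("PYTHONPATH")
    env["PYTHONPATH"] = str(root) if not existing else str(root) + os.pathsep + existing
    cmd = [sys.executable, str(script), *args]
    return cmd, env


if __name__ == "__main__":
    # Example: same `-h` invocation as above, against an assumed checkout path.
    cmd, env = fr_trace_command(os.path.expanduser("~/src/pytorch"), ["-h"])
    if Path(cmd[1]).exists():
        subprocess.run(cmd, env=env, check=False)
    else:
        print("fr_trace.py not found; adjust the checkout path:", cmd[1])
```

This sidesteps the installed torchfrtrace entry point entirely, so it should not depend on how the nightly wheel is packaged.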