I have a few questions regarding the performance impact of enabling the environment variable
TORCH_DISTRIBUTED_DEBUG=DETAIL. I would greatly appreciate any insights on the following points:
- Code Implementation: Where in the code should this variable be set? Specifically, does enabling it have any impact on single-GPU performance?
- Scaling to Multiple GPUs: As we scale our system to a large number of GPUs, potentially hundreds, does enabling this variable lead to increased overhead or performance degradation?
- Statistics and Iteration Values: The statistics reported when this variable is enabled are typically averaged. Is there a way to access the values for each iteration? We are particularly interested in identifying potential slow nodes, and having access to per-GPU values could uncover hidden issues that averaging might mask.
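For context on the first question, here is a minimal sketch of where I currently assume the variable needs to be set, i.e. before the process group is initialized (the torch.distributed calls are shown as comments since they require a launched distributed job). Please correct me if the placement is wrong:

```python
import os

# Assumption: TORCH_DISTRIBUTED_DEBUG is read during process-group
# initialization, so it must be set before init_process_group() runs --
# either in the launching shell (e.g. when invoking torchrun) or at the
# very top of the training script.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # documented values: OFF, INFO, DETAIL

# In the actual training script this is then followed by, e.g.:
#   import torch.distributed as dist
#   dist.init_process_group(backend="nccl")

print(os.environ["TORCH_DISTRIBUTED_DEBUG"])
```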
Thank you in advance to anyone who can provide insights or information on these topics!