
I have a few questions regarding the performance impact of enabling the environment variables TORCH_CPP_LOG_LEVEL=INFO and TORCH_DISTRIBUTED_DEBUG=DETAIL. I would greatly appreciate any insights that anyone can provide on the following points:

  1. Code Implementation: Does anyone have information on where these variables should be set in the code? Specifically, I would like to know if enabling these variables has any impact on single-GPU performance.
  2. Scaling to Multiple GPUs: As we scale our system to a large number of GPUs, potentially hundreds, I'm curious whether enabling these variables leads to increased overhead or performance degradation.
  3. Statistics and Iteration Values: The statistics reported when these variables are enabled are typically averaged. Is there a way to access the values for each iteration? We are particularly interested in analyzing potential slow nodes, and having access to individual GPU values could help uncover any hidden issues that might be masked by averaging.

Thank you in advance to anyone who can provide insights or information on these topics!

  1. These are environment variables, so they should be exported in your terminal before launching the process. Setting them inside your script can also work, but I wouldn't recommend it: variables set from within the script are often assigned too late, after the code that reads them has already run. And yes, printing this extra logging to the terminal can slow down your application.
  2. Same as 1, only more so: with more GPUs you would be printing even more, so the overhead grows.
  3. I don't fully understand the question or which stats you are interested in, as these env variables are intended for debugging.
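If you do set the variables from Python rather than the shell, they have to be assigned before torch is imported. A minimal sketch (the exact point at which each variable is read varies; TORCH_CPP_LOG_LEVEL in particular is consumed when the C++ runtime initializes at import time, which is why late assignment has no effect):

```python
import os

# Assign before `import torch`: the values are read when the C++
# runtime initializes, so assignments made later have no effect.
os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

# import torch  # import only after the variables are set
```

Exporting in the shell before launch avoids this ordering pitfall entirely, which is why it is the safer default.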

Thank you for your reply. I apologize for any confusion caused by my previous question.

Regarding my first question, I am interested in the specific locations in the PyTorch source code where these variables are read. Knowing this would give me a deeper understanding of their impact on performance.

For the third question: with these environment variables enabled, we can extract useful information such as the average forward compute time. That data helps in analyzing performance issues, but our objective is to gather more detailed, per-iteration information so we can pinpoint the problem effectively.

I would appreciate any insights or suggestions on these questions. Thank you in advance.
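On the per-iteration point: one way to get individual values instead of averages is to time each step yourself. A minimal pure-Python sketch (the forward function, batch sizes, and the 2x-mean outlier threshold are all hypothetical stand-ins) showing how per-iteration values expose a slow step that the average alone would mask:

```python
import statistics
import time

def forward(batch):
    # Hypothetical stand-in for a model's forward pass: cost
    # scales with batch size.
    time.sleep(0.001 * len(batch))
    return sum(batch)

# Nine ordinary batches plus one deliberately slow one.
batches = [[1] * 2] * 9 + [[1] * 50]

times = []
for batch in batches:
    start = time.perf_counter()
    forward(batch)
    times.append(time.perf_counter() - start)

mean = statistics.mean(times)
# Keeping every per-iteration sample lets the outlier stand out;
# the mean alone would hide it.
outliers = [i for i, t in enumerate(times) if t > 2 * mean]
print(outliers)
```

The same idea extends to distributed runs: record per-rank, per-iteration timings and compare across ranks to spot slow nodes.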

You can search the source code; you will find, for example, that TORCH_DISTRIBUTED_DEBUG is read here.
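As a quick way to do that search, a small helper that scans a source tree for files mentioning an environment variable (the repo path and the helper itself are hypothetical; a plain grep over your checkout works just as well):

```python
from pathlib import Path

def find_env_var_reads(repo_root, name, suffixes=(".py", ".cpp", ".h", ".hpp")):
    # Scan source files under repo_root for occurrences of `name`.
    # Point repo_root at your local pytorch clone.
    root = Path(repo_root)
    if not root.exists():
        return []
    return sorted(
        str(p)
        for p in root.rglob("*")
        if p.is_file()
        and p.suffix in suffixes
        and name in p.read_text(errors="ignore")
    )

print(find_env_var_reads("./pytorch", "TORCH_DISTRIBUTED_DEBUG"))
```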