I want to trace my model. I started with the PyTorch profiling tutorials:
Step 1: The trace file is saved in the correct folder.
with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),  # wait 1 step, warm up for 1 step, record 3 steps, repeat once
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./traces'),  # save the trace for TensorBoard
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    for step, batch_data in enumerate(train_loader):
        prof.step()  # must be called at each step to notify the profiler of the step boundary
        if step >= 1 + 1 + 3:  # wait + warmup + active
            break
        train(batch_data)
Step 2 (ERROR 1): Displaying the trace file (JSON) in TensorBoard gives the first error. I searched the forums for similar errors and found some threads, but they did not solve my problem. I tried several paths and can exclude that as the cause.
Step 3 (ERROR 2): Reading the file (JSON) with Holistic Trace Analysis (HTA).
2024-08-07 18:05:24,170 - hta - trace.py:L389 - INFO - C:/????/?????/??????/PyTorch/log/resnet18/
2024-08-07 18:05:24,307 - hta - trace_file.py:L61 - ERROR - If the trace file does not have the rank specified in it, then add the following snippet key to the json files to use HTA; "distributedInfo": {"rank": 0}. If there are multiple traces files, then each file should have a unique rank value.
2024-08-07 18:05:24,447 - hta - trace_file.py:L61 - ERROR - If the trace file does not have the rank specified in it, then add the following snippet key to the json files to use HTA; "distributedInfo": {"rank": 0}. If there are multiple traces files, then each file should have a unique rank value.
2024-08-07 18:05:24,448 - hta - trace_file.py:L92 - WARNING - There is no item in the rank to trace file map.
2024-08-07 18:05:24,448 - hta - trace.py:L535 - INFO - ranks=[]
2024-08-07 18:05:24,449 - hta - trace.py:L541 - ERROR - The list of ranks to be parsed is empty.
I have seen multiple notes saying that the TensorBoard plugin is deprecated and HTA is now preferred. Is there a difference between their trace files? Am I misunderstanding something? (The HTA documentation refers to the same code I used.)
The issue is that HTA is aimed more at distributed jobs, so a simple single-GPU example is not covered. As the error message says, the rank needs to be specified manually if the job runs on only one GPU: add the key "distributedInfo": {"rank": 0} to the trace JSON (if there are multiple trace files, each one needs a unique rank value). This makes the file readable by HTA.
Still, I am always reluctant to change a machine-generated file by hand, and the follow-up code does not work if the job is not distributed.
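Rather than editing the trace files by hand, the missing key can be injected with a small stdlib-only script. This is a sketch, not part of HTA itself; the `./traces` directory comes from the profiler snippet above, and the function name `patch_traces` is my own. It assigns each file a distinct rank, matching what the HTA error message asks for.

```python
import glob
import json

def patch_traces(trace_dir="./traces"):
    """Add "distributedInfo": {"rank": i} to each trace JSON that lacks it,
    giving every file a unique rank so HTA can parse single-GPU traces."""
    for rank, path in enumerate(sorted(glob.glob(f"{trace_dir}/*.json"))):
        with open(path) as f:
            trace = json.load(f)
        if "distributedInfo" not in trace:
            trace["distributedInfo"] = {"rank": rank}
            with open(path, "w") as f:
                json.dump(trace, f)
```

Run it once over the trace directory before pointing HTA at it. Note that if you used `tensorboard_trace_handler(..., use_gzip=True)`, the files end in `.json.gz` and would need to be decompressed first.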
PS: This note in the "PyTorch Profiler with TensorBoard" tutorial is super confusing:
Note
TensorBoard Plugin support has been deprecated, so some of these functions may not work as previously. Please take a look at the replacement, HTA.