I want to trace my model. I started with the PyTorch profiling tutorials:
Step 1: The trace file is saved in the correct folder.
with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),  # wait 1 step, warm up for 1 step, record 3 steps, repeat once
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./traces'),  # save the trace for TensorBoard
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    for step, batch_data in enumerate(train_loader):
        prof.step()  # must be called at each step to notify the profiler of the step boundary
        if step >= 1 + 1 + 3:  # wait + warmup + active
            break
        train(batch_data)
Step 2 (ERROR 1): Displaying the trace file (JSON) in TensorBoard gives the first error. I searched the forums for similar errors and found some threads, but they did not solve my problem. I tried several paths and can exclude that as the cause.
Step 3 (ERROR 2): Reading the file (JSON) with Holistic Trace Analysis (HTA).
2024-08-07 18:05:24,170 - hta - trace.py:L389 - INFO - C:/????/?????/??????/PyTorch/log/resnet18/
2024-08-07 18:05:24,307 - hta - trace_file.py:L61 - ERROR - If the trace file does not have the rank specified in it, then add the following snippet key to the json files to use HTA; "distributedInfo": {"rank": 0}. If there are multiple traces files, then each file should have a unique rank value.
2024-08-07 18:05:24,447 - hta - trace_file.py:L61 - ERROR - If the trace file does not have the rank specified in it, then add the following snippet key to the json files to use HTA; "distributedInfo": {"rank": 0}. If there are multiple traces files, then each file should have a unique rank value.
2024-08-07 18:05:24,448 - hta - trace_file.py:L92 - WARNING - There is no item in the rank to trace file map.
2024-08-07 18:05:24,448 - hta - trace.py:L535 - INFO - ranks=[]
2024-08-07 18:05:24,449 - hta - trace.py:L541 - ERROR - The list of ranks to be parsed is empty.
I have seen multiple notes saying that the TensorBoard plugin is deprecated and HTA is now preferred. Is there a difference between their trace files? Am I misunderstanding something? (The HTA documentation refers to the same code I used.)
The issue is that HTA is aimed more at distributed jobs, so a simple single-GPU example is not covered. As the error message says, the rank needs to be specified manually if the job runs on only one GPU: add the key "distributedInfo": {"rank": 0} to the trace JSON (if there are multiple trace files, each one needs a unique rank value). This makes the file readable by HTA.
Still, I am always reluctant to change a machine-generated file by hand, and the follow-up code does not work if the job is not distributed.
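Rather than editing the trace files by hand, the missing key can be injected with a small stdlib-only script. This is a sketch, not part of HTA itself; the `./traces` directory comes from the profiler snippet above, and the function name `patch_traces` is my own. It assigns each file a distinct rank, matching what the HTA error message asks for.

```python
import glob
import json

def patch_traces(trace_dir="./traces"):
    """Add "distributedInfo": {"rank": i} to each trace JSON that lacks it,
    giving every file a unique rank so HTA can parse single-GPU traces."""
    for rank, path in enumerate(sorted(glob.glob(f"{trace_dir}/*.json"))):
        with open(path) as f:
            trace = json.load(f)
        if "distributedInfo" not in trace:
            trace["distributedInfo"] = {"rank": rank}
            with open(path, "w") as f:
                json.dump(trace, f)
```

Run it once over the trace directory before pointing HTA at it. Note that if you used `tensorboard_trace_handler(..., use_gzip=True)`, the files end in `.json.gz` and would need to be decompressed first.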
PS: This note in the "PyTorch Profiler with TensorBoard" tutorial is super confusing:
Note
TensorBoard Plugin support has been deprecated, so some of these functions may not work as previously. Please take a look at the replacement, HTA.