Hi,
My training run crashed because I ran out of disk space. On further inspection, it seems there are hundreds of GB stored under /tmp/torchinductor_azureuser/triton.
For context: I am training my model with torch.compile and DDP. A simplified sketch of the setup is below.
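This is roughly what my training loop looks like (a minimal sketch, launched with torchrun; the model, batch size, and step count are placeholders for my real ones):

```python
import os
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

torch.distributed.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Linear(1024, 1024).to(f"cuda:{local_rank}")  # stand-in for my real model
model = DDP(model, device_ids=[local_rank])
model = torch.compile(model)  # compiled kernels seem to end up on disk under /tmp

optimizer = torch.optim.AdamW(model.parameters())
for step in range(10_000):  # the real run goes for weeks
    x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
    loss = model(x).square().mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```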
What is being stored in that directory? And is there any way to prevent running out of storage during long (multi-week) training runs?
TIA