PT 2.0 - Are compiled models savable

Zha0q1 · January 9, 2023, 3:49am

Hi PyTorch community,

We are evaluating distributed training for PT 2.0 with compilation. We noticed that compiling a ~1B-parameter model makes the first few steps slower, and it can take ~10 minutes for training to reach stable, full throughput. Can a compiled model be saved in some intermediate format so that re-launching training with the same model takes less time?

Thanks!

smth · January 9, 2023, 9:06am

We do cache the compiles, so when you run the script again you shouldn’t run into recompiles (unless the cache got full).

But we can do a lot more, including having a saveable cache, as well as a moveable (or distributed) cache. At the moment, we don’t have this.
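To make the idea of a cache that survives across runs concrete, here is a toy sketch in plain Python. It is not PyTorch's implementation: `ToyDiskCompileCache`, its methods, and the "compile" step (upper-casing a string) are all hypothetical stand-ins. It only shows the general pattern of keying a stored artifact by a hash of the source, so that a later run pointed at the same directory can skip the slow step.

```python
import hashlib
import tempfile
from pathlib import Path


class ToyDiskCompileCache:
    """Toy on-disk cache keyed by a hash of the source text (illustration only)."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.compile_count = 0  # how many times the slow path ran

    def _path_for(self, source):
        key = hashlib.sha256(source.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{key}.txt"

    def _slow_compile(self, source):
        # Hypothetical stand-in for an expensive code-generation step.
        self.compile_count += 1
        return source.upper()

    def get_or_compile(self, source):
        path = self._path_for(source)
        if path.exists():
            # Cache hit: reuse the stored artifact and skip compilation.
            return path.read_text(encoding="utf-8")
        artifact = self._slow_compile(source)
        path.write_text(artifact, encoding="utf-8")
        return artifact


with tempfile.TemporaryDirectory() as tmp:
    first_run = ToyDiskCompileCache(tmp)
    first_run.get_or_compile("def gelu(x): ...")

    # A second "run" pointing at the same directory reuses the artifact.
    second_run = ToyDiskCompileCache(tmp)
    artifact = second_run.get_or_compile("def gelu(x): ...")

    first_count = first_run.compile_count
    second_count = second_run.compile_count
```

In this sketch the cache is only useful if the directory persists between runs and is visible to every process, which is the gap a saveable or distributed cache would close.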

Zha0q1 · January 9, 2023, 7:02pm

Thanks @smth! Where is the cache saved? A related question: we saw this warning when compiling fairseq RoBERTa 1.3B:

config.cache_size_limit (64)
 4:    function: 'gelu' (/fairseq/fairseq/modules/gelu.py:24)
 4:    reasons:  tensor 'x' size mismatch at index 0. expected 492, actual 440
 4: to diagnose recompilation issues, see https://pytorch.org/docs/master/dynamo/troubleshooting.html.
 2: [2023-01-07 09:23:10,654] torch._dynamo.convert_frame: [WARNING] torch._dynamo hit config.cache_size_limit (64)
 2:    function: 'gelu' (/fairseq/fairseq/modules/gelu.py:24)
 2:    reasons:  tensor 'x' size mismatch at index 0. expected 368, actual 506
 2: to diagnose recompilation issues, see https://pytorch.org/docs/master/dynamo/troubleshooting.html.

Would you share some insight into what this means? Is this an error or a warning? Thanks!
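One way to read the log above (it is tagged [WARNING], and the stated reason is a size mismatch at index 0 of `x`) is that each new leading dimension produces a new cache entry for `gelu` until the limit of 64 is reached. The toy model below sketches that reading in plain Python. It is an assumption-laden illustration, not how torch._dynamo is implemented: `ToyGuardedCache`, `pad_to_bucket`, the bucket size of 64, the shape range, and the hidden size of 1024 are all hypothetical. It also shows why padding lengths to a few fixed buckets keeps the number of distinct shapes small.

```python
class ToyGuardedCache:
    """Toy per-function cache: one entry per distinct input shape (illustration only)."""

    def __init__(self, cache_size_limit=64):
        self.cache_size_limit = cache_size_limit
        self.entries = {}        # shape -> "compiled" artifact
        self.fallback_calls = 0  # calls that ran uncompiled after the limit was hit

    def call(self, shape):
        if shape in self.entries:
            return "cached"
        if len(self.entries) >= self.cache_size_limit:
            # Limit reached: stop compiling new variants and run uncompiled.
            self.fallback_calls += 1
            return "fallback"
        self.entries[shape] = f"compiled for {shape}"
        return "compiled"


def pad_to_bucket(length, bucket=64):
    """Round a length up to the next multiple of `bucket`."""
    return ((length + bucket - 1) // bucket) * bucket


lengths = list(range(300, 520))  # 220 distinct leading dimensions

# Raw shapes: every new length is a new cache entry until the limit is hit.
raw = ToyGuardedCache(cache_size_limit=64)
for n in lengths:
    raw.call((n, 1024))

# Bucketed shapes: lengths collapse onto a handful of padded sizes.
bucketed = ToyGuardedCache(cache_size_limit=64)
for n in lengths:
    bucketed.call((pad_to_bucket(n), 1024))
```

In this toy run the raw shapes fill all 64 entries and the remaining calls fall back, while the bucketed shapes need only a few entries. Whether padding is acceptable depends on the model and the batching code, and compiling with dynamic shapes (`torch.compile(..., dynamic=True)`) may be another option worth trying.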

GuillaumeTong · March 9, 2023, 3:01am

As a workaround for the currently missing savable cache, would it be possible to pickle that cache manually? As Zhaoqi asks, where is the cache kept currently? Is it one of the class fields?

Also, this is potentially out of scope for this question, but I would be interested in a “save cache” feature for TorchScript too. Is there anything like that already implemented?

marksaroufim · March 9, 2023, 8:34pm

@GuillaumeTong I also have this requirement, so we can continue the discussion here: Inductor codecache not saving compilation time in multiprocess env · Issue #96152 · pytorch/pytorch · GitHub

Nirmai · June 21, 2023, 7:07am

@GuillaumeTong did you find a solution to the above problem? Are you able to save the compiled model?

marksaroufim · June 21, 2023, 5:33pm

@Nirmai @GuillaumeTong this is still early days, but there are ways of loading the cache in Python. I have a POC here: AOT Inductor load in python by msaroufim · Pull Request #103281 · pytorch/pytorch · GitHub