Hi PyTorch community,
We are evaluating distributed training with PyTorch 2.0 with compilation enabled. We noticed that compiling a ~1B-parameter model makes the first few steps slower, and it can take ~10 minutes for training to reach stable, full throughput. Can a compiled model be saved in some intermediate format, so that re-launching training with the same model takes less time?
We do cache the compiles, so when you run the script again you shouldn't run into recompiles (unless the cache got full).
But we can do a lot more, including having a saveable cache, as well as a moveable (or distributed) cache. At the moment, we don’t have this.
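A note on the on-disk side of that cache, as a minimal sketch rather than an official recipe: as far as I know, TorchInductor writes its generated kernels to a directory that can be overridden with the `TORCHINDUCTOR_CACHE_DIR` environment variable (by default a per-user folder under the system temp directory). Whether this is the cache meant above is my assumption. The helper name `use_persistent_inductor_cache` and the example path are hypothetical.

```python
import os
from pathlib import Path


def use_persistent_inductor_cache(cache_dir: str) -> Path:
    """Point TorchInductor's on-disk cache at a directory that survives restarts.

    Assumption: this must run before `torch` compiles anything (safest is
    before `import torch`), because the variable is read when the cache
    directory is first resolved.
    """
    path = Path(cache_dir).expanduser()
    path.mkdir(parents=True, exist_ok=True)  # create it if it does not exist yet
    os.environ["TORCHINDUCTOR_CACHE_DIR"] = str(path)
    return path


# Hypothetical usage at the very top of a training script:
# use_persistent_inductor_cache("~/.cache/my_inductor_cache")
# import torch
```

This only keeps generated kernel artifacts around between launches. It does not save the in-process Dynamo state (guards and traced graphs), so tracing still happens again on every launch.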
Thanks @smth! Where is the cache saved? A related question: we saw this warning when compiling fairseq RoBERTa 1.3B:
4: function: 'gelu' (/fairseq/fairseq/modules/gelu.py:24)
4: reasons: tensor 'x' size mismatch at index 0. expected 492, actual 440
4: to diagnose recompilation issues, see https://pytorch.org/docs/master/dynamo/troubleshooting.html.
2: [2023-01-07 09:23:10,654] torch._dynamo.convert_frame: [WARNING] torch._dynamo hit config.cache_size_limit (64)
2: function: 'gelu' (/fairseq/fairseq/modules/gelu.py:24)
2: reasons: tensor 'x' size mismatch at index 0. expected 368, actual 506
2: to diagnose recompilation issues, see https://pytorch.org/docs/master/dynamo/troubleshooting.html.
Could you share some insight into what this means? Is this an error or a warning? Thanks!
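For anyone reading the log above: my understanding is that it is a warning, not an error. Each new size of `x` at index 0 fails the shape guard and triggers another compile of `gelu`, and once 64 variants exist (`config.cache_size_limit`) that function runs uncompiled for further new shapes. The sketch below is a toy model of that behaviour in plain Python, not the real Dynamo cache, and it shows how padding the varying dimension to a few bucket sizes keeps the number of distinct shapes small. The names `pad_to_multiple` and `simulate` and the bucket size of 64 are hypothetical, and whether padding is acceptable for your batches is an assumption you would need to check.

```python
from typing import Iterable, Optional, Tuple


def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the next multiple, e.g. the padded length of a batch."""
    return ((n + multiple - 1) // multiple) * multiple


def simulate(lengths: Iterable[int], limit: int = 64,
             bucket: Optional[int] = None) -> Tuple[int, int]:
    """Toy shape-keyed compile cache: returns (compiles, fallbacks).

    A new shape costs one compile until `limit` entries exist; after that,
    every unseen shape is counted as a fallback to uncompiled execution.
    """
    seen = set()
    compiles = fallbacks = 0
    for n in lengths:
        key = pad_to_multiple(n, bucket) if bucket else n
        if key in seen:
            continue  # shape guard passes, cached code is reused
        if len(seen) >= limit:
            fallbacks += 1  # cache is full, this shape is not compiled
            continue
        seen.add(key)
        compiles += 1
    return compiles, fallbacks


# Sizes in the range seen in the log (368, 440, 492, 506, ...).
raw = simulate(range(300, 521))                  # every length is a new shape
bucketed = simulate(range(300, 521), bucket=64)  # lengths padded to 64s
print(raw, bucketed)
```

With padding, sizes such as 492 and 506 both map to 512, so they share one compiled variant instead of producing two.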
As a workaround for the currently missing saveable cache, would it be possible to pickle that cache manually? As Zhaoqi asks, where is the cache currently kept? Is it stored in a class field?
Also, this is potentially out of scope for this question, but I would be interested in a “save cache” feature for TorchScript too. Is there anything like that already implemented?
@GuillaumeTong did you find a solution to the above problem? Were you able to save the compiled model?