Zha0q1
(Zhaoqi Zhu)
January 9, 2023, 3:49am
#1
Hi PyTorch community,
We are evaluating distributed training with compilation on PyTorch 2.0. We noticed that compiling a ~1B-parameter model makes the first few steps slower, and it can take ~10 minutes for training to reach a stable, full-throughput state. I am wondering whether a compiled model can be saved in some intermediate format so that re-launching training with the same model takes less time.
Thanks!
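For context, here is a minimal sketch of the kind of setup we mean (the model and loop are illustrative stand-ins, not our actual fairseq script); the relevant point is that torch.compile is lazy, so the expensive compilation work lands in the first training steps:

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the ~1B-parameter model (not the real fairseq RoBERTa).
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True),
    num_layers=24,
)

# torch.compile is lazy: Dynamo/Inductor do their work on the first forward/backward
# calls, which is why the first few training steps are much slower.
compiled = torch.compile(model)

opt = torch.optim.AdamW(compiled.parameters(), lr=1e-4)
for step in range(5):
    x = torch.randn(8, 128, 1024)   # (batch, seq_len, d_model), synthetic data
    loss = compiled(x).mean()
    loss.backward()                 # first iterations pay the compile cost
    opt.step()
    opt.zero_grad()
```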
smth
January 9, 2023, 9:06am
#2
We do cache the compiles, so when you run the script again you shouldn't run into recompiles (unless the cache got full).
But we can do a lot more, including a saveable cache as well as a movable (or distributed) cache. At the moment, we don't have this.
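For the cache-got-full case, a minimal sketch of inspecting and raising the per-frame cache limit, assuming the PT 2.0-era torch._dynamo.config attributes (these are internal and may change in later releases):

```python
import torch._dynamo as dynamo

# The compile cache described above lives in-process; each compiled frame keeps
# up to cache_size_limit (default 64 in 2.0) specialized variants before Dynamo
# gives up and falls back to eager for that frame.
print(dynamo.config.cache_size_limit)

# Raising the limit allows more recompiles before the fallback kicks in.
dynamo.config.cache_size_limit = 128

# Resetting clears the in-memory cache, so everything recompiles on the next call.
dynamo.reset()
```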
Zha0q1
(Zhaoqi Zhu)
January 9, 2023, 7:02pm
#3
Thanks @smth! I wonder where the cache is saved? A related question: we saw this warning when compiling fairseq RoBERTa 1.3B:
```
4: torch._dynamo.convert_frame: [WARNING] torch._dynamo hit config.cache_size_limit (64)
4:    function: 'gelu' (/fairseq/fairseq/modules/gelu.py:24)
4:    reasons: tensor 'x' size mismatch at index 0. expected 492, actual 440
4: to diagnose recompilation issues, see https://pytorch.org/docs/master/dynamo/troubleshooting.html.
2: [2023-01-07 09:23:10,654] torch._dynamo.convert_frame: [WARNING] torch._dynamo hit config.cache_size_limit (64)
2:    function: 'gelu' (/fairseq/fairseq/modules/gelu.py:24)
2:    reasons: tensor 'x' size mismatch at index 0. expected 368, actual 506
2: to diagnose recompilation issues, see https://pytorch.org/docs/master/dynamo/troubleshooting.html.
```
Would you share some insight into what this means? Is this an error or a warning? Thanks!
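For anyone else hitting this, here is a hedged sketch of what the warning is reporting, using the sequence lengths from the log above. The `gelu` below is a stand-in for the fairseq module, and the mitigation lines at the end assume the 2.0-era `dynamic` flag and `cache_size_limit` config, so behavior may differ on other versions:

```python
import torch
import torch._dynamo as dynamo

@torch.compile
def gelu(x):
    return torch.nn.functional.gelu(x)

# Every new input shape that reaches the compiled frame fails the existing shape
# guard ("tensor 'x' size mismatch at index 0") and triggers a recompile. Once 64
# variants pile up, config.cache_size_limit is hit and the warning above is emitted.
# It is a warning, not an error: Dynamo stops recompiling that frame and should
# fall back to running it eagerly.
for seq_len in (492, 440, 368, 506):      # token counts vary per batch
    gelu(torch.randn(seq_len, 1024))

# Possible mitigations (2.0-era API, subject to change):
# dynamo.config.cache_size_limit = 256               # allow more specialized variants
# gelu_dyn = torch.compile(torch.nn.functional.gelu, dynamic=True)  # symbolic shapes
```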