"torch._dynamo hit config.cache_size_limit (64)"

Hey there,

I keep hitting this warning, followed by a long set of compiler messages:

[2023-07-10 15:24:52,962] torch._dynamo.convert_frame: [WARNING] torch._dynamo hit config.cache_size_limit (64) function: 'forward' (/usr/local/lib/python3.10/dist-packages/transformers/models/t5/modeling_t5.py:452) reasons:  ___check_obj_id(L['self'], 139679724322528) to diagnose recompilation issues, see https://pytorch.org/docs/master/compile/troubleshooting.html.

I checked the PyTorch docs:

https://pytorch.org/docs/stable/dynamo/troubleshooting.html#excessive-recompilation

and saw this:

torch._dynamo.config.cache_size_limit = <your desired cache limit>

Is there any guidance on how to choose the cache limit? Also, I'm not sure how to set this attribute, since I can't seem to reference torch._dynamo directly (?)

Not sure if it matters, but the model being trained was t5-base via Hugging Face's Trainer on an NVIDIA A6000 Ada card (48GB). The installation is straight from this Docker image: nvcr.io/nvidia/pytorch:23.06-py3

Thanks!

It’s likely that your model is just graph breaking because something changes per iteration. Try the nightlies, try dynamic=True, and try making sure your inputs are padded to exactly the same shape. If none of those work, try the recompilation profiler from the troubleshooting guide. And if that doesn’t help either, it’s probably something model-specific, and it would then be helpful to open an issue with a repro on GitHub.
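For reference, a minimal sketch of the first two knobs mentioned above (assumes PyTorch 2.x; the limit of 128 and the Linear stand-in model are just illustrative choices, not official recommendations):

import torch
import torch._dynamo  # the config lives on this submodule, so import it explicitly

# Raise the per-function recompile budget above the default of 64 from the warning.
# 128 is an arbitrary example value, not an official recommendation.
torch._dynamo.config.cache_size_limit = 128

# Stand-in module; in the question this would be the T5 model wrapped by the Trainer.
model = torch.nn.Linear(16, 16)

# dynamic=True asks the compiler to generalize over input shapes instead of
# specializing (and recompiling) for every new shape it sees.
compiled = torch.compile(model, dynamic=True)
out = compiled(torch.randn(4, 16))

For the padding suggestion, a Hugging Face tokenizer can be called with padding="max_length" and a fixed max_length so every batch has the same sequence length.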
