Does `torch.nn.Transformer` use caching for autoregressive decoding?

This blog post says that we can make the transformer much faster during decoding by caching the previous inputs, since there's no need to re-compute the same things we have already computed. When I read the implementation of torch.nn.Transformer, it doesn't seem to have such a feature (at least as of 1.7.0). If I'm right, is there any plan to implement such a feature, or has someone already implemented it so that it will be released in 1.8.0? Thanks!
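For context, this is roughly what I mean. A naive greedy-decoding loop with nn.Transformer looks like the sketch below (the names `tok_emb`, `generator`, `vocab_size`, and `bos_idx` are my own placeholders, not part of the module): every step re-runs the decoder over the entire target prefix, which is exactly the redundant work that key/value caching would avoid.

```python
import torch
import torch.nn as nn

d_model, vocab_size, bos_idx, max_len = 512, 1000, 1, 20

model = nn.Transformer(d_model=d_model, nhead=8)    # default 6 encoder / 6 decoder layers
tok_emb = nn.Embedding(vocab_size, d_model)         # assumed target embedding
generator = nn.Linear(d_model, vocab_size)          # assumed output projection

src = torch.rand(10, 1, d_model)                    # (S, N, E): already-embedded source
ys = torch.full((1, 1), bos_idx, dtype=torch.long)  # (T, N): tokens generated so far

with torch.no_grad():
    memory = model.encoder(src)                     # encode the source once
    for _ in range(max_len):
        tgt = tok_emb(ys)                           # (T, N, E)
        tgt_mask = model.generate_square_subsequent_mask(ys.size(0))
        # The decoder re-processes the *whole* prefix at every step; with
        # key/value caching, only the newest position would need computing.
        out = model.decoder(tgt, memory, tgt_mask=tgt_mask)   # (T, N, E)
        next_tok = generator(out[-1]).argmax(dim=-1)          # (N,)
        ys = torch.cat([ys, next_tok.unsqueeze(0)], dim=0)    # append new token
```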

As far as I know, nn.Transformer has remained unchanged since its release. Nonetheless, I found a post that might answer your question. Another option is to dive into the source code of the Hugging Face Transformers library to learn how they implement the cache and other optimization tricks (which is tough, but well worth it!). Moreover, after implementing these tricks, you can pass a custom_decoder (and likewise a custom_encoder) as args when instantiating nn.Transformer.
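A minimal sketch of that plumbing is below (this is not PyTorch's own caching, which doesn't exist): any module whose forward matches nn.TransformerDecoder's signature can be passed as custom_decoder, and the cache would live inside it. `CachingDecoder` and its `self.cache` slot are hypothetical; the actual cache logic would require rewriting the attention layers and is only indicated in comments.

```python
import torch
import torch.nn as nn

class CachingDecoder(nn.Module):
    """Drop-in decoder for nn.Transformer; key/value caching would be added here."""
    def __init__(self, d_model, nhead, num_layers):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead)
        self.inner = nn.TransformerDecoder(layer, num_layers)
        self.cache = None  # hypothetical slot for per-layer cached keys/values

    def forward(self, tgt, memory, tgt_mask=None, memory_mask=None,
                tgt_key_padding_mask=None, memory_key_padding_mask=None, **kwargs):
        # Without rewriting the attention layers this still recomputes the full
        # prefix; a real implementation would reuse keys/values stored in
        # self.cache. Extra kwargs from newer PyTorch versions are passed through.
        return self.inner(tgt, memory, tgt_mask=tgt_mask,
                          memory_mask=memory_mask,
                          tgt_key_padding_mask=tgt_key_padding_mask,
                          memory_key_padding_mask=memory_key_padding_mask,
                          **kwargs)

d_model, nhead = 512, 8
model = nn.Transformer(d_model=d_model, nhead=nhead,
                       custom_decoder=CachingDecoder(d_model, nhead, num_layers=6))
out = model(torch.rand(10, 2, d_model), torch.rand(7, 2, d_model))
print(out.shape)  # torch.Size([7, 2, 512])
```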