Best practice for modifying intermediate activations of CNNs

For my project I plan to insert RNNs between every CNN block. I could hack apart any single CNN, but that would take a lot of time. I would rather take advantage of the pretrained torchvision models, ideally with minimal hacking, and preferably reuse the same code across all of the provided pretrained models. That might be difficult, though, since the torchvision models all have slightly different constructions.

I looked around and the closest thing I found is forward hooks: I could register an RNN as a hook on each layer's output activation. However, I have read that hooks introduce global state and should only be used for debugging and profiling. What exactly does this mean? Would this be a recommended practice? If not, what alternatives would you suggest?