What is the "persistent algorithm" in GRU and LSTM?

LSTM and GRU docs have the following notes:

“If the following conditions are satisfied: 1) cudnn is enabled, 2) input data is on the GPU 3) input data has dtype torch.float16 4) V100 GPU is used, 5) input data is not in PackedSequence format persistent algorithm can be selected to improve performance.”

What is persistent algorithm and how can I select it?

You cannot select it manually; it will be used automatically if the specified conditions are met.
This GTC talk gives some information on persistent kernels, which basically avoid memory "movement" by keeping the recurrent weights on-chip (in registers or shared memory) across time steps, instead of reloading them from global memory at every step.
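For reference, the eligibility conditions from the docs can be written down as a small predicate. This is just a restatement of the quoted note, not the actual dispatch logic inside cuDNN (which, as discussed below, appears to depend on more than these conditions):

```python
def persistent_eligible(cudnn_enabled: bool,
                        on_gpu: bool,
                        dtype: str,
                        gpu_supported: bool,
                        is_packed: bool) -> bool:
    """Restate the documented preconditions for the persistent algorithm.

    gpu_supported stands in for the "V100 GPU is used" condition; newer
    docs/hardware may differ, so treat it as an assumption.
    """
    return (cudnn_enabled
            and on_gpu
            and dtype == "float16"
            and gpu_supported
            and not is_packed)          # PackedSequence input disqualifies

# A float16 LSTM input on a supported GPU passes; float32 does not.
print(persistent_eligible(True, True, "float16", True, False))  # True
print(persistent_eligible(True, True, "float32", True, False))  # False
```

Note that even when all of these hold, cuDNN may still fall back to a non-persistent algorithm based on problem size, as the measurements below suggest.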

Sorry to re-open this thread, but wanted to ask some questions to check if someone figured it out already.

I was testing LSTMs and GRUs on an A100 and a GB200. The A100 seems to use the persistent algorithm for some combinations of batch size and hidden dimension; the GB200 doesn’t seem to use it at all.

Apart from that, it’s not clear to me which rules trigger the use of the persistent algorithm. For example, when training on an A100, a unidirectional LSTM with sequence length 2048 and batch size 16 uses the persistent kernel, whereas a bidirectional LSTM with sequence length 2048 and batch size 8 does not.
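One way to check empirically which kernel a given configuration dispatched is to profile a forward pass and scan the CUDA kernel names. In my experience cuDNN's persistent RNN kernels contain "persist" in their names, but that is an assumption worth verifying on your own setup. A sketch, guarded so it is a no-op without a CUDA GPU:

```python
def find_persistent_kernels(kernel_names):
    """Return the names that look like cuDNN persistent RNN kernels.

    Assumes persistent kernels are identifiable by a "persist" substring,
    which may not hold across all cuDNN versions.
    """
    return [n for n in kernel_names if "persist" in n.lower()]

try:
    import torch

    if torch.cuda.is_available() and torch.backends.cudnn.is_available():
        # Hypothetical sizes; vary these to probe the dispatch boundaries.
        lstm = torch.nn.LSTM(256, 256).cuda().half()
        x = torch.randn(2048, 16, 256, device="cuda", dtype=torch.float16)

        with torch.profiler.profile(
            activities=[torch.profiler.ProfilerActivity.CUDA]
        ) as prof:
            lstm(x)

        names = [evt.name for evt in prof.events()]
        print(find_persistent_kernels(names) or "no persistent kernel found")
except ImportError:
    pass  # torch not installed; the helper above still works standalone
```

Sweeping batch size and hidden dimension with this kind of probe is, as far as I can tell, the only way to map the boundaries, since the selection heuristic is internal to cuDNN.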

Does anyone know the boundaries at which the persistent algorithm will be chosen instead of the standard one?