When should one use `torch.nn.RNNCell` instead of `torch.nn.RNN`?

I have seen some applications use `torch.nn.RNNCell` to build stacked RNN layers. Will such a manually built block have performance issues compared to `torch.nn.RNN`?

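`torch.nn.RNN` is usually faster: it processes the whole sequence (and all stacked layers) in a single call, and on GPU it can use cuDNN’s implementation under the hood. `torch.nn.RNNCell` computes only one time step, so you have to write the loop over the sequence yourself in Python. That is slower, but more flexible, since you can inspect or modify the hidden state between steps.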
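To make the difference concrete, here is a minimal sketch comparing the two on the same input. The tensor sizes are arbitrary, and copying the weights from the `RNN` into the `RNNCell` is done only for illustration, so that both modules compute the same function and their outputs can be compared.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Arbitrary sizes chosen for illustration.
seq_len, batch, input_size, hidden_size = 5, 3, 4, 6
x = torch.randn(seq_len, batch, input_size)  # layout: (seq_len, batch, features)

rnn = nn.RNN(input_size, hidden_size, num_layers=1)  # batch_first=False by default
cell = nn.RNNCell(input_size, hidden_size)

# Copy the single layer's parameters so both modules compute the same function.
with torch.no_grad():
    cell.weight_ih.copy_(rnn.weight_ih_l0)
    cell.weight_hh.copy_(rnn.weight_hh_l0)
    cell.bias_ih.copy_(rnn.bias_ih_l0)
    cell.bias_hh.copy_(rnn.bias_hh_l0)

# nn.RNN: one call handles the whole sequence.
out_rnn, h_rnn = rnn(x)  # out_rnn: (seq_len, batch, hidden), h_rnn: (1, batch, hidden)

# nn.RNNCell: explicit Python loop, one time step per call.
h = torch.zeros(batch, hidden_size)
steps = []
for t in range(seq_len):
    h = cell(x[t], h)  # the hidden state is available here for custom logic
    steps.append(h)
out_cell = torch.stack(steps)  # (seq_len, batch, hidden)

print(torch.allclose(out_rnn, out_cell, atol=1e-5))
```

The loop body is where `RNNCell` pays off: you can, for example, feed the previous output back as the next input or apply a custom operation to `h` at each step, which `nn.RNN` does not let you do. If you need none of that, `nn.RNN` with `num_layers` set is the simpler and usually faster choice for stacked layers.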