Transformer vs RNN for real-time speech separation

I’ve read in “Attention Is All You Need in Speech Separation” (the SepFormer paper) that Transformers outperform RNNs (specifically Dual-Path RNN) in speech separation, but with roughly ten times as many parameters. I’ve also read that a Transformer retains information from early parts of the input sequence better than an RNN does. However, how well does a Transformer perform in *real-time* speech separation? Do the parameter count and the way it attends over the entire input sequence affect its ability to separate speech in real time?
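
To make the latency concern concrete, here is a rough sketch I put together (PyTorch; the layer widths, chunk size, and utterance length are my own illustrative choices, not the configurations from the paper). It compares parameter counts and per-forward latency of a small Transformer encoder against an LSTM of similar width:

```python
# Minimal sketch, NOT the SepFormer/DPRNN architectures: it only illustrates
# how parameter count and context handling differ between the two families.
import time
import torch
import torch.nn as nn

torch.manual_seed(0)

D = 256      # feature dimension (assumed, for illustration only)
CHUNK = 50   # frames per streaming chunk (assumed)
FULL = 1000  # frames in a full utterance (assumed)

transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, dim_feedforward=1024,
                               batch_first=True),
    num_layers=4,
)
lstm = nn.LSTM(input_size=D, hidden_size=D, num_layers=4, batch_first=True)

def n_params(m):
    # Total trainable parameter count of a module.
    return sum(p.numel() for p in m.parameters())

print(f"Transformer params: {n_params(transformer) / 1e6:.1f}M")
print(f"LSTM params:        {n_params(lstm) / 1e6:.1f}M")

@torch.no_grad()
def avg_forward_ms(module, x, reps=20):
    # Average wall-clock time of a forward pass, after one warm-up call.
    module(x)
    t0 = time.perf_counter()
    for _ in range(reps):
        module(x)
    return (time.perf_counter() - t0) / reps * 1e3

chunk = torch.randn(1, CHUNK, D)
full = torch.randn(1, FULL, D)

# An RNN carries a hidden state forward, so streaming costs one chunk per
# step. Vanilla self-attention has no such state: to use context beyond the
# current chunk it must re-attend over a growing window, with cost roughly
# quadratic in that window's length.
print(f"Transformer, {CHUNK}-frame chunk:  {avg_forward_ms(transformer, chunk):.2f} ms")
print(f"Transformer, {FULL}-frame window: {avg_forward_ms(transformer, full):.2f} ms")
print(f"LSTM, {CHUNK}-frame chunk:         {avg_forward_ms(lstm, chunk):.2f} ms")
```

As I understand it, the LSTM can carry its hidden state across chunks and pay a constant cost per step, while the Transformer has to re-attend over however much past context it keeps, which seems like the core obstacle for streaming. Is that the right way to think about it?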