Trying to understand the use of ReLU in an LSTM network

I am currently trying to optimize a simple NN with Optuna. Besides the learning rate, batch size, etc., I want to optimize the network architecture as well. So far I optimize the number of LSTM layers as well as the number of Dense layers. But now I am thinking about activation functions. Bear in mind I am very new to NNs, but I constantly read about ReLU and Leaky ReLU, and I know an LSTM uses tanh and sigmoid internally. So at first I thought maybe the internal tanh gets swapped out for a ReLU, but I think I got that wrong, right?
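For context, here is a simplified sketch of what my Optuna search space looks like (the parameter names and ranges are just illustrative, not my real ones):

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Hyperparameters I tune so far (illustrative names and ranges)
    lr = trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128])
    n_lstm_layers = trial.suggest_int("n_lstm_layers", 1, 3)
    n_dense_layers = trial.suggest_int("n_dense_layers", 1, 3)
    # ... build the model from these values, train it ...
    return 0.0  # placeholder: return the real validation loss here

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
```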

What I have seen is that nn.ReLU() gets applied in between layers, so I would think it would only make sense to apply it in between my Dense layers?
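To make it concrete, something like this is what I have in mind (simplified sketch, sizes are made up): the LSTM keeps its internal tanh/sigmoid gates untouched, and nn.ReLU() only sits between the Linear ("Dense") layers that follow it.

```python
import torch
import torch.nn as nn

class LSTMWithDenseHead(nn.Module):
    def __init__(self, input_size=16, hidden_size=64, output_size=1):
        super().__init__()
        # The LSTM's internal tanh/sigmoid stay as they are
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_size, 32),
            nn.ReLU(),                   # activation between the Dense layers
            nn.Linear(32, output_size),  # usually no ReLU after the last layer
        )

    def forward(self, x):
        out, _ = self.lstm(x)          # out: (batch, seq_len, hidden_size)
        return self.head(out[:, -1])   # use the last time step

model = LSTMWithDenseHead()
y = model(torch.randn(8, 20, 16))      # (batch=8, seq_len=20, features=16)
print(y.shape)                         # torch.Size([8, 1])
```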

Sorry for the noob question. I am having trouble understanding these things because they are so basic that they are hardly discussed anywhere.

I like the explanation from here:

Ah, that's interesting. So does that mean I should not use any activation function in my whole model? I am still a bit confused, since I have seen so many models use ReLU.

If you have linear layers beside LSTM blocks, adding an activation function between them usually works. You can test both options, of course; see the sketch below. In my experience, ReLU is really effective for generalization.
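For example, you could even let Optuna decide whether to put an activation between the linear layers, and which one (the parameter name and sizes here are made up, just a sketch):

```python
import optuna
import torch.nn as nn

def build_head(trial: optuna.Trial, hidden_size: int = 64, output_size: int = 1) -> nn.Sequential:
    # Let Optuna pick the activation between the Linear layers (or none at all)
    activation = trial.suggest_categorical("head_activation", ["relu", "leaky_relu", "none"])
    layers = [nn.Linear(hidden_size, 32)]
    if activation == "relu":
        layers.append(nn.ReLU())
    elif activation == "leaky_relu":
        layers.append(nn.LeakyReLU())
    layers.append(nn.Linear(32, output_size))
    return nn.Sequential(*layers)
```

Then you just plug the returned head on top of the LSTM output, as in the model sketch earlier in the thread, and let the study tell you which option generalizes better on your data.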