I don't know how to use GRU: it moves but doesn't learn

I don't know how to use GRU.
It moves but doesn't learn.
This is the first time I've used an RNN layer in PyTorch, so I think that is the cause.
Please tell me if there is not enough information; I will add it.

code
# Definition
self.rnn = nn.GRU(self.hidden_size1, gru_hidden)
self.norm = nn.LayerNorm(gru_hidden)
self.Relu = swish(0.7)

# Runtime
x = self.Linear(x)
x, hidden = self.rnn(x, hidden)
onry_x = self.Relu(self.norm(x))

Causes I can think of:

1. Insufficient understanding of GRU and proper initialization (for example, He initialization with biases set to 0). I don't even know how to access the weights and biases (see the sketch after this list).

2. There is a tanh inside the GRU (according to the official description).

3. Something else is missing.
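
For the weight access in point 1: a minimal sketch, assuming a plain `nn.GRU` like the one above. The parameter names (`weight_ih_l0`, `weight_hh_l0`, `bias_ih_l0`, `bias_hh_l0`) are PyTorch's documented naming; the specific initializers below are just one common convention, not something this thread prescribes.

    import torch.nn as nn

    gru = nn.GRU(64, 128)  # placeholder sizes for hidden_size1 and gru_hidden

    # nn.GRU exposes its parameters under documented names:
    #   weight_ih_l0: (3*hidden, input)   weight_hh_l0: (3*hidden, hidden)
    #   bias_ih_l0 / bias_hh_l0: (3*hidden,), gates stacked as [reset | update | new]
    for name, param in gru.named_parameters():
        if "weight_ih" in name:
            nn.init.kaiming_normal_(param)  # He initialization, as mentioned in point 1
        elif "weight_hh" in name:
            nn.init.orthogonal_(param)      # a common choice for recurrent weights
        elif "bias" in name:
            nn.init.zeros_(param)           # biases set to 0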

I’m having trouble understanding your issue. Could you post some executable code?

If I post all of it, it will be long, so here is only a part:

loss = torch.sum(output.permute(1, 2, 0), dim=2) - torch.sum(targets.detach().permute(1, 2, 0), dim=2)

optimizer.zero_grad()
loss = loss__(loss)  # (loss**2)/2
loss.backward()
torch.nn.utils.clip_grad_norm_(mainQN.parameters(), max_norm=1, norm_type=2)
optimizer.step()
scheduler.step()
self.tortal_losses += loss.clone().detach().to('cpu')

target.size = (time, batch, output)
output.size = (time, batch, output)
input.size  = (time, batch, 20)
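
Those shapes match `nn.GRU`'s default time-major layout (`batch_first=False`). A minimal sketch of the shape flow with placeholder sizes (`hidden_size1` and `gru_hidden` are assumed values, since the post does not give them):

    import torch
    import torch.nn as nn

    time, batch = 10, 4                           # placeholder sizes
    in_features, hidden_size1, gru_hidden = 20, 32, 32

    linear = nn.Linear(in_features, hidden_size1)
    rnn = nn.GRU(hidden_size1, gru_hidden)        # batch_first=False by default

    x = torch.randn(time, batch, in_features)     # input.size = (time, batch, 20)
    h0 = torch.zeros(1, batch, gru_hidden)        # (num_layers, batch, hidden)

    x = linear(x)                                 # Linear acts on the last dim
    out, hn = rnn(x, h0)
    print(out.shape)                              # (time, batch, gru_hidden)
    print(hn.shape)                               # (1, batch, gru_hidden)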

model

  # Definition
  self.rnn = nn.GRU(self.hidden_size1, gru_hidden)
  self.norm = nn.LayerNorm(gru_hidden)
  self.Relu = swish(0.7)

  # Runtime
  x = self.Linear(x)
  x, hidden = self.rnn(x, hidden)
  onry_x = self.Relu(self.norm(x))

I'm doing something a little different after the GRU. Is this a problem?
It shouldn't be a problem by itself, since it is unchanged from the code that worked before the GRU was introduced.

    # Build a lower-triangular L whose diagonal is exponentiated (hence strictly
    # positive), then P = L @ L^T, which makes P symmetric positive definite.
    L = self.L(x).view(bacth_size * timee, self.num_outputs, self.num_outputs)
    tril_mask = torch.tril(torch.ones(self.num_outputs, self.num_outputs), diagonal=-1).unsqueeze(0).to("cuda:0")
    diag_mask = torch.diag(torch.diag(torch.ones(self.num_outputs, self.num_outputs))).unsqueeze(0).to("cuda:0")
    L = L * tril_mask.expand_as(L) + torch.exp(L) * diag_mask.expand_as(L)
    P = torch.bmm(L, L.transpose(2, 1))
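
As a quick standalone check of that construction (hypothetical sizes, not the model's): the exponentiated diagonal makes every L invertible, so P = L @ L^T is symmetric positive definite and passes a Cholesky factorization.

    import torch

    n, k = 5, 3                                   # hypothetical batch and num_outputs
    raw = torch.randn(n, k, k)

    tril_mask = torch.tril(torch.ones(k, k), diagonal=-1).unsqueeze(0)
    diag_mask = torch.diag(torch.diag(torch.ones(k, k))).unsqueeze(0)

    L = raw * tril_mask + torch.exp(raw) * diag_mask
    P = torch.bmm(L, L.transpose(2, 1))

    torch.linalg.cholesky(P)                      # raises if P is not positive definite
    print(torch.allclose(P, P.transpose(2, 1)))   # True: P is symmetric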

In the PyTorch documentation, the GRU is described like this:

   \begin{array}{ll}
    r = \sigma(W_{ir} x + b_{ir} + W_{hr} h + b_{hr}) \\
    z = \sigma(W_{iz} x + b_{iz} + W_{hz} h + b_{hz}) \\
    n = \tanh(W_{in} x + b_{in} + r * (W_{hn} h + b_{hn})) \\
    h' = (1 - z) * n + z * h
    \end{array}

I don't want to use tanh.
Is there a way to change it to ReLU or swish?
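
`nn.GRU` does not expose the activation as an argument (only `nn.RNN` has a `nonlinearity` option), so replacing tanh means writing a custom cell. Below is a minimal sketch of one GRU step following the equations above, with the candidate activation made pluggable; it is an illustration, not a drop-in replacement for the cuDNN-backed `nn.GRU` (you loop it over time yourself, and it will be slower):

    import torch
    import torch.nn as nn

    class CustomGRUCell(nn.Module):
        # One GRU step; `activation` replaces the tanh on the candidate n.
        def __init__(self, input_size, hidden_size, activation=torch.relu):
            super().__init__()
            self.ih = nn.Linear(input_size, 3 * hidden_size)   # W_ir|W_iz|W_in (+ biases)
            self.hh = nn.Linear(hidden_size, 3 * hidden_size)  # W_hr|W_hz|W_hn (+ biases)
            self.activation = activation

        def forward(self, x, h):
            gi = self.ih(x).chunk(3, dim=-1)            # W_i? x + b_i? for r, z, n
            gh = self.hh(h).chunk(3, dim=-1)            # W_h? h + b_h? for r, z, n
            r = torch.sigmoid(gi[0] + gh[0])            # reset gate
            z = torch.sigmoid(gi[1] + gh[1])            # update gate
            n = self.activation(gi[2] + r * gh[2])      # candidate: tanh swapped out here
            return (1 - z) * n + z * h                  # h' = (1 - z) * n + z * h

    # Usage over a (time, batch, features) sequence:
    cell = CustomGRUCell(20, 32, activation=nn.SiLU())  # nn.SiLU is PyTorch's built-in swish
    x = torch.randn(10, 4, 20)                          # (time, batch, features)
    h = torch.zeros(4, 32)
    outs = []
    for t in range(x.size(0)):
        h = cell(x[t], h)
        outs.append(h)
    out = torch.stack(outs)                             # (time, batch, hidden)

Note that only the candidate tanh is swapped; the gates keep their sigmoids. One caution: with an unbounded activation like ReLU, the hidden state loses tanh's squashing and can grow over long sequences, which is part of why the stock GRU uses tanh.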