GRU calculations for gates and output in testing mode


I tried many approaches but could not find a solution. I am implementing a GRU layer of a model in C.
I have extracted the weights and biases. Can you give me the equations to compute the gates and outputs?

I am using hidden size = 32
input_size =32
num_layers = 1
batch_first = true

I followed the equations in the GRU section of the documentation.
My expected output is 151x32.
Only the first 1x32 row matches; the remaining 150x32 does not.

I am not able to understand what the time step means at implementation time.
If possible, please suggest a document about implementing a GRU in testing mode.

GRU (forward) behaves the same between training and testing (and as in the documentation you linked). The time step is the first dimension of the 3d input and output, seq_len x batch x features. Maybe making the batch size explicit (use unsqueeze) can help, just in case you accidentally fed a single time step with batch size 151 into the GRU when you wanted 151 time steps with batch size 1.
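For example, a minimal sketch of what I mean (module and tensor names here are made up for illustration):

```python
import torch
import torch.nn as nn

# Same configuration as yours: input_size=32, hidden_size=32, one layer
gru = nn.GRU(input_size=32, hidden_size=32, num_layers=1, batch_first=True)

# One sequence of 151 time steps, 32 features each
x = torch.randn(151, 32)

# With batch_first=True the GRU expects (batch, seq_len, features),
# so make the batch dimension of 1 explicit with unsqueeze:
out, h_n = gru(x.unsqueeze(0))
print(out.shape)   # torch.Size([1, 151, 32])
print(h_n.shape)   # torch.Size([1, 1, 32])
```

If you had instead fed `x` as shape (151, 1, 32) with `batch_first=True`, the GRU would silently treat it as 151 sequences of length 1.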

Best regards


Thanks for the quick reply. Can you help me one step further? I am posting the code I have been stuck on for a few days; please correct me.
Input dimension is 1x151x32
k = crnn_model.rnn.weight_ih_l0
p  = crnn_model.rnn.weight_hh_l0
m  = crnn_model.rnn.bias_hh_l0
l  = crnn_model.rnn.bias_ih_l0

As mentioned, the GRU has one layer. I am feeding zeros as the initial state and updating it with the new state after calculating each output.

This is the code that gives me the wrong results, as I mentioned. Please help me resolve this; I have been stuck here for days.

ht = np.zeros(32)
ht_past = np.zeros(32)
for j in range(151):
    for i in range(32):
        sum_r = np.sum(k[i]*input[0][j]) + l[i+0]  +  m[i+0] + (np.sum(p[i]))*ht[i]
        r  =   sum_r 
        r  = 1 / (1 + math.exp(-r))
        sum_z = np.sum(k[i+32]*input[0][j]) + l[i+32] +  m[i+32] + np.sum(p[i+32])*ht[i]
        z  =   sum_z
        z  = 1 / (1 + math.exp(-z))
        n = np.sum(k[i+64]*input[0][j]) + l[i+64] +(m[i+64]+np.sum(p[i+64])*ht[i])*r
        n = math.tanh(n)
        out   = (1-z)*n +  z*(ht[i])
        ht_past[i] =  out        
    ht = ht_past 	

My input to the GRU is 1x151x32.
I hope this lets me move forward.

  • I’m not sure why this is a libtorch question?
  • I would suggest rewriting it in PyTorch first using torch.nn.functional (with linear, sigmoid, and tanh). If you have to, you can refine that afterwards.
  • np.sum seems wrong, as you want to do reduction over one axis only.
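To illustrate the point about the reductions, here is a rough sketch of what one vectorized GRU time step looks like (this is only a sketch; it assumes the weights are packed row-wise as [reset; update; new], the way PyTorch stores `weight_ih_l0` etc.):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, W_ih, W_hh, b_ih, b_hh, H=32):
    # PyTorch packs the three gates row-wise: rows [0:H] reset,
    # [H:2H] update, [2H:3H] new gate
    W_ir, W_iz, W_in = W_ih[:H], W_ih[H:2*H], W_ih[2*H:]
    W_hr, W_hz, W_hn = W_hh[:H], W_hh[H:2*H], W_hh[2*H:]
    b_ir, b_iz, b_in = b_ih[:H], b_ih[H:2*H], b_ih[2*H:]
    b_hr, b_hz, b_hn = b_hh[:H], b_hh[H:2*H], b_hh[2*H:]

    # Each term is a matrix-vector product over the full hidden state,
    # not a sum of one weight row times a single h entry
    r = sigmoid(W_ir @ x + b_ir + W_hr @ h + b_hr)
    z = sigmoid(W_iz @ x + b_iz + W_hz @ h + b_hz)
    n = np.tanh(W_in @ x + b_in + r * (W_hn @ h + b_hn))
    return (1.0 - z) * n + z * h

# Over the whole sequence (input shape 1 x 151 x 32), roughly:
# h = np.zeros(32)
# for t in range(151):
#     h = gru_step(input[0][t], h, k, p, l, m)
```

Comparing this against your inner loop should show where the scalar sums diverge from the matrix products.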

Best regards


Maybe I expressed the question in a wrong way. It is not libtorch; I am just checking the outputs by loading the model and extracting the weights in Python itself.

  • I converted them into NumPy, and only then am I using np.add. I skipped those steps in the post.

  • Is the procedure I followed correct with respect to the time steps, i.e. taking 32 features at a time?

I used this to take care of the axis order:

  • out = out.permute(0, 2, 1, 3)
  • out = out.view(out.shape[0], out.shape[1], out.shape[2])

I am using batch_first=True,
so my input dimension is

  • batch x seq_len x features

Well, yeah, the summation for the matmuls is still wrong.

Actually, I wanted to know which operation is going wrong. Can you point out which matmul calculation is incorrect?

  1. Should I not use numpy.sum()? Is this the one…

Sorry, I still do not see which line is wrong. :pensive:

Do print the result of your sum and then look at what is wrong.

Ok, those are matrix multiplications, right (except for the term with ht)? Not pointwise multiplications.
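To make sure I understand the difference, with made-up toy numbers:

```python
import numpy as np

p_row = np.array([1.0, 2.0, 3.0])   # one row of the hidden weight matrix
h = np.array([0.5, -1.0, 2.0])      # the hidden state

# What my loop computes: the sum of the row, times a single h entry
wrong = np.sum(p_row) * h[0]        # 6.0 * 0.5 = 3.0

# What the GRU equation needs: the dot product of the row with all of h
right = np.sum(p_row * h)           # 0.5 - 2.0 + 6.0 = 4.5

print(wrong, right)                 # 3.0 4.5
```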


Thank you for your answers. The problem is resolved, but the precision is off by 0.003, which is not desirable, right? Can you give me suggestions on this?

Depending on the size of the arguments, this can be an effect of machine precision (~relative 1e-7 for float, ~relative 1e-14 to 1e-15 for double).
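For reference, you can query the machine epsilon of both types directly:

```python
import numpy as np

# Machine epsilon: the best relative precision each type can represent
print(np.finfo(np.float32).eps)   # 1.1920929e-07
print(np.finfo(np.float64).eps)   # 2.220446049250313e-16
```

Any rounding at or above these relative magnitudes is expected, and errors can grow further when many operations are chained.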


Yes, the precision is ~relative 1e-7 for the first time step, but for the second time step it degrades to ~relative 1e-3.

For the remaining time steps the precision loss keeps accumulating, since each time step depends on the previous one.

Is there any way I can solve this issue?


Other than using double?

I tried it in C++ as long double and float as well. Maybe it would work if the first time step's output matched perfectly, without any precision loss, but I do not know how to achieve that, since every time step's output depends on the previous time steps. Is that right?

Shall I send you the C code which I have implemented?

To be perfectly honest, I think we’ve strayed quite far from using libtorch already and for me I’d not take this any further.

Best regards and all the best for your project


Problem resolved. Thank you very much for your answers; they helped me understand the time steps.