How is the randomness of Gumbel-Softmax reflected during evaluation?

I am very new to reinforcement learning and Gumbel-Softmax.
I am currently training an LSTM network to generate a policy that achieves a good tradeoff between two objectives: final accuracy and total computation steps.
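For reference, each discrete decision of the policy is sampled with Gumbel-Softmax roughly like this (a simplified sketch with placeholder logits instead of my real LSTM output; the tau and hard settings are just illustrative):

```python
import torch
import torch.nn.functional as F

# Placeholder for the logits my LSTM policy head would produce at one
# decision step, shape (batch, num_actions), e.g. "continue" vs. "stop"
logits = torch.randn(4, 2)

# Gumbel-Softmax draws Gumbel noise and adds it to the logits, so every
# forward pass can pick a different (approximately one-hot) action,
# even when the logits themselves are fixed
action = F.gumbel_softmax(logits, tau=1.0, hard=True)
print(action)
```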

Ideally, I hope to achieve the highest final accuracy while using as few steps as possible. My initial experiments look promising:
(1) If I only optimize the “final accuracy loss”, the system learns to use the most steps and achieves the highest accuracy, around 99%.
(2) If I only optimize the “computation steps loss”, the system learns to use 0 steps and achieves the lowest accuracy, around 1%.
Now I have combined these two losses in a naive way, and the training curves fluctuate a lot.
I can identify an epoch that achieved a great tradeoff (i.e. 91% accuracy with very few steps). But after I downloaded this checkpoint and reloaded it in PyTorch, the result is actually not that good, because:
(1) With torch.manual_seed(0), testing this checkpoint always gives me 83% instead of 91%.
(2) If I repeat the validation epoch multiple times with this same checkpoint, the accuracy changes like this: 0.83, 0.67, 0.59, 0.63, …
(3) If I insert torch.manual_seed(0) at the beginning of each iteration of the validation loop, the accuracy becomes 0.62, 0.62, 0.62, 0.62, 0.62, … fixed at this value (see the minimal snippet below).
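For observations (2) and (3), here is a minimal snippet (placeholder logits only, not my actual model) that shows the same pattern I see in validation:

```python
import torch
import torch.nn.functional as F

logits = torch.zeros(1, 2)  # placeholder policy logits, fixed on purpose

# Without reseeding: each validation pass draws fresh Gumbel noise,
# so the sampled actions (and hence the accuracy I measure) keep changing
for _ in range(3):
    print(F.gumbel_softmax(logits, tau=1.0, hard=True))

# With torch.manual_seed(0) at the start of each pass: the same Gumbel
# noise is replayed every time, so every pass gives the identical result
for _ in range(3):
    torch.manual_seed(0)
    print(F.gumbel_softmax(logits, tau=1.0, hard=True))
```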

Could someone give me some comments on the above observations? Is this normal? Does something look wrong or strange in my experiments? Thanks in advance.