Different results for the same input depending on evaluation order

I have trained a simple network for 4-class sentence classification using
BertForSequenceClassification (from pytorch_pretrained_bert.modeling import BertForSequenceClassification).

Then I evaluate it on 2 sequences (sentences), putting into the DataLoader either

  1. only the second sentence, or
  2. both sentences.

In the first case, the result for the second sentence is
tensor([[-0.3797, 4.1902, -3.0362, -0.9368]])
In the second case, the result for the same second sentence is
tensor([[-0.0066, -2.3150, 3.2263, -0.3096]])

The snippet of my code is the following:

import torch

def evaluate(logger, model, device, eval_dataloader, eval_label_ids, num_labels, verbose=True):
    for input_ids, input_mask, segment_ids, label_ids in eval_dataloader:
        # Move the batch to the same device as the model
        input_ids = input_ids.to(device)
        input_mask = input_mask.to(device)
        segment_ids = segment_ids.to(device)
        label_ids = label_ids.to(device)

        # Forward pass without gradient tracking; print the raw logits
        with torch.no_grad():
            logits = model(input_ids, segment_ids, input_mask, labels=None)
            print(str(logits))

The results do not depend on the device used (CPU or GPU).
eval_batch_size is 1 in both cases.
If I put more sentences into the DataLoader, the results vary again.
What could cause this behaviour of the model,
and how can I fix it?
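
One thing that may be worth checking, as an assumption rather than a confirmed diagnosis: the snippet does not show a call to model.eval(), and torch.no_grad() only disables gradient tracking, not dropout. BertForSequenceClassification applies dropout before its classifier, so a model left in training mode can return different logits for the same input on every forward pass. The sketch below uses a hypothetical stand-in model (a dropout layer followed by a linear layer, not the actual BERT classifier) to show the effect in isolation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the classifier head: dropout followed by a
# linear layer with 4 outputs. This is NOT the real BERT model.
model = nn.Sequential(nn.Dropout(p=0.5), nn.Linear(64, 4))
x = torch.randn(1, 64)  # one made-up "sentence" representation


def run_twice(m, inp):
    """Run the same input through the model twice without gradients."""
    with torch.no_grad():
        return m(inp), m(inp)


# Training mode: dropout samples a new random mask on every call,
# even inside torch.no_grad().
model.train()
train_a, train_b = run_twice(model, x)
train_outputs_match = torch.equal(train_a, train_b)

# Evaluation mode: dropout is a no-op, so the forward pass is deterministic.
model.eval()
eval_a, eval_b = run_twice(model, x)
eval_outputs_match = torch.equal(eval_a, eval_b)

print("train mode, outputs match:", train_outputs_match)
print("eval mode, outputs match:", eval_outputs_match)
```

If this is what is happening here, calling model.eval() once before the loop in evaluate should make the logits for a given sentence independent of what else is in the DataLoader.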