The difference between training and testing mode

When I train the model, I split the data into train/val/test parts:

from torchtext import data
train_iter, val_iter, test_iter = data.Iterator.splits(
    (train_data, val_data, test_data),
    batch_sizes=(64, 640, 640),
    device=args.device, repeat=args.repeat)

I also evaluate the model's performance on the validation data while training, and the performance is extremely good (95%+). I remember to call model.eval() before evaluating and to switch back with model.train() afterwards.
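
Concretely, the toggle around the in-training evaluation looks roughly like this (a simplified sketch; compute_loss is the method shown further below, and torch.no_grad() is just there to skip gradient bookkeeping):

import torch

model.eval()                    # put BatchNorm (and Dropout) into inference mode
with torch.no_grad():           # no gradients needed while evaluating
    val_loss, val_acc = model.compute_loss(val_iter)
model.train()                   # switch back before resuming training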

Then I save the trained model to disk.
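
(For illustration only, since the exact snippet is not shown: the save/load follows the usual PyTorch pattern below, where PATH is a placeholder; my real code may instead save the whole model with torch.save(model, PATH).)

import torch

# after training: persist the learned parameters
torch.save(model.state_dict(), PATH)

# later, before testing: restore them into a freshly built model
model.load_state_dict(torch.load(PATH))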

Later, I load the model and test it on slice_train_data and slice_val_data:

slice_train_examples = train_examples[:6400]
slice_train_data = DS(*fields, examples=slice_train_examples) # DS is a class inherited from torchtext.data.Dataset
slice_val_examples = val_examples[:6400]
slice_val_data = DS(*fields, examples=slice_val_examples)
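
For context, DS is a thin wrapper around the base dataset class; a minimal sketch, assuming it does nothing beyond forwarding the arguments:

from torchtext import data

class DS(data.Dataset):
    # rebuild a Dataset from an existing list of Examples;
    # fields are the same (name, Field) pairs used for the full dataset
    def __init__(self, *fields, examples=None):
        super().__init__(examples, fields)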

Then I call model.compute_loss(data_iter):

def compute_loss(self, data_iter):  # run validation over one iterator
    corrects, avg_loss = 0, 0
    for batch in data_iter:
        loss, pos_n_energy, neg_n_energy = self.compute_batch_loss(batch)
        avg_loss += loss.item()  # accumulate the scalar batch loss
        corrects += self.corrects(pos_n_energy, neg_n_energy)
    size = len(data_iter.dataset)  # total number of examples
    avg_loss = avg_loss / size
    accuracy = 100.0 * corrects / size
    return avg_loss, accuracy
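
For example, on one of the iterators built below:

avg_loss, accuracy = model.compute_loss(val_iter)
print('loss: {:.6f}  acc: {:.4f}%'.format(avg_loss, accuracy))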

Here comes the confusing part.
I notice that the train set and the validation set are treated differently depending on the order in which I pass them to data.Iterator.splits:

train_iter, val_iter = data.Iterator.splits(
    (slice_train_data, slice_val_data),
    batch_sizes=(640, 640),
    device=0, repeat=False)

Then the accuracy of model.compute_loss(train_iter) is 53.4375% and the accuracy of model.compute_loss(val_iter) is 92.3356%.
However, if I swap the order of the two datasets:

val_iter, train_iter = data.Iterator.splits(
    (slice_val_data, slice_train_data),
    batch_sizes=(640, 640),
    device=0, repeat=False)

Then the accuracy of model.compute_loss(val_iter) is 49.8125%, and the accuracy of model.compute_loss(train_iter) is 83.3438%.

Why are the results for train_iter and val_iter so different from each other, and why do they change when I swap the order?
What is the correct way to build the splits for training and testing?

I have seen examples that set the batch_size of the validation set to the length of the data, say batch_sizes=(xx, len(slice_val_data)). However, my validation set is too big (28,000+ examples) to feed into my GPU in one batch, so I set a smaller batch_size for the validation set. Does this matter?
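
For reference, the pattern from those examples looks like this (a sketch; xx stands for whatever training batch size is used):

train_iter, val_iter = data.Iterator.splits(
    (slice_train_data, slice_val_data),
    batch_sizes=(xx, len(slice_val_data)),  # whole validation set as one batch
    device=0, repeat=False)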

BTW, my model is a CNN with BatchNorm. Does BatchNorm matter in this case?

Thank you very much.