Training forward pass faster than evaluation forward pass

I am trying to measure the runtime of the forward pass of ResNet-50 during training and during evaluation. The experiments run on a 64-core CPU; no GPU is involved. Here is the code I used:

import argparse
import time

import torch
import torch.optim as optim
import torchvision
import torchvision.datasets as datasets
import torchvision.transforms as transforms


parser = argparse.ArgumentParser()
parser.add_argument(
    "--batch_size",
    type=int,
    default=128,
    help="Batch size.",
)
parser.add_argument("--num_data", default=1024, type=int, help="Number of fake images.")
args = parser.parse_args()


# FakeData generates num_data random 3x224x224 images with integer labels in [0, 1000).
train_dataset = datasets.FakeData(
    args.num_data, (3, 224, 224), 1000, transforms.ToTensor()
)
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, num_workers=1, pin_memory=True
)
test_dataset = datasets.FakeData(
    args.num_data, (3, 224, 224), 1000, transforms.ToTensor()
)
test_loader = torch.utils.data.DataLoader(
    test_dataset, batch_size=args.batch_size, num_workers=1, pin_memory=True
)

model = torchvision.models.resnet50(pretrained=True)

optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()


print("==================== Training ====================")
model.train()
for i, (images, target) in enumerate(train_loader):
    optimizer.zero_grad()

    # Time only the forward pass; the backward pass and optimizer step are excluded.
    start = time.time()
    outputs = model(images)
    end = time.time()
    print(f"Train forward time: {(end - start) * 1000.0} ms")

    loss = criterion(outputs, target)
    loss.backward()
    optimizer.step()


print("==================== Evaluation ====================")
model.eval()
for i, (images, target) in enumerate(test_loader):
    with torch.no_grad():
        start = time.time()
        outputs = model(images)
        end = time.time()
        print(f"Eval forward time: {(end - start) * 1000.0} ms")

I noticed that if I run training first and then evaluation, a forward pass during evaluation is slightly faster than a forward pass during training, which is what I expected. The runtimes are shown below.

==================== Training ====================
Train forward time: 2100.9457111358643 ms
Train forward time: 1974.3893146514893 ms
Train forward time: 1945.4665184020996 ms
Train forward time: 1943.62211227417 ms
Train forward time: 1888.5083198547363 ms
Train forward time: 1859.039068222046 ms
Train forward time: 1811.948537826538 ms
Train forward time: 1805.4358959197998 ms
==================== Evaluation ====================
Eval forward time: 2370.067834854126 ms
Eval forward time: 2061.3937377929688 ms
Eval forward time: 1844.5143699645996 ms
Eval forward time: 1753.0148029327393 ms
Eval forward time: 1701.3907432556152 ms
Eval forward time: 1688.025712966919 ms
Eval forward time: 1813.1353855133057 ms
Eval forward time: 1647.9554176330566 ms

However, if I run evaluation only, by commenting out the training code block, a forward pass during evaluation becomes significantly slower, even slower than a forward pass during training.

==================== Evaluation ====================
Eval forward time: 2793.8458919525146 ms
Eval forward time: 2747.8232383728027 ms
Eval forward time: 2875.753164291382 ms
Eval forward time: 2738.5916709899902 ms
Eval forward time: 2732.877016067505 ms
Eval forward time: 2838.664770126343 ms
Eval forward time: 2783.207893371582 ms
Eval forward time: 2791.349411010742 ms

It looks like running training first somehow speeds up the forward pass during evaluation. How is this possible?
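
In case it helps, here is a minimal, self-contained timing sketch that takes the DataLoader out of the picture entirely: it times repeated eval-mode forward passes on a single fixed random batch. The batch size (128) and the repeat count (20) are arbitrary choices on my part.

import time

import torch
import torchvision

# Time repeated eval-mode forward passes on one fixed random batch,
# so data loading and worker processes cannot affect the measurements.
model = torchvision.models.resnet50(pretrained=True)
model.eval()

images = torch.randn(128, 3, 224, 224)

with torch.no_grad():
    for i in range(20):
        start = time.perf_counter()
        model(images)
        end = time.perf_counter()
        print(f"Iteration {i}: {(end - start) * 1000.0:.1f} ms")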