Batch size brings small error in prediction

Assume we train a model using X of shape (batch_size, num_feature) and Y of shape (batch_size, output_num), then use it to predict on some test input. Sometimes model(X)[-n:] and model(X[-n:]) give different results. For example:

import torch
from sklearn.datasets import fetch_california_housing
from torch.utils.data import TensorDataset, DataLoader
from sklearn.preprocessing import MinMaxScaler

# load the data
X, y = fetch_california_housing(return_X_y=True)

# normalize the data
scaler_x = MinMaxScaler()
scaler_y = MinMaxScaler()
X = scaler_x.fit_transform(X)
y = scaler_y.fit_transform(y.reshape(-1, 1))
X, y = torch.tensor(X, dtype=torch.float32), torch.tensor(
    y, dtype=torch.float32)
print(X.shape, y.shape)
train = TensorDataset(X, y)
train_loader = DataLoader(train, batch_size=64, shuffle=False)

# define the model
model = torch.nn.Sequential(
    torch.nn.Linear(X.shape[1], 64),
    torch.nn.Linear(64, 128),
    torch.nn.Linear(128, y.shape[1])
    # torch.nn.Sigmoid()
)

# define the loss function and the optimizer
criterion = torch.nn.MSELoss(reduction='mean')
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
for epoch in range(10):
    for x_batch, y_batch in train_loader:
        # forward pass: compute predicted y
        y_pred = model(x_batch)

        # compute loss
        loss = criterion(y_pred, y_batch)

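        # backward pass: reset gradients, backpropagate, update weights
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()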

# test the model
n = 3
y_pred_whole = model(X).detach().numpy().flatten()
y_pred_tail = model(X[-n:]).detach().numpy().flatten()
print(y_pred_whole[-n:] == y_pred_tail)
print(y_pred_whole[-n:] - y_pred_tail)

The output on my computer (torch version 1.12.1):

torch.Size([20640, 8]) torch.Size([20640, 1])
[False False False]
[7.450581e-09 7.450581e-09 7.450581e-09]

A gap on the order of 1e-09 lies between the two predictions, even though both are predictions for the last 3 samples of the training data and should therefore be identical. I know it may just be a floating-point precision issue (see IEEE 754), but it’s interesting that the error changes with n (the last 1, 2, 3, 4, … samples) and with the runtime environment (running the sample code on Colab with version 2.1.0+cu118, the difference becomes [-1.4901161e-08 0.0000000e+00 0.0000000e+00]). I never came across the same problem when using Keras, presumably because the batch size is fixed when building models. Does this mean we also have to fix the batch size of inputs when sharing a trained/pre-trained model with others in PyTorch?
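To put the size of that gap in perspective, here is a small sketch that compares it to the spacing between adjacent float32 values. The value 0.1 is only a hypothetical stand-in for a prediction in the range [0.0625, 0.125); the actual predictions are not shown above, so this is an illustration rather than a diagnosis.

```python
import numpy as np

# The difference printed above, stored as float32
observed = np.float32(7.450581e-09)

# Spacing (one unit in the last place) between adjacent float32 values near 0.1.
# 0.1 is a hypothetical prediction value, chosen only for illustration.
ulp = np.spacing(np.float32(0.1))

print(ulp)             # spacing of float32 values near 0.1
print(observed / ulp)  # the observed gap expressed in units of that spacing
```

Under that assumption the gap is a single float32 rounding step, which is consistent with the two results differing only in how the last bit was rounded.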

The small numerical mismatch is expected, as you’ve already described, due to limited floating-point precision. It becomes visible whenever the operations are performed in a different order, as in this simple example:

x = torch.randn(100, 100)
s1 = x.sum()
s2 = x.sum(0).sum(0)
print((s1 - s2).abs())
# tensor(2.4796e-05)

Note that neither result is “more correct”; both should show a similar error when compared against a wider dtype.
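One way to check this is to compare both float32 sums against the same sum computed in float64. This is a minimal sketch; the seed and the names ref, err1 and err2 are assumptions made for illustration, and the exact errors will vary by machine and PyTorch version.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability
x = torch.randn(100, 100)

# Two float32 summation orders, as in the example above
s1 = x.sum()
s2 = x.sum(0).sum(0)

# Reference computed in a wider dtype (float64)
ref = x.double().sum()

# Absolute error of each float32 result relative to the float64 reference
err1 = (s1.double() - ref).abs().item()
err2 = (s2.double() - ref).abs().item()
print(err1, err2)
```

Both errors should be of a similar, small magnitude. The float64 reference is itself rounded, just with far more precision.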

Never seeing this mismatch in Keras would mean that Keras/TF uses the same algorithm for each workload, which I doubt is the case.

No, you don’t have to fix the batch size when sharing a model, since the errors are expected and neither result is more correct than the other.
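In practice this means comparing predictions with a tolerance rather than with ==. The sketch below uses an untrained model with the same layer sizes and random input as stand-ins for the trained model and dataset from the question; the tolerances are illustrative choices, not recommended values.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability

# Untrained stand-in with the same layer sizes as the model in the question
model = torch.nn.Sequential(
    torch.nn.Linear(8, 64),
    torch.nn.Linear(64, 128),
    torch.nn.Linear(128, 1),
)
X = torch.rand(20640, 8)  # random stand-in for the scaled features
n = 3

with torch.no_grad():
    whole = model(X)[-n:]  # predict on everything, then slice
    tail = model(X[-n:])   # slice first, then predict

# Bitwise equality may or may not hold, depending on hardware and version
exact = torch.equal(whole, tail)
# Equality within a tolerance is the meaningful check
close = torch.allclose(whole, tail, rtol=1e-5, atol=1e-5)
print(exact, close)
```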
