Forward pass results differ with different batch sizes


I find that the forward pass result for the same data can differ depending on the batch size. Specifically, I ran the code below:

import torch
import torch.nn as nn
import numpy as np
from torch.utils.data import Dataset, DataLoader

class SimpleNetwork(nn.Module):
     def __init__(self):
          super().__init__()  # required before registering submodules on an nn.Module
          self.h1 = nn.Linear(16000, 1024)
          self.h2 = nn.Linear(1024, 512)
          self.h3 = nn.Linear(512, 256)
     def forward(self, data):
          x = self.h1(data)
          x = self.h2(x)
          o = self.h3(x)
          return o

class DummyDataset(Dataset):
     def __init__(self):
          self.data = np.random.randn(32,16000)

     def __len__(self):
          return len(self.data)

     def __getitem__(self, index):
          tmp_data = self.data[index]
          return tmp_data

if __name__ == "__main__":
     seed = 1365
     num_workers = 0

     device = torch.device("cuda")

     model = SimpleNetwork().to(device)  # the model must live on the same device as the data

     dataset = DummyDataset()
     dataloader_batch16 = DataLoader(dataset, batch_size=16, shuffle=False, num_workers=num_workers)
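     for batch_idx, data in enumerate(dataloader_batch16):
          print("first data item for batch 16")
          data = data.float().to(device)
          data_o = model(data)
          # Assumed: the print calls are missing from the post; abs().sum() is inferred from the printed values.
          print(data[0].abs().sum())
          print(data_o[0].abs().sum())
          break  # only the first batch is needed for the comparison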

     dataloader_batch1 = DataLoader(dataset, batch_size=1, shuffle=False, num_workers=num_workers)
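     for batch_idx, data in enumerate(dataloader_batch1):
          print("first data item for batch 1")
          data = data.float().to(device)
          data_o = model(data)
          # Assumed: the print calls are missing from the post; abs().sum() is inferred from the printed values.
          print(data[0].abs().sum())
          print(data_o[0].abs().sum())
          break  # only the first batch is needed for the comparison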

And I get

first data item for batch 16
tensor(12728.1973, device='cuda:0')
tensor(43.8060, device='cuda:0', grad_fn=<SumBackward0>)
first data item for batch 1
tensor(12728.1973, device='cuda:0')
tensor(43.8059, device='cuda:0', grad_fn=<SumBackward0>)

As you can see, the result for the first data item differs after the forward pass. This seems to happen on the GPU only. If I run the same code on the CPU, the result is the same in this case (I am not sure about other cases, though).

Any ideas on the reason?


This is expected due to limited floating-point precision: different algorithms can use a different order of operations. For float32 you would usually expect a relative error of ~1e-6, but it also depends on the number of accumulations, etc.
You would also expect to see the same on the CPU, as the effect is not hardware-specific but depends on the algorithms used:

x = torch.randn(10, 10, 10)
s1 = x.sum()
s2 = x.sum(0).sum(0).sum(0)
print((s1 - s2).abs())
# tensor(1.9073e-06)
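
The same effect can be reproduced without PyTorch or a GPU. As a minimal sketch using only the Python standard library (plain Python floats are float64, so the errors are smaller than in the float32 case; `naive_sum` is a hypothetical helper written for this illustration), regrouping or reordering the same additions changes the result:

```python
import math

def naive_sum(xs):
    """Add the elements one by one, left to right, with no compensation."""
    total = 0.0
    for x in xs:
        total += x
    return total

# Floating-point addition is not associative: regrouping changes the rounding.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)   # False

# Summing the same values in a different order can also change the result.
values = [1e16, 1.0, -1e16, 1.0]
forward = naive_sum(values)            # ((1e16 + 1) - 1e16) + 1  -> 1.0
reordered = naive_sum(sorted(values))  # ((-1e16 + 1) + 1) + 1e16 -> 0.0
exact = math.fsum(values)              # correctly rounded reference -> 2.0
print(forward, reordered, exact)
```

Neither order is "the right one" here; both are valid floating-point evaluations of the same mathematical sum, which is why a different batch size (and therefore possibly a different kernel) can shift the last digits.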

I see. Thank you for your reply.


I have a further question. I got different results for the sum of tensor elements in the following code:

import torch

seed = 1365     

device = torch.device('cuda')

data = []
n_loop = 1024
for _ in range(n_loop):


s = 0
for d in data:
     s += d


tensor(8169029.5000, device='cuda:0')

Is this caused by the same floating-point precision issue you mentioned before? In addition, I find the result is the same for some other n_loop values. Is that because, in those settings, tensor.sum() happens to do the calculation in the same order as the second loop?


Yes, the relative error is ~1e-7, so I would expect to see this small error based on the different order of operations.
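
To illustrate why a one-shot reduction and an element-by-element loop can disagree, here is a rough sketch that emulates float32 accumulation in pure Python. It does not reproduce how tensor.sum() is actually implemented; the helpers `to_f32`, `sequential_sum_f32`, and `chunked_sum_f32`, as well as the chunk size of 64 and the data size, are assumptions made up for this example:

```python
import math
import random
import struct

def to_f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sequential_sum_f32(xs):
    """Add elements one by one, rounding to float32 after every addition."""
    total = 0.0
    for x in xs:
        total = to_f32(total + x)
    return total

def chunked_sum_f32(xs, chunk=64):
    """Sum fixed-size chunks first, then add the partial sums (a different order)."""
    partials = [sequential_sum_f32(xs[i:i + chunk]) for i in range(0, len(xs), chunk)]
    return sequential_sum_f32(partials)

random.seed(1365)
data = [to_f32(random.random()) for _ in range(16000)]

s_seq = sequential_sum_f32(data)
s_chunk = chunked_sum_f32(data)
rel_diff = abs(s_seq - s_chunk) / abs(s_chunk)
print(s_seq, s_chunk, rel_diff)

# Compare with a relative tolerance instead of ==.
print(math.isclose(s_seq, s_chunk, rel_tol=1e-4))
```

In PyTorch itself, the equivalent check would be torch.allclose(a, b) with a tolerance suited to float32, rather than an exact equality test. Whether two orders happen to give bit-identical results for a particular size is a matter of how the rounding errors fall, so it should not be relied on.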
