nn.TransformerEncoderLayer output changes with batch size

I'm observing significantly different outputs from torch's TransformerEncoderLayer when changing the batch size (even though I used batch_first=True). Does anyone know why?

Code to minimally reproduce:

#######

import torch
import numpy as np
import random 
import os

os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"

def set_random_seeds(seed=0, device='cuda:0'):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  
    torch.use_deterministic_algorithms(True)  
    
    os.environ['PYTHONHASHSEED'] = str(seed)
    
    if device != 'cpu':
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False


set_random_seeds()


l1 = torch.nn.TransformerEncoderLayer(1024, 1, 1024, 0.0, batch_first=True).double().cuda()
l1.eval()


x = torch.rand((128, 1024)).double().cuda()
diff = (l1(x)[:2] - l1(x[:2])).abs().mean()
print(diff)

I get a diff of around 0.18 with torch version 2.3.1+cu121.

From the docs:

batch_first (bool) – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False (seq, batch, feature).

So in your case the input has no batch dimension: with batch_first=True a 2D input is treated as a single unbatched (seq, feature) sample, so x[:2] slices the seq dimension and the attention context changes (2 tokens vs. 128 tokens), which explains the large diff. Adding the missing batch dimension gives the expected small mismatch:

x = torch.rand((128, 128, 1024)).double().cuda()
diff = (l1(x)[:2] - l1(x[:2])).abs().mean()
print(diff)
tensor(5.5092e-16, device='cuda:0', dtype=torch.float64,
       grad_fn=<MeanBackward0>)
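
For completeness, a minimal sketch (reusing l1 and the seeded setup from the question): if the 128 rows were meant to be 128 independent samples, you can make the batch structure explicit with unsqueeze, so that [:2] slices over the batch dimension rather than the sequence. The x_batched name is just for illustration:

x = torch.rand((128, 1024)).double().cuda()
# (batch=128, seq=1, feature=1024): now [:2] selects the first two samples
x_batched = x.unsqueeze(1)
diff = (l1(x_batched)[:2] - l1(x_batched[:2])).abs().mean()
print(diff)  # only float64 rounding error, on the order of 1e-16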

Yes, that clears it up, I totally missed that.
Thank you!