Transformer dropout at inference time

Looking at the TransformerEncoderLayer and TransformerDecoderLayer code, it seems that dropout is applied unchanged at inference time.

I thought dropout was not used at inference time. Is that correct, or am I missing something?

The forward pass looks the same, but during inference you should switch the module to evaluation mode by calling model.eval(). This updates the module’s internal training flag and does the same for every submodule, so a single call on the top-level module is enough. The flag is used by modules whose behavior differs between training (entered via module.train()) and evaluation (entered via module.eval()); nn.Dropout (and e.g. nn.BatchNorm) rely on it to change their behavior accordingly.
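A minimal sketch of that behavior (the module names here are just an illustrative toy model, not the Transformer layers from the question): calling eval() on the top-level container flips the training flag on its submodules, and nn.Dropout then acts as the identity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))
x = torch.ones(2, 8)

model.train()                   # training mode: dropout zeroes ~half the activations
print(model[1].training)        # True -- the flag propagated to the Dropout submodule

model.eval()                    # eval mode: dropout becomes the identity
print(model[1].training)        # False -- one top-level call was enough
drop = model[1]
print(torch.equal(drop(x), x))  # True -- in eval mode Dropout passes inputs through
```

The same single model.eval() call is all that is needed for a full TransformerEncoderLayer/TransformerDecoderLayer stack, since the flag is propagated recursively.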

This means that calling module.eval() is not necessary when no layer behaves differently between training and evaluation (e.g. when all layers are nn.Linear or nn.Conv), but it’s still good practice.
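To illustrate that point (a minimal sketch with a single layer): a module made only of mode-independent layers such as nn.Linear produces identical outputs in train and eval mode, so forgetting model.eval() would not change the result here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin = nn.Linear(4, 4)  # no dropout, no batch norm
x = torch.randn(2, 4)

lin.train()
y_train = lin(x)
lin.eval()
y_eval = lin(x)

print(torch.equal(y_train, y_eval))  # True -- nn.Linear ignores the mode
```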

Yes, this was clear; I just had not understood that this is enforced inside the Dropout module itself.