Understanding BERT from Hugging Face

Hello all,

I’m new to BERT, and I’m trying to wrap my head around its structure.

from transformers import BertTokenizer, BertModel

model = BertModel.from_pretrained('bert-base-uncased')

# List every named parameter tensor together with its shape
named_params = list(model.named_parameters())
print(f"{len(named_params)} Parameters\n")

for name, param in named_params:
    print(f"{name:<55} {str(tuple(param.size())):>12}")

The output of this is:

199 Parameters

embeddings.word_embeddings.weight (30522, 768)
embeddings.position_embeddings.weight (512, 768)
embeddings.token_type_embeddings.weight (2, 768)
embeddings.LayerNorm.weight (768,)
embeddings.LayerNorm.bias (768,)
encoder.layer.0.attention.self.query.weight (768, 768)
encoder.layer.0.attention.self.query.bias (768,)
encoder.layer.0.attention.self.key.weight (768, 768)
encoder.layer.0.attention.self.key.bias (768,)
encoder.layer.0.attention.self.value.weight (768, 768)
encoder.layer.0.attention.self.value.bias (768,)
encoder.layer.0.attention.output.dense.weight (768, 768)
encoder.layer.0.attention.output.dense.bias (768,)
encoder.layer.0.attention.output.LayerNorm.weight (768,)
encoder.layer.0.attention.output.LayerNorm.bias (768,)
encoder.layer.0.intermediate.dense.weight (3072, 768)
encoder.layer.0.intermediate.dense.bias (3072,)
encoder.layer.0.output.dense.weight (768, 3072)
encoder.layer.0.output.dense.bias (768,)
encoder.layer.0.output.LayerNorm.weight (768,)
encoder.layer.0.output.LayerNorm.bias (768,)

...

encoder.layer.11.attention.self.query.weight (768, 768)
encoder.layer.11.attention.self.query.bias (768,)
encoder.layer.11.attention.self.key.weight (768, 768)
encoder.layer.11.attention.self.key.bias (768,)
encoder.layer.11.attention.self.value.weight (768, 768)
encoder.layer.11.attention.self.value.bias (768,)
encoder.layer.11.attention.output.dense.weight (768, 768)
encoder.layer.11.attention.output.dense.bias (768,)
encoder.layer.11.attention.output.LayerNorm.weight (768,)
encoder.layer.11.attention.output.LayerNorm.bias (768,)
encoder.layer.11.intermediate.dense.weight (3072, 768)
encoder.layer.11.intermediate.dense.bias (3072,)
encoder.layer.11.output.dense.weight (768, 3072)
encoder.layer.11.output.dense.bias (768,)
encoder.layer.11.output.LayerNorm.weight (768,)
encoder.layer.11.output.LayerNorm.bias (768,)
pooler.dense.weight (768, 768)
pooler.dense.bias (768,)

I understand that there are 12 layers (encoder blocks), each containing 12 attention heads. However, for every layer I can only see a single Wq matrix, a single Wk matrix, and a single Wv matrix.

Shouldn’t there be 12 of them, one per head? What am I getting wrong here?

The multi-head attention mechanism does not increase the number of parameters. Each layer has a single 768×768 projection for each of Q, K, and V, and the forward pass simply splits the projected 768-dimensional vectors into 12 slices of 64 dimensions, one per head, before computing the attention scores.

Here is a code example:
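A rough sketch of what happens inside a layer’s forward pass (this is not the exact BertSelfAttention code, just the same shapes as bert-base: hidden size 768 split into 12 heads of 64 dimensions each):

import math
import torch

hidden_size, num_heads = 768, 12
head_size = hidden_size // num_heads        # 64
batch, seq_len = 1, 10

# One projection per layer, shared by all heads: these are the
# (768, 768) query/key/value weights you see in the parameter list.
query = torch.nn.Linear(hidden_size, hidden_size)
key   = torch.nn.Linear(hidden_size, hidden_size)
value = torch.nn.Linear(hidden_size, hidden_size)

x = torch.randn(batch, seq_len, hidden_size)

def split_heads(t):
    # (batch, seq, 768) -> (batch, 12, seq, 64): each head gets a 64-dim slice
    return t.view(batch, seq_len, num_heads, head_size).permute(0, 2, 1, 3)

q, k, v = split_heads(query(x)), split_heads(key(x)), split_heads(value(x))

# Scaled dot-product attention, computed for the 12 heads in parallel
scores  = q @ k.transpose(-1, -2) / math.sqrt(head_size)   # (batch, 12, seq, seq)
probs   = scores.softmax(dim=-1)
context = probs @ v                                         # (batch, 12, seq, 64)

# Concatenate the heads back into a single 768-dim vector per token
context = context.permute(0, 2, 1, 3).reshape(batch, seq_len, hidden_size)
print(context.shape)   # torch.Size([1, 10, 768])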

Thank you, I’ll check it out

This is taken directly from the original Transformer paper, “Attention Is All You Need” (see the last paragraph of the multi-head attention section):

“In this work we employ h = 8 parallel attention layers, or heads. For each of these we use d_k = d_v = d_model/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.”

Thanks for sharing, Chris.

Yes, I got that after some time. I was just confused at first when I saw a single weight matrix per layer.
Thinking about it, though, multiplying a matrix A by 8 separate W_i matrices and then concatenating the results is the same as concatenating the 8 W_i matrices side by side first and performing a single multiplication.
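A quick sanity check in torch (arbitrary shapes, h = 8 heads of 64 dimensions just for illustration):

import torch

A = torch.randn(10, 768)                       # e.g. 10 token vectors
Ws = [torch.randn(768, 64) for _ in range(8)]  # 8 per-head projection matrices

# Multiply by each head's matrix, then concatenate the results...
per_head = torch.cat([A @ W for W in Ws], dim=-1)     # (10, 512)

# ...is the same as concatenating the matrices first, then multiplying once.
combined = A @ torch.cat(Ws, dim=-1)                  # (10, 512)

print(torch.allclose(per_head, combined, atol=1e-5))  # True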

Have a good day!

Yes, and having one big matrix that “contains” the per-head matrices also has performance benefits: you do one large matrix multiplication instead of a separate one per head.
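A rough illustration of that point (timings depend on hardware and backend, but the single fused matmul is generally at least as fast as the per-head version):

import time
import torch

x = torch.randn(32, 128, 768)            # (batch, seq, hidden)
big_W = torch.randn(768, 768)            # one fused projection matrix
head_Ws = list(big_W.split(64, dim=-1))  # the same weights, split into 12 per-head matrices

def bench(fn, n=100):
    fn()                                 # warm-up
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n

t_fused = bench(lambda: x @ big_W)
t_split = bench(lambda: torch.cat([x @ W for W in head_Ws], dim=-1))
print(f"fused: {t_fused*1e3:.2f} ms   per-head: {t_split*1e3:.2f} ms")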
