I wanted to try something.
There are two experiments, one on Bert and one on Albert. The task is: train Bert up to layer 5, keeping the other 7 layers fixed, and then at test time copy the weights of the 5th layer onto layers 6-12. I was able to do this for Bert, since each layer has its own parameters that I can manipulate separately.
Whereas in Albert, the architecture is such that the weights are shared across layers 1-12, so although I can train them up to layer 5, the layers aren't accessible separately, and I am unable to copy the weights onto the later layers.
Any suggestions on how I might go about this?
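For reference, the Bert freezing step was roughly along these lines (a minimal sketch, assuming `bert-base-uncased`; the layers are 0-indexed, so "layers 1-5" are `encoder.layer[0]` through `encoder.layer[4]`):

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# encoder.layer is a ModuleList of 12 BertLayer modules (indices 0-11).
# Freeze indices 5-11 (layers 6-12) so only the first five layers
# receive gradient updates during training.
for i, layer in enumerate(model.encoder.layer):
    if i >= 5:
        for param in layer.parameters():
            param.requires_grad = False
```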
I tried printing the entire model, for both Bert and Albert.
> (embeddings): BertEmbeddings(
> (word_embeddings): Embedding(30522, 768, padding_idx=0)
> (position_embeddings): Embedding(512, 768)
> (token_type_embeddings): Embedding(2, 768)
> (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
> (dropout): Dropout(p=0.1, inplace=False)
> )
> (encoder): BertEncoder(
> (layer): ModuleList(
> (0): BertLayer(
> (attention): BertAttention(
> (self): BertSelfAttention(
> (query): Linear(in_features=768, out_features=768, bias=True)
> (key): Linear(in_features=768, out_features=768, bias=True)
> (value): Linear(in_features=768, out_features=768, bias=True)
> (dropout): Dropout(p=0.1, inplace=False)
> )
> (output): BertSelfOutput(
> (dense): Linear(in_features=768, out_features=768, bias=True)
> (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
> (dropout): Dropout(p=0.1, inplace=False)
> )
> )
> (intermediate): BertIntermediate(
> (dense): Linear(in_features=768, out_features=3072, bias=True)
> )
> (output): BertOutput(
> (dense): Linear(in_features=3072, out_features=768, bias=True)
> (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
> (dropout): Dropout(p=0.1, inplace=False)
> )
> )
> (1): BertLayer(
> (attention): BertAttention(
> (self): BertSelfAttention(
> (query): Linear(in_features=768, out_features=768, bias=True)
> (key): Linear(in_features=768, out_features=768, bias=True)
> (value): Linear(in_features=768, out_features=768, bias=True)
> (dropout): Dropout(p=0.1, inplace=False)
> )
> (output): BertSelfOutput(
> (dense): Linear(in_features=768, out_features=768, bias=True)
> (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
> (dropout): Dropout(p=0.1, inplace=False)
> )
> )
> (intermediate): BertIntermediate(
> (dense): Linear(in_features=768, out_features=3072, bias=True)
> )
> (output): BertOutput(
> (dense): Linear(in_features=3072, out_features=768, bias=True)
> (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
> (dropout): Dropout(p=0.1, inplace=False)
> )
> )
And so on, for layers up to 11 (they are separately accessible).
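Since each Bert layer is its own module, copying the 5th layer's weights onto layers 6-12 at test time is straightforward. A sketch of what I mean (again assuming `bert-base-uncased`; in practice the model would be the trained checkpoint):

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Copy the trained 5th layer (index 4) onto layers 6-12 (indices 5-11).
# load_state_dict copies the tensor values, so each target layer ends up
# with its own independent copy of layer 5's weights.
trained_state = model.encoder.layer[4].state_dict()
for i in range(5, 12):
    model.encoder.layer[i].load_state_dict(trained_state)
```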
For Albert:
> Model(
> (albert): AlbertModel(
> (embeddings): AlbertEmbeddings(
> (word_embeddings): Embedding(30000, 128, padding_idx=0)
> (position_embeddings): Embedding(512, 128)
> (token_type_embeddings): Embedding(2, 128)
> (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
> (dropout): Dropout(p=0, inplace=False)
> )
> (encoder): AlbertTransformer(
> (embedding_hidden_mapping_in): Linear(in_features=128, out_features=768, bias=True)
> (albert_layer_groups): ModuleList(
> (0): AlbertLayerGroup(
> (albert_layers): ModuleList(
> (0): AlbertLayer(
> (full_layer_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
> (attention): AlbertAttention(
> (query): Linear(in_features=768, out_features=768, bias=True)
> (key): Linear(in_features=768, out_features=768, bias=True)
> (value): Linear(in_features=768, out_features=768, bias=True)
> (dropout): Dropout(p=0, inplace=False)
> (dense): Linear(in_features=768, out_features=768, bias=True)
> (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
> )
> (ffn): Linear(in_features=768, out_features=3072, bias=True)
> (ffn_output): Linear(in_features=3072, out_features=768, bias=True)
> )
> )
> )
> )
> )
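As the printout shows, the encoder holds a single `AlbertLayer` that is simply reused for every one of the 12 forward passes. A quick check (a sketch assuming `albert-base-v2`, which matches the 30000 x 128 embeddings above):

```python
from transformers import AlbertModel

model = AlbertModel.from_pretrained("albert-base-v2")

# All 12 "layers" are forward passes through the same module: the encoder
# contains one layer group with one AlbertLayer inside it.
print(model.config.num_hidden_layers)                           # 12
print(len(model.encoder.albert_layer_groups))                   # 1
print(len(model.encoder.albert_layer_groups[0].albert_layers))  # 1
```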
So the weights can't be copied directly, because they are literally the same weights. I want the weights to be trained only up to layer 5, and then to copy the layer-5 weights onto layers 6-12 without training them explicitly. How do I go about this?