Albert Model Display and Copying weights inside Albert


I wanted to try something.

There are two experiments, one on Bert and one on Albert. The task is: train Bert up to layer 5, keeping the other 7 layers fixed, and then at test time copy the weights of the 5th layer onto layers 6-12. I was able to do this successfully, since Bert has separate parameters for each layer and I can manipulate them individually.
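Here is roughly what I did for the Bert case (a minimal sketch, assuming `bert-base-uncased` from Hugging Face Transformers; the actual fine-tuning loop is omitted):

```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Freeze encoder layers 6-12 (indices 5-11) so only the embeddings and
# layers 1-5 (indices 0-4) get updated during fine-tuning.
for layer in model.encoder.layer[5:]:
    for param in layer.parameters():
        param.requires_grad = False

# ... fine-tune the model here ...

# At test time, copy the weights of the 5th layer (index 4) onto layers 6-12.
# This works because every BertLayer owns its own parameters.
with torch.no_grad():
    fifth_layer_state = model.encoder.layer[4].state_dict()
    for layer in model.encoder.layer[5:]:
        layer.load_state_dict(fifth_layer_state)
```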

In Albert, however, the architecture shares all the weights across layers 1-12. So although I can train them up to layer 5, the layers aren't accessible separately, and I am unable to copy the weights over onto the later layers.
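For reference, the sharing is visible if you poke at the module structure (a sketch, assuming `albert-base-v2`): the encoder holds a single `AlbertLayer` and simply reuses it for every one of the `num_hidden_layers` passes, so there are no per-layer modules to copy onto.

```python
from transformers import AlbertModel

model = AlbertModel.from_pretrained("albert-base-v2")

# The encoder contains a single AlbertLayerGroup with a single AlbertLayer;
# the forward pass reuses it config.num_hidden_layers times.
shared_layer = model.encoder.albert_layer_groups[0].albert_layers[0]

print(model.config.num_hidden_layers)          # 12 "virtual" layers
print(len(model.encoder.albert_layer_groups))  # but only 1 layer group
print(len(model.encoder.albert_layer_groups[0].albert_layers))  # with 1 real layer

# So "layer 5" and "layer 12" are literally the same nn.Module backed by the
# same tensors -- there is nothing separate to copy weights onto.
```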

Any suggestions how I might go about this?

I tried printing the entire model for both Bert and Albert.

For Bert:

> (embeddings): BertEmbeddings(
>       (word_embeddings): Embedding(30522, 768, padding_idx=0)
>       (position_embeddings): Embedding(512, 768)
>       (token_type_embeddings): Embedding(2, 768)
>       (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
>       (dropout): Dropout(p=0.1, inplace=False)
>     )
> (encoder): BertEncoder(
>       (layer): ModuleList(
>         (0): BertLayer(
>           (attention): BertAttention(
>             (self): BertSelfAttention(
>               (query): Linear(in_features=768, out_features=768, bias=True)
>               (key): Linear(in_features=768, out_features=768, bias=True)
>               (value): Linear(in_features=768, out_features=768, bias=True)
>               (dropout): Dropout(p=0.1, inplace=False)
>             )
>             (output): BertSelfOutput(
>               (dense): Linear(in_features=768, out_features=768, bias=True)
>               (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
>               (dropout): Dropout(p=0.1, inplace=False)
>             )
>           )
>           (intermediate): BertIntermediate(
>             (dense): Linear(in_features=768, out_features=3072, bias=True)
>           )
>           (output): BertOutput(
>             (dense): Linear(in_features=3072, out_features=768, bias=True)
>             (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
>             (dropout): Dropout(p=0.1, inplace=False)
>           )
>         )
> (1): BertLayer(
>           (attention): BertAttention(
>             (self): BertSelfAttention(
>               (query): Linear(in_features=768, out_features=768, bias=True)
>               (key): Linear(in_features=768, out_features=768, bias=True)
>               (value): Linear(in_features=768, out_features=768, bias=True)
>               (dropout): Dropout(p=0.1, inplace=False)
>             )
>             (output): BertSelfOutput(
>               (dense): Linear(in_features=768, out_features=768, bias=True)
>               (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
>               (dropout): Dropout(p=0.1, inplace=False)
>             )
>           )
>           (intermediate): BertIntermediate(
>             (dense): Linear(in_features=768, out_features=3072, bias=True)
>           )
>           (output): BertOutput(
>             (dense): Linear(in_features=3072, out_features=768, bias=True)
>             (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
>             (dropout): Dropout(p=0.1, inplace=False)
>           )
>         )

And so on, for layers up to 11. (They are separately accessible.)

For Albert:

> Model(
>   (albert): AlbertModel(
>     (embeddings): AlbertEmbeddings(
>       (word_embeddings): Embedding(30000, 128, padding_idx=0)
>       (position_embeddings): Embedding(512, 128)
>       (token_type_embeddings): Embedding(2, 128)
>       (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
>       (dropout): Dropout(p=0, inplace=False)
>     )
> (encoder): AlbertTransformer(
>       (embedding_hidden_mapping_in): Linear(in_features=128, out_features=768, bias=True)
>       (albert_layer_groups): ModuleList(
>         (0): AlbertLayerGroup(
>           (albert_layers): ModuleList(
>             (0): AlbertLayer(
>               (full_layer_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
>               (attention): AlbertAttention(
>                 (query): Linear(in_features=768, out_features=768, bias=True)
>                 (key): Linear(in_features=768, out_features=768, bias=True)
>                 (value): Linear(in_features=768, out_features=768, bias=True)
>                 (dropout): Dropout(p=0, inplace=False)
>                 (dense): Linear(in_features=768, out_features=768, bias=True)
>                 (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
>               )
>               (ffn): Linear(in_features=768, out_features=3072, bias=True)
>               (ffn_output): Linear(in_features=3072, out_features=768, bias=True)
>             )
>           )
>         )
>       )
>     )

The weights can’t be copied directly, as they are the same weights. (I want the weights trained only up to layer 5, and then to copy the layer-5 weights onto layers 6-12, without training them explicitly.) How do I go about this?