Load pretrained weights of BERT, XLNet, GPT into torch.nn.Transformer

I am working on a script for NMT in which I want to use BERT as the encoder and GPT as the decoder. Since I need to pass the encoder's hidden states to the decoder, I want to know how I can use the torch.nn.Transformer module for this. Going through the Hugging Face Transformers library, I found that GPT expects past key-value pairs as input, with shape (2, batch_size, num_heads, sequence_length, embed_size_per_head) per layer, whereas BERT outputs the hidden states of each layer (plus the embedding layer). So I would also like to know how I could get such key-value pairs from BERT.
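
For context, here is a minimal sketch of the two pieces as I currently understand them. The model names, the batch_first flag, and the exact output attributes are my assumptions based on recent versions of PyTorch and transformers, not a working pipeline:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer, GPT2LMHeadModel, GPT2Tokenizer

# Encoder side: BERT exposes per-layer hidden states, not (key, value) pairs.
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

src = bert_tok("guten Morgen", return_tensors="pt")
bert_out = bert(**src)
memory = bert_out.last_hidden_state   # (batch_size, src_len, 768)
all_layers = bert_out.hidden_states   # tuple: embedding layer + 12 encoder layers

# torch.nn.TransformerDecoder takes the encoder hidden states directly
# as "memory", so wiring BERT into it seems straightforward:
dec_layer = nn.TransformerDecoderLayer(d_model=768, nhead=12, batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
tgt = torch.zeros(1, 5, 768)               # dummy target-side embeddings
dec_out = decoder(tgt=tgt, memory=memory)  # (1, 5, 768)

# Decoder side: GPT-2 instead caches one (key, value) pair per layer,
# each of shape (batch_size, num_heads, seq_len, embed_size_per_head),
# produced by its own self-attention -- this is the format I do not
# know how to derive from BERT's hidden states.
gpt_tok = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
tgt_ids = gpt_tok("good morning", return_tensors="pt")["input_ids"]
gpt_out = gpt2(tgt_ids, use_cache=True)
past = gpt_out.past_key_values
```

As far as I can tell, torch.nn.Transformer never needs explicit key-value pairs because its cross-attention projects the keys and values from `memory` internally, whereas GPT's cache stores the already-projected pairs, which is why I am unsure how to bridge the two.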