Dimensions after conv2d

I am trying to implement the following from the Read, Attend and Code paper:

We use the definition tables of the diagnoses and procedure codes, concatenate long and short titles together for all n_y codes, and build C_T first. By tokenizing C_T with n_t tokens, we have a title matrix T where T ∈ R^(n_y × n_t). From T input, the module extracts a code-title embedding of dimension d by using an embedding layer followed by a single CNN layer and Global Max Pooling layer. We let E_t ∈ R^(n_y × d) be the extracted code-title embedding matrix. In the model, each concatenated code title is padded to n_t = 36 tokens, the same pre-trained Word2Vec Skip-gram model weights that the reader used are loaded to initialize the embedding layer, and a single CNN layer with d = 300 filters, kernel size 10, and tanh activation function are used.
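
Concretely, with my label set the shapes this describes would be as follows (the 1600 is my own number of codes; n_t = 36 and d = 300 are from the paper, and the vocab size below is made up):

import torch

n_y, n_t, d = 1600, 36, 300             # my label count; n_t and d are from the paper
T = torch.randint(0, 1000, (n_y, n_t))  # dummy title matrix of token ids, shape (n_y, n_t)
# target: E_t of shape (n_y, d) = (1600, 300) after embedding -> CNN -> global max pooling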

I have implemented this portion like so:

import torch
import torch.nn as nn

class Code_Title_Embedding(nn.Module):
    def __init__(self, config, embedding_weights):
        super(Code_Title_Embedding, self).__init__()
        self.config = config
        self.embedding = nn.Embedding.from_pretrained(embedding_weights, freeze=False)
        self.conv = nn.Conv2d(in_channels=300, out_channels=300, kernel_size=10)
        self.tanh = nn.Tanh()

    def forward(self, code_titles):
        x = self.embedding(code_titles)  # shape from this is (1, 1600 labels, 36 tokens, 300 embed size)
        x = x.permute(0, 3, 1, 2)        # shape is now (1, 300 embed size, 1600 labels, 36 tokens)
        x = self.conv(x)                 # shape is now (1, 300 embed size, 1591 labels, 27 tokens) -> UH OH
        x, _ = torch.max(x, dim=3)       # max over the token dim for each label, shape is now (1, 300, 1591)
        x = x.permute(0, 2, 1)           # shape is now (1, 1591, 300)
        x = self.tanh(x)                 # tanh activation
        return x
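
For reference, here is a minimal shape check that reproduces what I'm seeing, using the class above (the weights and vocab size here are just dummies):

dummy_weights = torch.randn(1000, 300)               # stand-in for the pre-trained Word2Vec weights
module = Code_Title_Embedding(config=None, embedding_weights=dummy_weights)
code_titles = torch.randint(0, 1000, (1, 1600, 36))  # (batch, num_labels, num_tokens)
out = module(code_titles)
print(out.shape)                                     # torch.Size([1, 1591, 300]) -- I want (1, 1600, 300)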

Clearly I'm missing something, since the shape after conv2d is (1, 300, 1591, 27). Am I doing something wrong? I know padding the convolution is a quick fix, but wouldn't that amount to not really representing some of my labels? As written, I lose information along the num_labels dimension: 9 of my 1600 labels end up with no row in the label matrix, which I need later as the query in a custom attention block that computes similarity between the labels and my input tensors (via key and value). The author specifies that this object should be (batch_size, num_labels, embed_size), so I don't think padding is appropriate.

Do you think the author applies conv1d after global max pooling instead, to preserve num_labels? It seems it has to be either embedding layer → global max pooling → conv1d, or embedding layer → conv2d → global max pooling. With the first option I end up with (1, 1600, 27), which at least preserves num_labels, since as I understand it conv1d and conv2d layers are expected to reduce the length dimension. But the wording makes it sound like global max pooling happens after the conv layer, and the author uses a 1d conv earlier in the network across the input embeddings in the same way. How confusing!
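
For what it's worth, the only way I can see to keep num_labels intact while still doing conv → global max pooling is to treat the labels as a batch dimension and run a 1d conv over the tokens only. This is just my own sketch of that reading, not something the paper spells out:

import torch
import torch.nn as nn

class Code_Title_Embedding_1d(nn.Module):
    # sketch only: fold labels into the batch so the conv never mixes different labels
    def __init__(self, embedding_weights):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(embedding_weights, freeze=False)
        self.conv = nn.Conv1d(in_channels=300, out_channels=300, kernel_size=10)
        self.tanh = nn.Tanh()

    def forward(self, code_titles):              # (batch, num_labels, num_tokens)
        b, n_labels, n_tokens = code_titles.shape
        x = self.embedding(code_titles)          # (b, 1600, 36, 300)
        x = x.view(b * n_labels, n_tokens, -1)   # fold labels into the batch dim
        x = x.permute(0, 2, 1)                   # (b*1600, 300, 36) for Conv1d
        x = self.tanh(self.conv(x))              # (b*1600, 300, 27)
        x = torch.max(x, dim=2).values           # global max pool over tokens -> (b*1600, 300)
        return x.view(b, n_labels, -1)           # (b, 1600, 300), i.e. (batch, num_labels, embed_size)

Would something like this match what the paper means, or am I misreading the order of the conv and the pooling?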