Reshaping the matrix in a proper way for convolution

Hi everyone,

I am new to PyTorch (and to row-major computations). I would like to build a convolutional neural network for text-based applications. My batch size is 64 (64 sentences per batch), the embedding size is 200, and each sentence contains 80 words. Inside the model (in the __init__ method) I initialize my embedding and convolution layers as follows:

batch_size = 64
embedding_dim = 200
vocabulary_size = 100
sentence_len = 80
out_channel = 100
self.embedding = nn.Embedding(vocabulary_size,embedding_dim)
# here is the convolutional layer I want to use:
self.conv1 = nn.Conv2d(1,out_channel, kernel_size = (2,embedding_dim))

My input x contains the vocabulary indices of the words of the sentences in my batch:
x.shape = (batch_size, sentence_len).

When I pass x through the embedding layer I obtain a tensor of shape [batch_size, sentence_len, embedding_dim].
As far as I know, I should reshape it to [batch_size, in_channels, H, W] to use it as input to the convolutional layer.
My question: if I do it as follows, will it reshape my embeddings correctly?

def forward(self, x, y):
    embeds = self.embedding(x)  # embeds shape is 64 x 80 x 200
    embeds2 = embeds.view(batch_size, 1, sentence_len, embedding_dim)
    out = self.conv1(embeds2)

If you use nn.Conv2d you would have to stick to your shapes, i.e. [batch_size, c, h, w].
In your example the conv layer would convolve the embedding tensor with its kernel over both the sentence_len and embedding_dim dimensions. I’m not sure that’s the most useful approach for your use case.

Maybe a nn.Conv1d layer would fit a bit better, as it expects the input to be in the shape [batch_size, channels, length]. If you permute your dimensions to get [batch_size, embedding_dim, sentence_len], each filter will use all channels (in the default settings) and slide its kernel along the sentence_len dimension.
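
As a rough sketch of the shapes involved (reusing the sizes from your post, so the exact numbers are just an example):

import torch
import torch.nn as nn

batch_size, sentence_len, embedding_dim, out_channel = 64, 80, 200, 100
embeds = torch.randn(batch_size, sentence_len, embedding_dim)  # stand-in for the nn.Embedding output

conv1d = nn.Conv1d(embedding_dim, out_channel, kernel_size=2)
out = conv1d(embeds.permute(0, 2, 1))  # input [64, 200, 80] -> output [64, 100, 79]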

Hi @ptrblck thank you so much for your reply.

This isn’t quite what I want to do. I would like to convolve the embeddings with a 2 x embedding_dim kernel. I’ve read your reply, but since I am pretty new to PyTorch, I couldn’t understand it completely. If you don’t mind, could you explain it a little more? (Maybe with a simple example?)

Let me give more details about my problem. I would like to apply n-gram filters over my sentences, where n is 2. So I want to convolve the embeddings of every two consecutive words in the sentence. When I check the output of my convolution operation, it seems to give the correct shape (64, 100, 79, 1), where 79 = sentence_len - 2 + 1, but the results I obtain tell me I have a bug somewhere in the code.
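
A minimal shape check of that step (assuming the same dimensions as in my first post):

import torch
import torch.nn as nn

embeds = torch.randn(64, 80, 200)               # [batch_size, sentence_len, embedding_dim]
conv = nn.Conv2d(1, 100, kernel_size=(2, 200))  # bigram filter spanning the full embedding width
out = conv(embeds.unsqueeze(1))                 # add the single input channel
print(out.shape)                                # torch.Size([64, 100, 79, 1]), 79 = 80 - 2 + 1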

Hi again ptrblck,

I tried to do what you suggested, but I guess it did not change anything. Could you check (if possible) whether this is what you really suggested?

There is more than one convolution layer with different kernel sizes in my implementation, but they all operate on the same embedding matrix. Therefore, if I can fix one of them, I will be able to apply the same fix to all of them.

class Anger2(nn.Module):
    # out_channel: number of output channels of each convolution (100)
    # dense_out: output dimension of the first dense layer (30)
    def __init__(self,embedding_matrix,out_channel,dense_out,n_gram_num):
        super(Anger2, self).__init__()
        num_embeddings,embedding_dim = embedding_matrix.shape
        self.embedding = nn.Embedding(num_embeddings,embedding_dim)
        self.embedding.weight = nn.Parameter(torch.tensor(embedding_matrix,dtype=torch.float))
        self.conv2  = nn.Conv1d(embedding_dim,out_channel, kernel_size = 2)
        self.conv3  = nn.Conv1d(embedding_dim,out_channel, kernel_size = 3)
        self.conv4  = nn.Conv1d(embedding_dim,out_channel, kernel_size = 4)
        self.conv5  = nn.Conv1d(embedding_dim,out_channel, kernel_size = 5)
        self.conv6  = nn.Conv1d(embedding_dim,out_channel, kernel_size = 6) 
        # https://pytorch.org/docs/stable/nn.html#torch.nn.BatchNorm1d
        self.dense1_bn = nn.BatchNorm1d(out_channel*n_gram_num)
        self.drop = nn.Dropout(p=0.5)
        self.fc1 = nn.Linear(out_channel*n_gram_num, dense_out)
        self.fc2 = nn.Linear(dense_out, 1)

     
    def forward(self,x,y):
        # x shape: BSxMAX_LEN
        embeds     = self.embedding(x) # BS(64) x MAX_LEN(80) x EMBED_SIZE(200)
        embeds = embeds.view(x.shape[0],-1,x.shape[1]) 

        o2 = self.conv2(embeds)
        o2 = torch.squeeze(F.max_pool1d(o2,o2.shape[2]))

        o3 = self.conv3(embeds)
        o3 = torch.squeeze(F.max_pool1d(o3,o3.shape[2]))
        
        o4 = self.conv4(embeds)
        o4 = torch.squeeze(F.max_pool1d(o4,o4.shape[2]))
        
        o5 = self.conv5(embeds)
        o5 = torch.squeeze(F.max_pool1d(o5,o5.shape[2]))
 
        o6 = self.conv6(embeds)
        o6 = torch.squeeze(F.max_pool1d(o6,o6.shape[2]))

        concatenated = torch.cat((o2,o3,o4,o5,o6), 1) # 64x500
        concatenated = self.dense1_bn(concatenated)
        out = self.drop(concatenated)
        out = F.sigmoid(self.fc1(out))
        out = F.sigmoid(self.fc2(out))
        return out

I’ve created a small example here:

batch_size = 64
embedding_dim = 200
vocabulary_size = 100
sentence_len = 80
out_channel = 100

embedding = nn.Embedding(vocabulary_size, embedding_dim)
conv1 = nn.Conv1d(embedding_dim, out_channel, kernel_size=2)

x = torch.empty(batch_size, sentence_len, dtype=torch.long).random_(vocabulary_size)
output = embedding(x)
output = output.permute(0, 2, 1)
output = conv1(output)

I used the Word Embeddings Tutorial as a reference implementation.

The embedding layer will return an output of shape [batch_size, sentence_len, embedding_dim]. After the permute call the shape will be [batch_size, embedding_dim, sentence_len], so that the following conv layer will use all embedding_dim channels for the convolution and a kernel_size=2 so that two neighboring “words” will be used.

Let me know, if that would work for you or if I misunderstood something.


Again, thank you so much for the code snippet. Actually, I was doing almost the same as you did. Unlike you, I was using view rather than permute. I changed it to permute too. However, I still don’t get the results reported in the paper.

The architecture explained in the paper can be seen below. I believe that is exactly what I’ve implemented.

Do you have any idea what could be wrong?

[architecture figure from the paper]
Below I again share my complete architecture:

class Anger2(nn.Module):
    # out_channel: number of output channels of each convolution (200; above I described it as 100, but it is actually 200)
    # dense_out: output dimension of the first dense layer (30)
    def __init__(self,embedding_matrix,out_channel,dense_out,n_gram_num):
        super(Anger2, self).__init__()
        num_embeddings,embedding_dim = embedding_matrix.shape
        self.embedding = nn.Embedding(num_embeddings,embedding_dim)
        self.embedding.weight = nn.Parameter(torch.tensor(embedding_matrix,dtype=torch.float))
        self.conv2  = nn.Conv1d(embedding_dim,out_channel, kernel_size = 2)
        self.conv3  = nn.Conv1d(embedding_dim,out_channel, kernel_size = 3)
        self.conv4  = nn.Conv1d(embedding_dim,out_channel, kernel_size = 4)
        self.conv5  = nn.Conv1d(embedding_dim,out_channel, kernel_size = 5)
        self.conv6  = nn.Conv1d(embedding_dim,out_channel, kernel_size = 6) 
        # https://pytorch.org/docs/stable/nn.html#torch.nn.BatchNorm1d
        self.dense1_bn = nn.BatchNorm1d(out_channel*n_gram_num)
        self.drop = nn.Dropout(p=0.5)
        self.fc1 = nn.Linear(out_channel*n_gram_num, dense_out)
        self.sg1 = nn.Sigmoid()
        self.fc2 = nn.Linear(dense_out, 1)
        self.sg2 = nn.Sigmoid()
         
    def forward(self,x):
        # x shape: BSxMAX_LEN
        embeds     = self.embedding(x) # BS(64) x MAX_LEN(80) x EMBED_SIZE(200)
        embeds = embeds.permute(0, 2, 1)
        o2 = self.conv2(embeds) # 64,200,79
        o2 = F.max_pool1d(o2,o2.shape[2])# 64,200,1
        o2 = torch.squeeze(o2) # 64,200

        o3 = self.conv3(embeds)
        o3 = torch.squeeze(F.max_pool1d(o3,o3.shape[2]))
        
        o4 = self.conv4(embeds)
        o4 = torch.squeeze(F.max_pool1d(o4,o4.shape[2]))
        
        o5 = self.conv5(embeds)
        o5 = torch.squeeze(F.max_pool1d(o5,o5.shape[2]))
 
        o6 = self.conv6(embeds)
        o6 = torch.squeeze(F.max_pool1d(o6,o6.shape[2]))
       
        concatenated = torch.cat((o2,o3,o4,o5,o6), 1) # 64x1000
        concatenated = self.dense1_bn(concatenated)
        out = self.drop(concatenated)
        out = self.sg1(self.fc1(out))
        out = self.sg2(self.fc2(out))
        return out


# AND HERE IS TRAINING
EPOCH=30
BATCH_SIZE=64

loss_function = nn.MSELoss() # take sqrt to use it as RMSE
model = Anger2(embeddings,FILTER_NUMBER,DENSE_OUT,N_GRAM_NUM)
model.to(device)
#print(model)
optimizer = optim.Adam(model.parameters())
trn_batches = make_batch(trn,label_trn,is_shuffle=True)
dev_batches = make_batch(dev,label_dev,is_shuffle=True)
best_val = 1000.0 
for epoch in range(EPOCH):
    total_loss = 0.0
    model.train()
    for (trn_x,trn_y) in trn_batches:
        model.zero_grad()
        trn_x = trn_x.to(device)
        trn_y = trn_y.to(device)
        outs = model(trn_x)
        loss = torch.sqrt(loss_function(outs,trn_y))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    avg_trn_loss = total_loss/len(trn_batches)
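
Roughly, the rest of the epoch loop evaluates on dev_batches and tracks best_val; here is a sketch (the checkpoint filename is just a placeholder):

    # evaluate on the dev set after each epoch
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for (dev_x, dev_y) in dev_batches:
            dev_x = dev_x.to(device)
            dev_y = dev_y.to(device)
            outs = model(dev_x)
            val_loss += torch.sqrt(loss_function(outs, dev_y)).item()
    avg_val_loss = val_loss / len(dev_batches)
    if avg_val_loss < best_val:  # keep the best model so far
        best_val = avg_val_loss
        torch.save(model.state_dict(), 'best_model.pt')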

Thanks for the code! I compared the figures from the paper and your implementation and it looks good!
What results do you get? Is your model much worse than their results?

Hello again! I found a bug in the data preprocessing part of my code. Now it works as expected! Thank you so much for your help.
