Bag of Words embeddings problem

I am trying to implement a neural network using a term-document matrix (passed through sign() so it is effectively a one-hot-encoded matrix). My term-document matrix has 25k rows and 800,000 columns, stored in scipy CSR format. I created a dictionary called res_dict that holds the indexes of the non-zero values in each document, like:

res_dict = {key: [] for key in x.nonzero()[0]}

for row, col in zip(*x.nonzero()):
    res_dict[row].append(col)
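As a side note, if `x` is a CSR matrix, the same dictionary can be built straight from its internals, since each row's non-zero column indexes are already stored contiguously. A small sketch on a toy matrix (standing in for the real 25k × 800,000 one):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy 3-document, 6-term matrix standing in for the real (25k, 800000) one.
x = csr_matrix(np.array([
    [1, 0, 2, 0, 0, 0],
    [0, 3, 0, 0, 1, 0],
    [0, 0, 0, 4, 0, 5],
]))

# For a CSR matrix, row i's non-zero column indexes live in
# x.indices[x.indptr[i]:x.indptr[i + 1]], so no loop over every
# non-zero entry is needed.
res_dict = {i: x.indices[x.indptr[i]:x.indptr[i + 1]].tolist()
            for i in range(x.shape[0])}

print(res_dict)  # {0: [0, 2], 1: [1, 4], 2: [3, 5]}
```

Unlike the `x.nonzero()` version, this also gives empty documents an (empty) entry instead of skipping them.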

My network will be like this:

(w + w_adjusted) * r

where w is the learnable parameter, w_adjusted is a constant scalar, and r is a constant vector of shape (1, 800000). r comes from Naive Bayes. I will set w_adjusted to 0.4. I want to give credit to the Naive Bayes result, so I add w_adjusted to ensure that regularization drives the effective weights toward w_adjusted rather than toward 0.
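If I understand the interpolation correctly, the point is that when regularization shrinks w all the way to 0, the model degenerates to a scaled Naive Bayes classifier instead of a zero predictor. A minimal numeric sketch (toy r values, not the real ones):

```python
import torch

r = torch.tensor([[0.5, -1.2, 0.8]])  # toy Naive Bayes log-count ratios
w = torch.zeros(1, 3)                 # weights fully shrunk by regularization
w_adj = 0.4

# With w driven to 0, the effective per-word weight is w_adj * r,
# i.e. 0.4 times the Naive Bayes weight rather than 0.
effective = (w + w_adj) * r
print(effective)  # tensor([[ 0.2000, -0.4800,  0.3200]])
```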

w has shape (1, 800000) — a weight for every word in the dictionary — and r has shape (1, 800000). I will feed res_dict[idx] to the network, do a lookup, and compute the output. The problem is that I don't know how to do an efficient lookup for a batch: I can do it for a single entry, but writing a for loop to build the corresponding w etc. does not seem efficient.

Here is my network so far:

class Model(torch.nn.Module):
    def __init__(self, nf, r, b, w_adj=0.4):
        super().__init__()  # required before registering parameters
        self.w_adj = w_adj
        self.w = nn.Parameter(torch.zeros(1, nf))  # (1, 800000)
        self.b = nn.Parameter(torch.tensor(b))     # scalar bias
        self.r = r                                 # (1, 800000), constant
    def forward(self, x):
        x = x * (self.w + self.w_adj)  # It will not be like this, we should do lookup here
        x = x @ self.r.T + self.b
        return x
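For binary (sign) features, the dense product `x * (w + w_adj) @ r.T` equals, per document, a sum of `(w[j] + w_adj) * r[j]` over the document's non-zero columns. So one way to batch the lookup is to pad each document's index list (the res_dict values) to a common length and use nn.Embedding with a padding index. This is only a sketch under that assumption — names like `NBModel` and `make_batch` are mine, and the zero init of w is a choice, not part of the original:

```python
import torch
import torch.nn as nn

class NBModel(nn.Module):
    def __init__(self, nf, r, b, w_adj=0.4):
        super().__init__()
        self.w_adj = w_adj
        self.pad = nf  # index nf is reserved as the padding slot
        self.w = nn.Embedding(nf + 1, 1, padding_idx=nf)  # learnable per-word weight
        nn.init.zeros_(self.w.weight)                     # start at w = 0 (a choice)
        # r gets an extra zero entry at the padding index, so padded
        # positions contribute nothing to the sum below.
        self.register_buffer('r', torch.cat([r.view(-1), torch.zeros(1)]))
        self.b = nn.Parameter(torch.tensor(float(b)))

    def forward(self, idx):                # idx: LongTensor (batch, max_len)
        w_i = self.w(idx).squeeze(-1)      # (batch, max_len)
        r_i = self.r[idx]                  # (batch, max_len); 0 at padding
        return ((w_i + self.w_adj) * r_i).sum(1) + self.b

def make_batch(docs, pad):
    """Pad a batch of per-document index lists to a common length."""
    max_len = max(len(d) for d in docs)
    return torch.tensor([d + [pad] * (max_len - len(d)) for d in docs])

nf = 6
r = torch.tensor([0.5, -1.2, 0.8, 0.1, -0.3, 0.9])  # toy NB ratios
model = NBModel(nf, r, b=0.0)
batch = make_batch([[0, 2], [1, 4, 5]], pad=nf)
out = model(batch)  # shape (2,); with w = 0 this is 0.4 * sum of each doc's r values
```

Because `r[pad] = 0`, the padded positions drop out of the sum regardless of w_adj, and `padding_idx` keeps the padding row of w at zero with no gradient.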

r is:

x = trn_term_doc
y = trn_y

p = x[y==1].sum(0) + 1
q = x[y==0].sum(0) + 1

r = np.log((p/p.sum())/(q/q.sum()))

b = np.log((y == 1).sum() / (y == 0).sum())  # log ratio of class counts; note len(p)/len(q) is always 1 here since p and q are both length-nf vectors
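As a sanity check, here is the same computation on a toy matrix. One caveat: since p and q are both length-nf vectors, `np.log(len(p)/len(q))` always comes out 0; I assume the intended bias is the log ratio of class counts, which is what this sketch uses:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy check of the r / b computation on a 4-document, 3-term matrix,
# binarized with sign() as in the term-document setup above.
x = csr_matrix(np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
])).sign()
y = np.array([1, 1, 0, 0])

# .sum(0) on a scipy sparse matrix returns an np.matrix of shape (1, nf);
# np.asarray(...).ravel() flattens it to a plain (nf,) array.
p = np.asarray(x[y == 1].sum(0)).ravel() + 1  # smoothed positive counts
q = np.asarray(x[y == 0].sum(0)).ravel() + 1  # smoothed negative counts

r = np.log((p / p.sum()) / (q / q.sum()))     # NB log-count ratio, shape (nf,)
b = np.log((y == 1).sum() / (y == 0).sum())   # log ratio of class counts
```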