Using gensim dictionary with torchtext

Hi, how could I use a gensim Dictionary with torchtext?

where docs is the dataset after cleaning and tokenization, i.e. a list of token lists like [["cat", "sit", "table"], ...]

from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.matutils import sparse2full
import numpy as np

docs_dict = Dictionary(docs)
docs_dict.filter_extremes(no_below=20, no_above=0.2)
docs_dict.compactify()

docs_corpus = [docs_dict.doc2bow(doc) for doc in docs]
model_tfidf = TfidfModel(docs_corpus, id2word=docs_dict)
docs_tfidf = model_tfidf[docs_corpus]
docs_tfid = np.vstack([sparse2full(c, len(docs_dict)) for c in docs_tfidf])

# one GloVe vector per dictionary word (nlp is explained below)
glove_vecs = np.vstack([nlp(docs_dict[i]).vector for i in range(len(docs_dict))])

# TF-IDF-weighted combination of the GloVe vectors, one row per document
docs_embedded_with_tfid = np.dot(docs_tfid, glove_vecs)

where nlp is a spaCy model that provides the GloVe word vectors
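For reference, here is a minimal sketch of how such an nlp object could be set up; the model name en_core_web_lg is only an assumption, any spaCy pipeline that ships with word vectors would work:

import spacy

# assumption: a spaCy model with pretrained 300-dim word vectors,
# installed e.g. via: python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")

print(nlp("cat").vector.shape)  # (300,)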

Any help or discussion would be appreciated.

Thanks.

docs_embedded_with_tfid is a matrix of shape num_samples x 300. Any idea how I could pass this as a dataset?

You could just pre-compute this data matrix and pass it to a Dataset.
In the __getitem__ method you would get a single sample and transform it if necessary.
Here is a small example:

import torch
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    def __init__(self, embeddings, transform=None):
        self.data = embeddings
        self.transform = transform

    def __getitem__(self, index):
        # return a single sample, transformed if necessary
        x = self.data[index]
        if self.transform is not None:
            x = self.transform(x)
        return x

    def __len__(self):
        return len(self.data)


nb_samples = 100
docs_embedded_with_tfid = torch.randn(nb_samples, 300)
dataset = MyDataset(docs_embedded_with_tfid)
loader = DataLoader(
    dataset,
    batch_size=10,
    shuffle=True,
    pin_memory=torch.cuda.is_available(),
    num_workers=2
)

for batch_idx, data in enumerate(loader):
    print('Batch idx {}, data shape {}'.format(
        batch_idx, data.shape))
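If you also have targets for each document, one option (just a side note, assuming a tensor of labels with one entry per sample) would be to return (data, target) pairs, e.g. via TensorDataset:

from torch.utils.data import TensorDataset

targets = torch.randint(0, 6, (nb_samples,))               # assumed integer class labels
dataset = TensorDataset(docs_embedded_with_tfid, targets)
data, target = dataset[0]                                  # each item is a (sample, label) pair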

The problem is now with indexing: after an extensive search I realized that it throws an error when I convert the data to CUDA.

Could you check the dtype of your input to nn.Embedding?
It should be a torch.LongTensor containing indices.
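Just to illustrate what I mean (a minimal sketch, not your exact setup): nn.Embedding only accepts integer index tensors, so a float tensor will fail regardless of the device:

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=1000, embedding_dim=300)

idx = torch.tensor([[1, 5, 42]])   # LongTensor of indices -> works
print(emb(idx).shape)              # torch.Size([1, 3, 300])

bad = torch.rand(1, 3)             # float values in [0, 1] -> raises a RuntimeError
try:
    emb(bad)
except RuntimeError as e:
    print(e)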

The input has values in [0, 1] and it is a CUDA float tensor.

The shape is batch_size x 300.

OK, then I might be mistaken.
Could you post your model and training code?
It seems docs_embedded_with_tfid is already some embedding output.
What kind of error do you get?

I want to use the TF-IDF embedding as input to the model. Unfortunately, the TF-IDF matrix is float.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=22)
X = vectorizer.fit_transform(train.comments)   # sparse TF-IDF matrix
y = X.toarray()                                # dense float matrix (num_samples x vocab_size)

As far as I know, TF-IDF gives you the (weighted) frequencies of different words, so using it as indices won't work.
In that case, you could maybe pass the tensor directly to a linear layer with in_features=300?
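For example, a minimal sketch of what I mean (hidden_dim and the batch here are made up):

import torch
import torch.nn as nn

hidden_dim = 128                    # made-up size
fc = nn.Linear(in_features=300, out_features=hidden_dim)

x = torch.rand(10, 300)             # stands in for a batch of your 300-dim float features
out = fc(x)                         # float input is exactly what nn.Linear expects
print(out.shape)                    # torch.Size([10, 128])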

import torch.nn as nn


class SimpleBiLSTMBaseline(nn.Module):
    def __init__(self, hidden_dim, emb_dim=300,
                 spatial_dropout=0.05, recurrent_dropout=0.1, num_linear=1):
        super().__init__()  # don't forget to call this!
        # size_vocab is assumed to be defined globally
        self.embedding = nn.Embedding(size_vocab, emb_dim)
        # note: dropout has no effect in an nn.LSTM with num_layers=1
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=1, dropout=recurrent_dropout)
        self.linear_layers = []
        for _ in range(num_linear - 1):
            self.linear_layers.append(nn.Linear(hidden_dim, hidden_dim))
        self.linear_layers = nn.ModuleList(self.linear_layers)
        self.predictor = nn.Linear(hidden_dim, 6)

    def forward(self, seq):
        # seq is expected to be a LongTensor of word indices of shape (seq_len, batch_size)
        hdn, _ = self.encoder(self.embedding(seq))
        feature = hdn[-1, :, :]  # output of the last time step
        for layer in self.linear_layers:
            feature = layer(feature)
        preds = self.predictor(feature)
        return preds

You’re suggesting to use nn.Linear() before nn.Embedding(), or to completely remove nn.Embedding()?

I’m not really familiar with text processing, so I’m not sure what the best approach using TF-IDF features would be.
However, nn.Embedding expects a torch.LongTensor containing indices, so it won’t work with the TF-IDF tensor. You could try to pass the word indices to it directly.
The Word Embeddings Tutorial might be a good starting point.
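Something like this might work as a starting point (a rough sketch; doc2idx comes from the gensim Dictionary you already use, and the padding scheme is simplified, normally you would reserve dedicated pad/unk indices):

import torch
import torch.nn as nn

# map each tokenized document to dictionary indices; OOV tokens get index 0 here
indexed_docs = [docs_dict.doc2idx(doc, unknown_word_index=0) for doc in docs]

# pad / truncate to a fixed length so documents can be batched
max_len = 50
padded = [d[:max_len] + [0] * (max_len - len(d[:max_len])) for d in indexed_docs]
seq = torch.tensor(padded, dtype=torch.long)    # LongTensor of shape (num_docs, max_len)

embedding = nn.Embedding(len(docs_dict), 300)
embedded = embedding(seq)                       # (num_docs, max_len, 300)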
