Using gensim dictionary with torchtext

Hi, how could I use a gensim Dictionary with torchtext?

where docs is the dataset after cleaning and tokenization, i.e. a list of token lists like [["cat", "sit", "table"], ...]

from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.matutils import sparse2full
import numpy as np

docs_dict = Dictionary(docs)
docs_dict.filter_extremes(no_below=20, no_above=0.2)
docs_dict.compactify()

docs_corpus = [docs_dict.doc2bow(doc) for doc in docs]
model_tfidf = TfidfModel(docs_corpus, id2word=docs_dict)
docs_tfidf = model_tfidf[docs_corpus]
docs_tfid = np.vstack([sparse2full(c, len(docs_dict)) for c in docs_tfidf])

# one GloVe vector per dictionary word (nlp is explained below)
glove_vecs = np.vstack([nlp(docs_dict[i]).vector for i in range(len(docs_dict))])

# TF-IDF-weighted combination of the GloVe vectors, one row per document
docs_embedded_with_tfid = np.dot(docs_tfid, glove_vecs)

where nlp is a spaCy model that provides the GloVe word vectors
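For reference, here is a minimal sketch of how such an nlp object could be set up; the model name en_core_web_lg is only an assumption, any spaCy pipeline that ships with word vectors would work:

import spacy

# assumption: a spaCy model with pretrained 300-dim word vectors,
# installed e.g. via: python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")

print(nlp("cat").vector.shape)  # (300,)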

Any help or discussion would be appreciated.

Thanks.

docs_embedded_with_tfid is a matrix of shape num_samples x 300. Any idea how I could pass this as a dataset?

You could just pre-compute this data matrix and pass it to a Dataset.
In the __getitem__ method you would get a single sample and transform it if necessary.
Here is a small example:

import torch
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    def __init__(self, embeddings, transform=None):
        self.data = embeddings
        self.transform = transform

    def __getitem__(self, index):
        # return a single sample, transformed if necessary
        x = self.data[index]
        if self.transform is not None:
            x = self.transform(x)
        return x

    def __len__(self):
        return len(self.data)


nb_samples = 100
docs_embedded_with_tfid = torch.randn(nb_samples, 300)
dataset = MyDataset(docs_embedded_with_tfid)
loader = DataLoader(
    dataset,
    batch_size=10,
    shuffle=True,
    pin_memory=torch.cuda.is_available(),
    num_workers=2
)

for batch_idx, data in enumerate(loader):
    print('Batch idx {}, data shape {}'.format(
        batch_idx, data.shape))
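If you also have targets for each document, one option (just a side note, assuming a tensor of labels with one entry per sample) would be to return (data, target) pairs, e.g. via TensorDataset:

from torch.utils.data import TensorDataset

targets = torch.randint(0, 6, (nb_samples,))               # assumed integer class labels
dataset = TensorDataset(docs_embedded_with_tfid, targets)
data, target = dataset[0]                                  # each item is a (sample, label) pair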

The problem is now with indexing: after an extensive search I realized that it throws an error when I convert the data to CUDA.

Could you check the dtype of your input to nn.Embedding?
It should be a torch.LongTensor containing indices.
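Just to illustrate what I mean (a minimal sketch, not your exact setup): nn.Embedding only accepts integer index tensors, so a float tensor will fail regardless of the device:

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=1000, embedding_dim=300)

idx = torch.tensor([[1, 5, 42]])   # LongTensor of indices -> works
print(emb(idx).shape)              # torch.Size([1, 3, 300])

bad = torch.rand(1, 3)             # float values in [0, 1] -> raises a RuntimeError
try:
    emb(bad)
except RuntimeError as e:
    print(e)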

The input has values in [0, 1] and it is a CUDA float tensor.

The shape is batch_size x 300.

OK, then I might be mistaken.
Could you post your model and training code?
It seems docs_embedded_with_tfid is already some embedding output.
What kind of error do you get?

I want to use the TF-IDF embedding as input to the model. Unfortunately, the TF-IDF matrix is float.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=22)
X = vectorizer.fit_transform(train.comments)   # sparse TF-IDF matrix
y = X.toarray()                                # dense float matrix (num_samples x vocab_size)

As far as I know, TF-IDF gives you the (weighted) frequencies of different words, so using it as indices won't work.
In that case, you could maybe pass the tensor directly to a linear layer with in_features=300?
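For example, a minimal sketch of what I mean (hidden_dim and the batch here are made up):

import torch
import torch.nn as nn

hidden_dim = 128                    # made-up size
fc = nn.Linear(in_features=300, out_features=hidden_dim)

x = torch.rand(10, 300)             # stands in for a batch of your 300-dim float features
out = fc(x)                         # float input is exactly what nn.Linear expects
print(out.shape)                    # torch.Size([10, 128])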

import torch.nn as nn


class SimpleBiLSTMBaseline(nn.Module):
    def __init__(self, hidden_dim, emb_dim=300,
                 spatial_dropout=0.05, recurrent_dropout=0.1, num_linear=1):
        super().__init__()  # don't forget to call this!
        # size_vocab is assumed to be defined globally
        self.embedding = nn.Embedding(size_vocab, emb_dim)
        # note: dropout has no effect in an nn.LSTM with num_layers=1
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=1, dropout=recurrent_dropout)
        self.linear_layers = []
        for _ in range(num_linear - 1):
            self.linear_layers.append(nn.Linear(hidden_dim, hidden_dim))
        self.linear_layers = nn.ModuleList(self.linear_layers)
        self.predictor = nn.Linear(hidden_dim, 6)

    def forward(self, seq):
        # seq is expected to be a LongTensor of word indices of shape (seq_len, batch_size)
        hdn, _ = self.encoder(self.embedding(seq))
        feature = hdn[-1, :, :]  # output of the last time step
        for layer in self.linear_layers:
            feature = layer(feature)
        preds = self.predictor(feature)
        return preds

You’re suggesting to use nn.Linear() before nn.Embedding(), or to completely remove nn.Embedding()?

I’m not really familiar with text processing, so I’m not sure what the best approach using TF-IDF features would be.
However, nn.Embedding expects a torch.LongTensor containing indices, so it won’t work with the TF-IDF tensor. You could try to pass the word indices to it directly.
The Word Embeddings Tutorial might be a good starting point.
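Something like this might work as a starting point (a rough sketch; doc2idx comes from the gensim Dictionary you already use, and the padding scheme is simplified, normally you would reserve dedicated pad/unk indices):

import torch
import torch.nn as nn

# map each tokenized document to dictionary indices; OOV tokens get index 0 here
indexed_docs = [docs_dict.doc2idx(doc, unknown_word_index=0) for doc in docs]

# pad / truncate to a fixed length so documents can be batched
max_len = 50
padded = [d[:max_len] + [0] * (max_len - len(d[:max_len])) for d in indexed_docs]
seq = torch.tensor(padded, dtype=torch.long)    # LongTensor of shape (num_docs, max_len)

embedding = nn.Embedding(len(docs_dict), 300)
embedded = embedding(seq)                       # (num_docs, max_len, 300)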
