import numpy as np
import spacy
from gensim.models import TfidfModel
from gensim.matutils import sparse2full

nlp = spacy.load('en_core_web_md')  # any spaCy model that ships with 300-dim GloVe vectors

# docs: list of tokenized documents; docs_dict: gensim Dictionary built from docs
docs_dict.compactify()
docs_corpus = [docs_dict.doc2bow(doc) for doc in docs]
model_tfidf = TfidfModel(docs_corpus, id2word=docs_dict)
docs_tfidf = model_tfidf[docs_corpus]
docs_tfidf_dense = np.vstack([sparse2full(c, len(docs_dict)) for c in docs_tfidf])
glove_vecs = np.vstack([nlp(docs_dict[i]).vector for i in range(len(docs_dict))])
# (n_docs, vocab) @ (vocab, 300) -> one TF-IDF-weighted GloVe vector per document
docs_embedded_with_tfid = np.dot(docs_tfidf_dense, glove_vecs)
You could just pre-compute this data matrix and pass it to the Dataset.
In the __getitem__ method you would get a single sample and transform it if necessary.
Here is a small example:
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, embeddings, transform=None):
        self.data = embeddings
        self.transform = transform

    def __getitem__(self, index):
        # load a single sample and transform it if needed
        x = self.data[index]
        if self.transform is not None:
            x = self.transform(x)
        return x

    def __len__(self):
        return len(self.data)
nb_samples = 100
docs_embedded_with_tfid = torch.randn(nb_samples, 300)
dataset = MyDataset(docs_embedded_with_tfid)
loader = DataLoader(
    dataset,
    batch_size=10,
    shuffle=True,
    pin_memory=torch.cuda.is_available(),
    num_workers=2
)

for batch_idx, data in enumerate(loader):
    print('Batch idx {}, data shape {}'.format(
        batch_idx, data.shape))
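Since the 100 samples split evenly into batches of 10, each iteration should print a data shape of torch.Size([10, 300]).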
OK, then I might be mistaken.
Could you post your model and training code?
It seems docs_embedded_with_tfid is already some embedding output.
What kind of error do you get?
As far as I know, TF-IDF gives you weighted frequencies of the different words, so using these values as indices won’t work.
In that case, you could maybe pass the tensor directly to a linear layer with in_features=300?
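Something like this minimal sketch, where the single linear layer and the output size of 128 are just placeholders:

import torch.nn as nn

# Sketch: treat each 300-dim TF-IDF-weighted document vector as the input
# features of a linear layer; out_features=128 is an arbitrary placeholder.
model = nn.Linear(in_features=300, out_features=128)

for batch_idx, data in enumerate(loader):  # loader from the example above
    out = model(data)                      # [10, 300] -> [10, 128]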
I’m not really familiar with text processing, so I’m not sure what the best approach would be using TF-IDF features. However, nn.Embedding expects a torch.LongTensor containing indices, so it won’t work with the TF-IDF tensor. You could try passing the word indices to it directly.
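For illustration (a sketch only; the vocabulary size and indices below are made up):

import torch
import torch.nn as nn

# Sketch: nn.Embedding is a lookup table keyed by word indices (a LongTensor),
# e.g. the ids from docs_dict, not by TF-IDF values.
vocab_size = 10000  # placeholder, e.g. len(docs_dict)
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=300)
word_indices = torch.tensor([2, 7, 42], dtype=torch.long)
word_vectors = embedding(word_indices)  # shape [3, 300]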
The Word Embeddings Tutorial might be a good starting point.