Language model with 20m words

I've just started playing with PyTorch, and I want to try extending the word language model to support a dictionary with, say, 20M words. Is this something possible with current PyTorch, and are there any pointers on how I should go about it?


By the way, I am aware of: but there isn't a conclusion there yet?

The standard way to deal with a large dictionary is to use embeddings. Creating a tensor like 20000000 x 15 isn't too bad.
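To make the size claim concrete, here is a small sketch (sizes and the toy vocabulary below are illustrative, not from the thread): an `nn.Embedding` is just a `(vocab_size, embedding_dim)` weight matrix indexed by word id, and a 20M x 15 float32 table works out to about 1.2 GB.

```python
import torch
import torch.nn as nn

# Memory for a 20M x 15 float32 embedding table:
# 20e6 rows * 15 floats * 4 bytes = 1.2e9 bytes, i.e. about 1.2 GB.
table_gb = 20_000_000 * 15 * 4 / 1e9
print(table_gb)  # 1.2

# The same idea at toy scale: the table is indexed by word ids,
# and the lookup returns one embedding vector per id.
emb = nn.Embedding(1000, 15)          # toy vocab of 1000 words
ids = torch.tensor([3, 17, 999])      # word ids for one short sentence
vectors = emb(ids)
print(vectors.shape)                  # torch.Size([3, 15])
```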

But then how do I get these onto multiple GPUs? This alone is already 40 GB of memory.

Wait, what? 3 x 10^8 floats = 1.2 x 10^9 bytes = 1.2 GB. Where did you get 40 GB?

20M x 512 x 4 = 4.096 x 10^10 bytes, i.e. about 40 GB.

Hey, I said 15. You should at least state that in the case of a 512-dimensional embedding, it will take 40 GB.

Just arbitrarily changing numbers without saying anything about it is not going to help you communicate your question and get it solved.

Store it somewhere else, and build the embedding for each sentence. It’s not that hard.
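One way to sketch that suggestion (sizes and function names below are my own, not from the thread): keep the full table on the CPU, and for each sentence look up only the rows you need and move just those to the GPU.

```python
import torch
import torch.nn as nn

# Toy sizes; in practice this would be 20M x 512.
vocab_size, dim = 100_000, 512

# The full table stays in CPU memory (it could also be memory-mapped
# from disk); it is never copied to the GPU in one piece.
full_table = nn.Embedding(vocab_size, dim)

device = "cuda" if torch.cuda.is_available() else "cpu"

def embed_sentence(word_ids):
    # Look up this sentence's rows on CPU, then ship only those
    # (seq_len x dim) vectors to the device.
    return full_table(word_ids).to(device)

sentence = torch.tensor([5, 42, 777, 12])
out = embed_sentence(sentence)
print(out.shape)  # torch.Size([4, 512])
```

If the embeddings need to be trained rather than just read, `nn.Embedding(..., sparse=True)` makes the backward pass produce sparse gradients, so only the rows actually used in a batch are updated.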