Language model with 20m words

I've just started playing with PyTorch, and I want to try extending the word language model to support a dictionary with, say, 20M words. Is this something possible with current PyTorch, and are there any pointers on how I should go about it?


By the way, I am aware of: but there isn't a conclusion there yet?

The standard way to deal with a large dictionary is to use embeddings. Creating a tensor like 20000000 x 15 isn't too bad.
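To make the size claim concrete, here is a small sketch (sizes and the toy vocabulary below are illustrative, not from the thread): an `nn.Embedding` is just a `(vocab_size, embedding_dim)` weight matrix indexed by word id, and a 20M x 15 float32 table works out to about 1.2 GB.

```python
import torch
import torch.nn as nn

# Memory for a 20M x 15 float32 embedding table:
# 20e6 rows * 15 floats * 4 bytes = 1.2e9 bytes, i.e. about 1.2 GB.
table_gb = 20_000_000 * 15 * 4 / 1e9
print(table_gb)  # 1.2

# The same idea at toy scale: the table is indexed by word ids,
# and the lookup returns one embedding vector per id.
emb = nn.Embedding(1000, 15)          # toy vocab of 1000 words
ids = torch.tensor([3, 17, 999])      # word ids for one short sentence
vectors = emb(ids)
print(vectors.shape)                  # torch.Size([3, 15])
```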

But then how do I get these onto multiple GPUs? This alone is already 40 GB of memory.

Wait, what? 3 x 10^8 floats = 1.2 x 10^9 bytes = 1.2 GB. Where did you get 40 GB?

20M x 512 x 4 = 4.096 x 10^10 bytes, i.e. about 40 GB.

Hey, I said 15. You should at least state that in the case of a 512-dimensional embedding, it will take 40 GB.

Just arbitrarily changing numbers without saying anything about it is not going to help you communicate your question and get it solved.

Store it somewhere else, and build the embedding for each sentence. It’s not that hard.
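One way to sketch that suggestion (sizes and function names below are my own, not from the thread): keep the full table on the CPU, and for each sentence look up only the rows you need and move just those to the GPU.

```python
import torch
import torch.nn as nn

# Toy sizes; in practice this would be 20M x 512.
vocab_size, dim = 100_000, 512

# The full table stays in CPU memory (it could also be memory-mapped
# from disk); it is never copied to the GPU in one piece.
full_table = nn.Embedding(vocab_size, dim)

device = "cuda" if torch.cuda.is_available() else "cpu"

def embed_sentence(word_ids):
    # Look up this sentence's rows on CPU, then ship only those
    # (seq_len x dim) vectors to the device.
    return full_table(word_ids).to(device)

sentence = torch.tensor([5, 42, 777, 12])
out = embed_sentence(sentence)
print(out.shape)  # torch.Size([4, 512])
```

If the embeddings need to be trained rather than just read, `nn.Embedding(..., sparse=True)` makes the backward pass produce sparse gradients, so only the rows actually used in a batch are updated.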