Embedding vs one-hot encoding performance for char-level RNNs

stared · April 30, 2018, 10:32am

Mathematically, an embedding is equivalent to one-hot encoding followed by a linear layer without bias (which we may or may not need). The performance aspect, though, is not clear to me. That is:

embeddings work on a sparse representation (integer indices):
- sparse lookups may take more time (do they?)
one-hot encoding is a dense representation:
- it takes more memory
- the data transformation may be a bottleneck (e.g. one-hot encoding every batch)
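
To make the equivalence concrete, here is a minimal, framework-agnostic sketch in NumPy (the sizes, the seed and the variable names are my own assumptions, not taken from any particular model). It checks that gathering rows from a weight matrix gives the same result as multiplying one-hot vectors by that same matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for a character-level model.
vocab_size, embed_dim = 30, 8

# One weight matrix, used both as the embedding table and as the
# weights of a bias-free linear layer.
W = rng.standard_normal((vocab_size, embed_dim))

# A short sequence of character indices.
idx = np.array([3, 0, 17, 3, 29])

# Path 1: embedding lookup, i.e. gather rows of W by index.
emb_lookup = W[idx]

# Path 2: one-hot encode the indices, then apply the linear map.
one_hot = np.eye(vocab_size)[idx]      # shape (5, vocab_size)
emb_matmul = one_hot @ W               # shape (5, embed_dim)

equivalent = np.allclose(emb_lookup, emb_matmul)
print(equivalent)
```

So the two paths differ only in how the result is computed (a gather versus a dense matrix product), which is exactly where the performance question comes from.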

While dense representations are infeasible for word-level RNNs, for character-level ones (typically 26 to ~100 dimensions, depending on which characters you include) I have seen both: networks with one-hot encoding and networks with embeddings.
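
As a rough illustration of the memory point at character-level sizes, this sketch compares the size of one batch stored as integer indices with the same batch stored one-hot. The batch size, sequence length, vocabulary size and dtypes are assumptions I picked for the example:

```python
import numpy as np

# Hypothetical batch dimensions for a char-level RNN.
batch_size, seq_len, vocab_size = 64, 100, 100

# Input for an embedding layer: one int64 index per character.
idx = np.zeros((batch_size, seq_len), dtype=np.int64)

# Input for a one-hot model: one float32 vector of length vocab_size per character.
one_hot = np.zeros((batch_size, seq_len, vocab_size), dtype=np.float32)

idx_bytes = idx.nbytes            # batch_size * seq_len * 8
one_hot_bytes = one_hot.nbytes    # batch_size * seq_len * vocab_size * 4
ratio = one_hot_bytes / idx_bytes

print(idx_bytes, one_hot_bytes, ratio)
```

Under these assumptions the one-hot batch is vocab_size * 4 / 8 times larger than the index batch, so the overhead grows linearly with the number of characters.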

Code-wise and data-wise, embeddings are better (cleaner code, no transformations). But does that come at the cost of performance, and if so, how much? So:

Is there a rule of thumb for when to use which?
Or are there any performance tests showing that, for given conditions (number of characters, sequence length), one is much more performant than the other?
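
For reference, this is the kind of test I have in mind, as a minimal sketch only. It times both paths in NumPy on the CPU, so it says nothing about GPU behaviour or about any specific deep learning framework. The helper `time_both` and all sizes are hypothetical:

```python
import timeit
import numpy as np

def time_both(vocab_size, seq_len, batch_size=32, embed_dim=16, repeats=20):
    """Return average seconds per call for (index lookup, one-hot + matmul)."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((vocab_size, embed_dim)).astype(np.float32)
    idx = rng.integers(0, vocab_size, size=(batch_size, seq_len))
    eye = np.eye(vocab_size, dtype=np.float32)

    def lookup():
        # Embedding-style path: gather rows of W.
        return W[idx]

    def one_hot_matmul():
        # One-hot path: encode the batch on the fly, then dense matmul.
        return eye[idx] @ W

    t_lookup = timeit.timeit(lookup, number=repeats) / repeats
    t_one_hot = timeit.timeit(one_hot_matmul, number=repeats) / repeats
    return t_lookup, t_one_hot

# Sweep the two conditions from the question: number of characters and sequence length.
results = {}
for vocab_size in (26, 100):
    for seq_len in (50, 200):
        results[(vocab_size, seq_len)] = time_both(vocab_size, seq_len)
        t_lookup, t_one_hot = results[(vocab_size, seq_len)]
        print(f"vocab={vocab_size} seq_len={seq_len} "
              f"lookup={t_lookup:.2e}s one_hot={t_one_hot:.2e}s")
```

The numbers will depend heavily on hardware and library, which is why I am asking whether anyone has measured this in a real training setup.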