Mathematically, an embedding layer is equivalent to one-hot encoding followed by a linear layer (which we may or may not need anyway). However, the performance implications are not clear to me. That is:
- embeddings take a sparse input representation (an integer index per character):
  - sparse lookups may take more time (do they?)
- one-hot encoding is a dense representation:
  - more memory
  - the data transformation may be a bottleneck (e.g. one-hot encoding every batch)
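The equivalence claimed above is easy to verify: looking up row `i` of an embedding matrix gives the same result as multiplying a one-hot vector for `i` by that matrix. A minimal NumPy sketch (vocabulary size and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim = 26, 8                  # e.g. lowercase letters
W = rng.normal(size=(vocab_size, embed_dim))   # shared weight matrix

ids = np.array([2, 0, 19])                     # character indices, e.g. "cat"

# Embedding: a direct row lookup into W
emb = W[ids]

# One-hot encoding followed by a linear layer with the same weights
one_hot = np.eye(vocab_size)[ids]              # shape (3, 26), mostly zeros
lin = one_hot @ W

assert np.allclose(emb, lin)                   # identical results
```

Note the one-hot path materializes a `(batch, vocab)` matrix of mostly zeros before the matmul, while the lookup indexes rows directly.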
While for word-level RNNs dense representations are infeasible, for character-level ones (typically 26 to ~100 dims, depending on which characters you include) I have seen both: networks with one-hot encoding and networks with embeddings.
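For concreteness, this is the per-batch transformation a one-hot pipeline performs; the ~27-character vocabulary below is an assumption for illustration:

```python
import numpy as np

chars = "abcdefghijklmnopqrstuvwxyz "          # small character set (assumed)
char_to_id = {c: i for i, c in enumerate(chars)}

def one_hot_batch(texts, seq_len):
    """Build a dense (batch, seq_len, vocab) tensor for one batch --
    this is the transformation step that an embedding layer avoids."""
    out = np.zeros((len(texts), seq_len, len(chars)), dtype=np.float32)
    for b, text in enumerate(texts):
        for t, ch in enumerate(text[:seq_len]):
            out[b, t, char_to_id[ch]] = 1.0
    return out

batch = one_hot_batch(["hello world", "char rnn"], seq_len=16)
print(batch.shape)    # (2, 16, 27)
print(batch.nbytes)   # 3456 bytes, almost all zeros
```

With an embedding layer the batch would instead stay as a `(2, 16)` array of integer indices until inside the network.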
Code-wise and data-wise, embeddings are better (cleaner code, no per-batch transformations). But does this come at a performance cost, and if so, how much? So:
- Are there any rules of thumb for when to use one over the other?
- Or are there any performance tests showing that, for given conditions (number of characters, sequence length), one is significantly more performant than the other?
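Absent a published benchmark, one way to probe the question is a micro-benchmark of the two input paths in isolation. A CPU/NumPy sketch follows; the vocabulary, batch, and sequence sizes are arbitrary assumptions, and timings will not transfer directly to GPU frameworks, where the kernel launch and memory-transfer story differs:

```python
import time
import numpy as np

rng = np.random.default_rng(0)

vocab, dim = 80, 64                  # character vocab and embedding size (assumed)
batch, seq = 128, 100
W = rng.normal(size=(vocab, dim)).astype(np.float32)
ids = rng.integers(0, vocab, size=(batch, seq))

def embedding_lookup():
    # Fancy indexing: gathers rows of W, no large intermediate
    return W[ids]

def one_hot_matmul():
    # Materializes a dense (128, 100, 80) tensor, then a matmul
    oh = np.zeros((batch, seq, vocab), dtype=np.float32)
    np.put_along_axis(oh, ids[..., None], 1.0, axis=-1)
    return oh @ W

assert np.allclose(embedding_lookup(), one_hot_matmul())

def avg_ms(fn, n=100):
    t0 = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - t0) / n * 1e3

print(f"lookup : {avg_ms(embedding_lookup):.3f} ms")
print(f"one-hot: {avg_ms(one_hot_matmul):.3f} ms")
```

This only measures the input transformation, not end-to-end training; in a real RNN the recurrent layers usually dominate, which may be why both styles appear in practice.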