Embedding vs one-hot encoding performance for char-level RNNs

Mathematically, an embedding is equivalent to one-hot encoding followed by a linear layer without bias (a layer we may or may not need). The performance aspect, however, is not clear to me. That is:

  • embeddings take a sparse representation (integer indices) as input:
    • sparse lookups may take more time (do they?)
  • one-hot encoding is a dense representation:
    • more memory
    • data transformation may be a bottleneck (e.g. one-hot encoding per batch)
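
To make the equivalence concrete, here is a minimal pure-Python sketch. The sizes and the helper names (`one_hot`, `linear_no_bias`, `embedding_lookup`) are made up for illustration and are not taken from any library. It only checks that both paths give the same output; it says nothing about speed in a real framework.

```python
import random


def one_hot(index, vocab_size):
    """Return a dense one-hot row vector for a single character index."""
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]


def linear_no_bias(x, weight):
    """Compute x @ weight for a row vector x and a (vocab_size x dim) matrix."""
    dim = len(weight[0])
    return [sum(x[i] * weight[i][j] for i in range(len(x))) for j in range(dim)]


def embedding_lookup(index, weight):
    """Return row `index` of the weight matrix, which is what a lookup does."""
    return list(weight[index])


random.seed(0)
VOCAB_SIZE, EMBED_DIM = 30, 4  # hypothetical sizes, chosen only for the demo
weight = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)]
          for _ in range(VOCAB_SIZE)]

for idx in [3, 0, 17, 29]:
    dense_path = linear_no_bias(one_hot(idx, VOCAB_SIZE), weight)
    sparse_path = embedding_lookup(idx, weight)
    # Multiplying by a one-hot vector just selects one row of the matrix.
    assert all(abs(a - b) < 1e-12 for a, b in zip(dense_path, sparse_path))
```

The dense path does `VOCAB_SIZE * EMBED_DIM` multiply-adds per character to select a single row, while the lookup path reads that row directly. Whether this difference matters in practice depends on the framework and hardware.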

While dense representations are infeasible for word-level RNNs, for character-level ones (typically 26 to ~100 dimensions, depending on which characters you include) I have seen both: networks with one-hot encoding and networks with embeddings.
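
As a rough back-of-the-envelope check of the memory side, the sketch below compares the size of one input batch stored as integer indices versus as a one-hot tensor. All numbers are assumptions for illustration: batch size 64, sequence length 100, 8-byte integer indices, 4-byte float one-hot values, and a 50,000-word vocabulary for the word-level case.

```python
def index_batch_bytes(batch_size, seq_len, bytes_per_index=8):
    """Bytes needed to store a batch as integer indices (assumes 8-byte ints)."""
    return batch_size * seq_len * bytes_per_index


def one_hot_batch_bytes(batch_size, seq_len, vocab_size, bytes_per_value=4):
    """Bytes needed to store a batch as dense one-hot floats (assumes 4-byte floats)."""
    return batch_size * seq_len * vocab_size * bytes_per_value


BATCH_SIZE, SEQ_LEN = 64, 100  # hypothetical batch shape

print("indices:", index_batch_bytes(BATCH_SIZE, SEQ_LEN), "bytes")
for name, vocab_size in [("char-level", 100), ("word-level", 50_000)]:
    size = one_hot_batch_bytes(BATCH_SIZE, SEQ_LEN, vocab_size)
    print(f"one-hot, {name} (vocab={vocab_size}): {size} bytes")
```

Under these assumptions the index batch takes 51,200 bytes, the character-level one-hot batch about 2.56 MB, and the word-level one-hot batch about 1.28 GB, which is why the one-hot option is only realistic for small vocabularies.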

Code-wise and data-wise, embeddings are better (cleaner code, no transformations). But does that come at the cost of performance, and if so, how much? So:

  • Is there a rule of thumb for when to use each?
  • Or are there any performance tests showing that, under given conditions (number of characters, sequence length), one is much more performant than the other?