Best way to use sparse tensors / optimize this particular scenario?

Hello,

I am working on a custom NLP architecture that uses two types of featurization for word-level tokens, as described in this paper. One of the two is the “sparse features”, which, according to the authors, are simply a multi-hot encoding of character-level n-grams with n up to 5.

I assume the correct representation of the per-token n-gram multi-hot encoding is a very sparse vector whose dimension is on the order of L^n, where L is the number of distinct characters in the language and n is the maximum n-gram length. For English and n = 5 that comes out to roughly 12M entries (26^5 ≈ 11.9M, plus the shorter n-grams).
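To make that concrete, this is roughly what I have in mind (the n-gram-to-index vocabulary here is just a placeholder, not the paper's exact scheme):

```python
from typing import Dict, List

def char_ngrams(token: str, max_n: int = 5) -> List[str]:
    """All character n-grams of a token, for n = 1..max_n."""
    return [token[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(token) - n + 1)]

def multi_hot_indices(token: str, ngram_vocab: Dict[str, int], max_n: int = 5) -> List[int]:
    """Positions of the 1s in the ~12M-dimensional multi-hot vector for one token."""
    return sorted({ngram_vocab[g] for g in char_ngrams(token, max_n) if g in ngram_vocab})
```

So each token activates at most about 5 · len(token) of the ~12M dimensions, which is what makes the representation so sparse.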

This “vector” has to be taken through a Linear(in_features=12M, out_features=d_model), which takes an enormous amount of memory for the weight matrix alone.
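To put a number on it (assuming d_model = 512 just for this back-of-the-envelope calculation; the paper may use a different width):

```python
N_FEATURES = 12_000_000   # up-to-5-gram feature space estimated above
D_MODEL = 512             # assumption for the sake of the calculation

# proj = torch.nn.Linear(in_features=N_FEATURES, out_features=D_MODEL)
# The float32 weight of that layer alone would be:
weight_bytes = N_FEATURES * D_MODEL * 4
print(f"{weight_bytes / 2**30:.1f} GiB")   # ~22.9 GiB, before gradients and optimizer state
```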

How can this task be optimized with sparse representations, or are there other ways to reduce the memory needed for this transformation? So far, I have thought about sparsifying the representations themselves, but should I also (and is there a way to) optimize the Linear layer itself?
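For context, by “sparsifying the representations” I mean something along these lines (toy sizes so it actually runs; the real case would be ~12M × d_model, and the indices are made up):

```python
import torch
import torch.nn as nn

N_FEATURES, D_MODEL = 1_000, 64   # toy sizes; the real case is ~12M and d_model

# Two tokens, each stored only as the indices of its active n-grams.
indices = torch.tensor([[0,  0,   0,  1,  1],     # row index = token in the batch
                        [3, 17, 902, 42,  7]])    # column index = n-gram feature id
values = torch.ones(indices.shape[1])
tokens = torch.sparse_coo_tensor(indices, values, size=(2, N_FEATURES))

proj = nn.Linear(N_FEATURES, D_MODEL)
out = torch.sparse.mm(tokens, proj.weight.t()) + proj.bias   # shape (2, D_MODEL)
```

This avoids ever materializing the dense 12M-wide input, but the 12M × d_model weight (and its gradient) stays dense, which is the part I don't see how to get around.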