Training a model on a sparse binary BOW vector of length 7K

I would like to compare the performance of a model that captures semantic relationships, such as a transformer, with a model trained on bag-of-words vectors. The BOW vectors are binary, of length 7K, and I have about 100K examples spread across 21 classes.

What's a reasonable starting point for an architecture when dealing with a large binary vector like this as input? I'm still fairly new to PyTorch, so any guidance is appreciated!
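
For concreteness, here's a minimal sketch of the kind of baseline I've been considering — a plain MLP over the 7K-dim binary input. The hidden sizes, dropout rate, and the random smoke-test data are just placeholders I made up, not anything I've validated:

```python
import torch
import torch.nn as nn

# Minimal MLP baseline: 7K-dim binary bag-of-words in, 21 class logits out.
# Hidden sizes and dropout are arbitrary placeholders, not recommendations.
class BOWClassifier(nn.Module):
    def __init__(self, vocab_size=7000, num_classes=21):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab_size, 512),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes),  # raw logits; pair with CrossEntropyLoss
        )

    def forward(self, x):
        return self.net(x)

# Smoke test on random binary vectors standing in for real BOW features.
model = BOWClassifier()
x = torch.randint(0, 2, (32, 7000)).float()  # fake batch of 32 examples
logits = model(x)
print(logits.shape)  # torch.Size([32, 21])
```

I've also seen `nn.EmbeddingBag` mentioned for sparse inputs (summing embeddings of the active word indices instead of pushing a dense multi-hot vector through a `Linear` layer) — would that matter at this scale, or is a dense MLP like the above fine?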