I have a tabular PyTorch model that takes cities and zip codes as categorical embeddings. However, my holdout set contains zip codes and cities that were not in my train/test set. This is also a real possibility once the model is deployed.
How do I set up the model so that it still runs when the holdout set, or new data at inference time, contains categorical values it has never seen?
For example, I have 2819 unique zip codes in my training and test set, which gives an embedding size of (2918, 50). The new zip codes in my holdout sample yield an embedding size of (8684, 50).
As I get new values in, I may have a zip code never seen before. Is there a way to account for this?
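To make the question concrete, here is a minimal sketch of one common way to handle unseen categories: reserve one embedding row for "unknown" and map any value outside the training vocabulary to it, so the embedding size is fixed by the training data. The names `UNK_INDEX`, `build_vocab`, and `encode`, and the sample zip codes, are hypothetical and only for illustration. The embedding dimension of 50 matches the sizes above.

```python
import torch
import torch.nn as nn

UNK_INDEX = 0  # hypothetical reserved row for any value not seen in training


def build_vocab(train_values):
    """Map each distinct training value to an index starting at 1; 0 is kept for unknowns."""
    return {value: i for i, value in enumerate(sorted(set(train_values)), start=1)}


def encode(values, vocab):
    """Convert raw values to indices, sending unseen values to UNK_INDEX."""
    return torch.tensor([vocab.get(v, UNK_INDEX) for v in values], dtype=torch.long)


# Illustrative data only; a real vocabulary would come from the training set.
train_zips = ["10001", "94105", "60601"]
vocab = build_vocab(train_zips)

# One extra row for the unknown bucket, so the table size never depends on new data.
zip_embedding = nn.Embedding(num_embeddings=len(vocab) + 1, embedding_dim=50)

new_zips = ["94105", "99999"]  # "99999" does not appear in train_zips
indices = encode(new_zips, vocab)
vectors = zip_embedding(indices)  # lookup succeeds for both rows
```

Note that the unknown row only learns something useful if some training examples are routed to it, for example by mapping rare zip codes to `UNK_INDEX` or by randomly masking a small fraction of values during training. Otherwise it stays at its random initialization.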