How does one deal with different categories in PyTorch train, test, and holdout sets

Jordan_Howell · December 3, 2019, 2:44pm

I have a tabular PyTorch model that takes in cities and zip codes as categorical embeddings. However, my holdout set has zip codes and cities that were not in my train/test set. This is also a real possibility if this model goes into deployment.

How do I set up a model so that it still runs when the holdout set, or new data at inference time, contains categorical values it has never seen?

For example, I have 2819 unique zip codes in my training and test set, which gives an embedding size of (2918, 50). The new zip codes in my holdout sample would require an embedding size of (8684, 50).

As I get new values in, I may have a zip code never seen before. Is there a way to account for this?
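
One common way to handle this, sketched here under assumptions rather than taken from the original model, is to reserve one embedding row for "unknown" and map every category not seen in training to that row. The embedding table size then stays fixed no matter what appears in the holdout set or in production. The names `UNK_INDEX`, `build_vocab`, and `encode`, as well as the sample zip codes, are hypothetical and used only for illustration.

```python
UNK_INDEX = 0  # assumed convention: row 0 is reserved for unseen categories


def build_vocab(train_values):
    """Map each distinct training category to an index starting at 1."""
    distinct = sorted(set(train_values))
    return {value: i for i, value in enumerate(distinct, start=1)}


def encode(values, vocab):
    """Look up indices, falling back to UNK_INDEX for unseen categories."""
    return [vocab.get(value, UNK_INDEX) for value in values]


# Illustrative data only, not from the original post.
train_zips = ["10001", "60601", "94105", "60601"]
zip_vocab = build_vocab(train_zips)

# One extra row for the unknown bucket. This is the value that would be
# passed as num_embeddings to torch.nn.Embedding(num_embeddings, 50).
num_embeddings = len(zip_vocab) + 1

holdout_zips = ["60601", "73301"]  # "73301" never appeared in training
holdout_indices = encode(holdout_zips, zip_vocab)
print(holdout_indices)
```

Because the vocabulary is built from the training data only, every index produced by `encode` is guaranteed to be a valid row of the embedding table. One caveat: the unknown row only learns something useful if it receives gradient updates during training, for example by randomly replacing a small fraction of training values with `UNK_INDEX`; otherwise it stays at its random initialization.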