Encoding Train, Validation, Test

Hi, I have a dataset with roughly 50 classes and would like to perform multi-class classification on it. However, for some first experiments, I only want to consider the top 20 classes for train, validation, and test, respectively. This means the classes in train, validation, and test will only partly overlap. My classes are strings, so I have to encode them, but I am not sure how. I assumed that I have to encode the classes after splitting. But then, do I have to make sure that class 1 in the training set corresponds to class 1 in the validation set (if that class exists in both)? What about the classes that are only in validation, not in test? I can only assign numbers from 0-19 to them, since I set the number of classes in the LSTM output layer to 20, so it will run into an error if there are class indices greater than 19.

If I understand your question correctly, you want to validate and test your model on subsets that contain samples of classes that are not in your training set.
This is a highly challenging problem (see: zero-shot learning), and unless your goal is specifically to train a model that can recognize classes that are not present in the training set, I would strongly recommend against it.
In either case, your class indices must always be the exact same during training, validation and testing, e.g. “Class 0” must always be encoded with index 0, “Class 1” with index 1 and so on.
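To make the point about fixed indices concrete, here is a minimal sketch in plain Python. The class names and the helpers `build_label_index` and `encode` are made up for illustration and are not from your code; the only idea is that one mapping is built once and then reused for every split.

```python
def build_label_index(class_names):
    """Map each distinct class string to one fixed integer index.

    Sorting makes the mapping deterministic across runs.
    """
    return {name: idx for idx, name in enumerate(sorted(set(class_names)))}


def encode(labels, label_index):
    """Encode string labels with the shared mapping (KeyError on unknown classes)."""
    return [label_index[name] for name in labels]


# Hypothetical toy labels standing in for the three splits.
train_labels = ["cat", "dog", "bird", "dog"]
val_labels = ["dog", "cat"]
test_labels = ["bird", "dog"]

# Build the mapping ONCE, then reuse it for train, validation and test.
label_index = build_label_index(train_labels + val_labels + test_labels)

train_y = encode(train_labels, label_index)
val_y = encode(val_labels, label_index)
test_y = encode(test_labels, label_index)

# The size of the mapping is the number of output units the model needs.
num_classes = len(label_index)
```

With a mapping like this, "dog" gets the same index in all three splits, and `num_classes` can be used for the size of the output layer.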
As a first experiment, I would recommend using only the samples of classes that are present in all three subsets and, if possible, balancing the number of samples per class (for example via random undersampling), so that your training, validation and test data have the same underlying distribution. This will help you get an intuition of how your model performs on your data.
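A rough sketch of that first experiment, again in plain Python, could look like the following. The helper names (`common_classes`, `filter_to_classes`, `undersample`) and the toy data are assumptions for illustration only, and the undersampling shown is simple random undersampling down to the size of the smallest class.

```python
import random
from collections import defaultdict


def common_classes(*label_lists):
    """Return the classes that occur in every one of the given label lists."""
    return set.intersection(*(set(labels) for labels in label_lists))


def filter_to_classes(samples, labels, keep):
    """Keep only the (sample, label) pairs whose label is in `keep`."""
    pairs = [(s, l) for s, l in zip(samples, labels) if l in keep]
    return [s for s, _ in pairs], [l for _, l in pairs]


def undersample(samples, labels, seed=0):
    """Randomly drop samples so every class has as many as the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for s, l in zip(samples, labels):
        by_class[l].append(s)
    n = min(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for l in sorted(by_class):
        for s in rng.sample(by_class[l], n):
            out_samples.append(s)
            out_labels.append(l)
    return out_samples, out_labels


# Hypothetical toy data: integers stand in for the real samples.
train_x, train_y = [0, 1, 2, 3, 4, 5], ["a", "a", "a", "b", "b", "c"]
val_y = ["a", "b"]
test_y = ["b", "a", "d"]

keep = common_classes(train_y, val_y, test_y)           # classes in all three splits
flt_x, flt_y = filter_to_classes(train_x, train_y, keep)
bal_x, bal_y = undersample(flt_x, flt_y)                # balanced training subset
```

The same filtering and undersampling would be applied to the validation and test splits, each with its own samples.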

Hi! Sorry for my late reply and thanks for your help! My goal is indeed to train a model that is able to recognize unknown classes. I have now encoded the training classes as 0-4 and continued in the validation set with 5-9. However, I now do not know how to initialize the model. If I tell the model that there are 10 classes (train classes + validation classes), I assume there will be data leakage. But when I initialize the model with only 5 classes, I obviously get an error on the validation set, since the indices are too high.

I'm afraid that, as I am no expert in zero-shot learning, I can only recommend studying the literature on that specific topic.
You are right, specifying the number of classes via the number of output neurons provides some information to your model, but I don’t see another way of initializing your model. However, I would argue that specifying the number of classes is just a hypothesis about your classification problem and therefore acceptable. You can check out “Mitchell. ‘Machine Learning’. 1997” for an example of the futility of bias-free learning.

Hey, thanks for your fast reply! I will definitely check out the literature you recommended 🙂