My question is not specific to PyTorch, but I believe this is still a good place to ask it. I have two related questions. I have a single text file from which I need to extract training and test sets. The dataset is rather small for training neural networks, but I have no other choice at the moment.
My first question: is it a good idea to randomly hold out 10% of the dataset as the test set, or should I do something more clever than random selection?
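To make the first question concrete, here is a minimal sketch of the plain random split I have in mind. It assumes the file holds one example per line; the function name `split_train_test`, the seed, and the toy data are only illustrative.

```python
import random

def split_train_test(lines, test_fraction=0.1, seed=0):
    """Shuffle a copy of the examples and hold out a fraction as the test set."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(lines)         # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Toy stand-in for the lines read from the text file (assumed format).
examples = [f"sentence number {i}" for i in range(50)]
train, test = split_train_test(examples)
print(len(train), len(test))
```

The held-out test lines are set aside once and never touched again; any cross validation would then run on `train` only.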
My second question concerns what to do after creating the training and test sets. I believe I should apply K-Fold cross validation since I have limited data. However, I don't know how to build my model's vocabulary in that case. In general, I believe the vocabulary should be built from the training data only. In the K-Fold cross validation setting, though, the training set changes with every fold, so the model ends up with a different vocabulary in each fold if I ignore the validation set while building it. What I wonder is: even though this happens, should I still ignore the validation set while building the vocabulary, or can the vocabulary be built from the combination of the training and validation sets?
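To illustrate the first of these two options, here is a rough sketch in which the vocabulary is rebuilt from the training part of every fold, and validation tokens that are not in it are mapped to an unknown token. The whitespace tokenization, the `<unk>` token, and all function names are my own assumptions for the example, not part of any library.

```python
from collections import Counter

UNK = "<unk>"  # assumed placeholder for out-of-vocabulary tokens

def build_vocab(sentences, min_count=1):
    """Map each token seen in `sentences` to an integer id; id 0 is UNK."""
    counts = Counter(tok for s in sentences for tok in s.split())
    vocab = {UNK: 0}
    for tok, count in sorted(counts.items()):
        if count >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Convert a sentence to ids, sending unseen tokens to UNK."""
    return [vocab.get(tok, vocab[UNK]) for tok in sentence.split()]

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs using contiguous validation blocks."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, val_idx
        start += size

# Toy training data; in practice it would be shuffled before folding.
data = ["the cat sat", "the dog ran", "a bird flew",
        "the cat ran", "a dog sat", "one fish swam"]

fold_vocab_sizes = []
for train_idx, val_idx in k_fold_indices(len(data), k=3):
    train_part = [data[i] for i in train_idx]
    val_part = [data[i] for i in val_idx]
    vocab = build_vocab(train_part)                  # training fold only
    val_ids = [encode(s, vocab) for s in val_part]   # unseen tokens -> UNK
    fold_vocab_sizes.append(len(vocab))

print(fold_vocab_sizes)
```

With this setup the vocabulary size differs from fold to fold, which is exactly the situation I am asking about. The alternative would be to call `build_vocab` once on the training and validation sentences together.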