Convert text data to a type that can be standardized

Hi, I want to convert text data into a form that can be standardized. This is my desired pipeline:

1) Convert the data into a form that can be standardized.
2) Apply Linear Discriminant Analysis (LDA) to the standardized data for dimensionality reduction.
3) After LDA, pass the result to CountVectorizer.
4) Then split the data and train a traditional ML algorithm.

Is this a correct pipeline? Right now I'm stuck at the first step. How should I convert strings to numbers so that they can be standardized?

Or maybe there is an easier way to do this with a different pipeline?

All the tutorials use the Iris dataset, where every column is continuous numeric data, not text.

Have you tried this tutorial?

https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html

By the way, I found that one of the definitions in that tutorial needs to be changed:

import io
from collections import Counter

from torchtext.vocab import vocab  # note lower case "vocab"

def build_vocab(filepath, tokenizer):
  # Count every token in the file, one line at a time
  counter = Counter()
  with io.open(filepath, encoding="utf8") as f:
    for string_ in f:
      counter.update(tokenizer(string_))
  return vocab(counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'])  # note lower case "vocab"
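
To show what a vocab does for the "string to digits" step, here is a minimal standard-library sketch of the same idea. It is not torchtext's implementation: `build_index`, `encode`, the whitespace tokenizer, and the toy corpus are all made up for illustration.

```python
from collections import Counter

def build_index(sentences, specials=("<unk>", "<pad>")):
    """Map each token to an integer id (hypothetical helper, not a torchtext API)."""
    counter = Counter()
    for sentence in sentences:
        # Assumption: plain lowercase whitespace tokenization is good enough here.
        counter.update(sentence.lower().split())
    # Special tokens get the first ids, then tokens by descending count
    # (ties broken alphabetically so the ids are deterministic).
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    tokens = list(specials) + [tok for tok, _ in ordered]
    return {tok: i for i, tok in enumerate(tokens)}

def encode(sentence, index):
    """Turn a string into a list of ids, using <unk> for unseen tokens."""
    unk = index["<unk>"]
    return [index.get(tok, unk) for tok in sentence.lower().split()]

corpus = ["the cat sat", "the dog sat down"]  # toy data
index = build_index(corpus)
ids = encode("the bird sat", index)  # "bird" is unseen, so it maps to <unk>
print(ids)
```

Once every string is a list of integer ids like this, it can be padded, counted, or embedded, depending on what the downstream model expects.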

Alternatively, you can use generate_sp_model:

https://pytorch.org/text/stable/data_functional.html

That will give you a way to produce an efficient sentencepiece model and go back and forth between strings and tokens.

Here is the original paper:

Thanks, I'll check those.

One question, though: while training the ML model with k-fold cross-validation, is it okay to dump the trained model at every SKF split and use it in testing? Or how do I use a k-fold trained model? I want early stopping in traditional ML.

Are you limited on data? If not, a dedicated validation set would be better than k-fold, imo.

I’m not sure what you mean by “dump” the model.

It’s good practice to make a checkpoint after each epoch and keep a log of any running stats. Then you can choose which set of parameters was best after the fact.

When you want to run validation data, just be sure to use these bits of code:


with torch.no_grad():
    model.eval()
    #test and get loss or other stats here
    #good spot to make a checkpoint with stats log
    model.train()

#continue training

Yes, I meant checkpoints. But I did the project in traditional ML, so that’s why I asked whether it’s OK to save checkpoints after every k-fold training run.
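
For the traditional ML case, one possible way to save the fitted model from every fold is sketched below. It assumes scikit-learn, a `CountVectorizer` plus `LogisticRegression` pipeline, and a made-up toy dataset; `run_kfold` and the file names are hypothetical, not something from the project discussed here.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline

# Toy data, made up for illustration only.
texts = [
    "good movie", "great film", "loved it",
    "really good", "great acting", "fun and good",
    "bad movie", "terrible film", "hated it",
    "really bad", "awful acting", "boring and bad",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

def run_kfold(texts, labels, out_dir, n_splits=3, seed=0):
    """Fit one pipeline per fold, pickle it, and return (fold, accuracy, path) rows."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = []
    for fold, (train_idx, val_idx) in enumerate(skf.split(texts, labels)):
        # A fresh pipeline per fold: the vectorizer only ever sees training text.
        model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
        model.fit([texts[i] for i in train_idx], [labels[i] for i in train_idx])
        acc = model.score([texts[i] for i in val_idx], [labels[i] for i in val_idx])
        path = Path(out_dir) / f"fold_{fold}.pkl"
        with open(path, "wb") as f:
            pickle.dump(model, f)  # the per-fold "checkpoint"
        results.append((fold, acc, path))
    return results

with tempfile.TemporaryDirectory() as tmp:
    results = run_kfold(texts, labels, tmp)
    # Reload the fold with the highest validation accuracy.
    best_fold, best_acc, best_path = max(results, key=lambda r: r[1])
    with open(best_path, "rb") as f:
        best_model = pickle.load(f)
    predictions = best_model.predict(["good film", "bad film"])
```

Saving per fold like this is fine mechanically. A common alternative is to use the fold scores only to compare settings, then refit the chosen pipeline once on all of the training data before testing.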

Thank you so much. I already finished that project.