I am migrating to PyTorch from Keras.
I am having trouble understanding the seq2seq learning tutorial here.
I am following this tutorial because it most closely resembles my task at hand.
Now, a very basic principle in ML is that you must split your data into training and test sets:
your algorithm should not see the data you are going to test it on.
This example, though very detailed and nicely written, never splits the data, at least as far as I can tell.
Later on it evaluates the model by randomly drawing sentences from the same data it was trained on.
So we never get a measure of how well the algorithm generalizes, right?
What am I missing here? Do we not need to split the data into training and test sets?
How would we split it in the context of this example?
Frankly, if the purpose was to provide a small end-to-end language translation example, I think the tutorial falls short.
I went ahead and edited it to split the data into training and test sets. No cross-validation for now, just a simple 80-20 split.
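For reference, here is a minimal sketch of how such a split can be done, assuming the tutorial's `pairs` variable (a list of source/target sentence pairs); the function name and seed are my own choices, not from the tutorial:

```python
import random

def train_test_split(pairs, test_fraction=0.2, seed=42):
    """Shuffle a list of (source, target) sentence pairs and split it 80-20."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded shuffle so the split is reproducible
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# toy data standing in for the tutorial's French-English pairs
pairs = [("je suis froid", "i am cold")] * 10
train_pairs, test_pairs = train_test_split(pairs)
print(len(train_pairs), len(test_pairs))  # 8 2
```

The training loop then samples only from `train_pairs`, and evaluation draws only from `test_pairs`.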
The program cannot handle this change as written: the test split may contain a word that was never seen during training, so the word-to-index lookup fails and the program crashes.
I may open a new question, since this seems like a different problem altogether.
If you know what should be done, such as adding newly seen words to the list of indices, or some other approach, please let me know.
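One common approach (my assumption about how to adapt the tutorial's `Lang` class, not something the tutorial itself does) is to reserve an `<UNK>` token in the vocabulary and fall back to it for any word not seen during training; the index `2` for `<UNK>` is my choice, next to the tutorial's SOS/EOS tokens:

```python
UNK_token = 2  # tutorial reserves 0 for SOS and 1 for EOS; 2 for <UNK> is my assumption

class Lang:
    """Vocabulary mapping, adapted from the tutorial's Lang class with an <UNK> fallback."""
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.index2word = {0: "SOS", 1: "EOS", 2: "<UNK>"}
        self.n_words = 3  # count SOS, EOS and <UNK>

    def add_sentence(self, sentence):
        # build the vocabulary from *training* sentences only
        for word in sentence.split(" "):
            if word not in self.word2index:
                self.word2index[word] = self.n_words
                self.index2word[self.n_words] = word
                self.n_words += 1

    def indexes_from_sentence(self, sentence):
        # .get() maps any word never seen during training to <UNK> instead of crashing
        return [self.word2index.get(word, UNK_token) for word in sentence.split(" ")]

lang = Lang("eng")
lang.add_sentence("i am cold")
print(lang.indexes_from_sentence("i am hungry"))  # [3, 4, 2] -- "hungry" is unseen
```

This keeps the model architecture unchanged; unseen test words all collapse to one embedding, which is crude but prevents the crash.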