I’m trying to model a fairly complicated kind of sequence tagging in PyTorch, in which the tag of a given word can differ from one sentence to another.
I have coded the rest of the job, but something in the original PyTorch docs still confuses me.
Suppose we use a stacked Bi-LSTM in batch-first format.
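For reference, this is roughly what I mean by a stacked, batch-first Bi-LSTM. It is only a minimal sketch: the sizes below are made up for illustration and are not taken from my real model.

```python
import torch
import torch.nn as nn

# Illustrative sizes only (assumptions, not my real configuration).
batch_size, seq_len, input_dim, hidden_dim, num_layers = 2, 5, 16, 32, 2

lstm = nn.LSTM(
    input_size=input_dim,
    hidden_size=hidden_dim,
    num_layers=num_layers,  # "stacked"
    bidirectional=True,
    batch_first=True,       # input and output are (batch, seq, feature)
)

x = torch.randn(batch_size, seq_len, input_dim)
output, (h_n, c_n) = lstm(x)

# output: top-layer states for every time step, with the forward and
# backward directions concatenated on the last axis.
print(output.shape)  # torch.Size([2, 5, 64])

# h_n is not affected by batch_first:
# (num_layers * num_directions, batch, hidden).
print(h_n.shape)     # torch.Size([4, 2, 32])
```

As far as I understand the docs, `batch_first=True` only changes the layout of the input and of `output`, not of `h_n` and `c_n`.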
The objective is to tag a sequence (w1, w2, …, wn) with the desired tag labels (t1, t2, …, tn) using both character-level and word-level embeddings. The word-level LSTM should take the concatenation of each word’s embedding and its character-level representation as input. The words themselves are independent of each other, but in some cases the tag of the next word (t_n+1) changes the tag of the previous word (t_n), depending on another feature called the sequence meter. So a given word [w] can end up with different tags, which I worry will push the model toward overfitting.
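To make the setup concrete, here is a rough sketch of the architecture described above. The class name `CharWordTagger`, the layer sizes, and the vocabulary sizes are all placeholders chosen for illustration, and padding is not masked or packed here, so please read it as an outline rather than my actual code.

```python
import torch
import torch.nn as nn


class CharWordTagger(nn.Module):
    """Hypothetical char + word Bi-LSTM tagger; names and sizes are illustrative."""

    def __init__(self, n_words, n_chars, n_tags,
                 word_dim=50, char_dim=25, char_hidden=25, word_hidden=64):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim, padding_idx=0)
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        # Character-level Bi-LSTM: one vector per word from its characters.
        self.char_lstm = nn.LSTM(char_dim, char_hidden,
                                 bidirectional=True, batch_first=True)
        # Word-level stacked Bi-LSTM over [word embedding ; char representation].
        self.word_lstm = nn.LSTM(word_dim + 2 * char_hidden, word_hidden,
                                 num_layers=2, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * word_hidden, n_tags)

    def forward(self, words, chars):
        # words: (batch, seq_len); chars: (batch, seq_len, max_word_len)
        b, t, c = chars.shape
        char_vecs = self.char_emb(chars.reshape(b * t, c))  # (b*t, c, char_dim)
        _, (h_n, _) = self.char_lstm(char_vecs)             # h_n: (2, b*t, char_hidden)
        # Concatenate the final forward and backward states of each word.
        char_repr = torch.cat([h_n[0], h_n[1]], dim=-1).view(b, t, -1)
        x = torch.cat([self.word_emb(words), char_repr], dim=-1)
        h, _ = self.word_lstm(x)                            # (b, t, 2*word_hidden)
        return self.out(h)                                  # per-token tag scores


model = CharWordTagger(n_words=100, n_chars=30, n_tags=5)
words = torch.randint(1, 100, (2, 7))      # 2 sentences, 7 words each
chars = torch.randint(1, 30, (2, 7, 10))   # up to 10 characters per word
scores = model(words, chars)
print(scores.shape)  # torch.Size([2, 7, 5])
```

The output has one score vector per token, so a per-token cross-entropy loss (or a CRF layer on top) could be applied to it.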
My question is: is this type of architecture capable of extracting such “dependent” features of the input sequence, or should we add other features, e.g., a bi-gram representation?
Many thanks.