On a whim, I tried feeding random binary sequences into a small transformer and training the model to predict the value at position k+1 given the transformer's output at position k. The input embeddings are summed with a learnt positional embedding, the idea being that the model learns to query the input sequence at the right position (since the data is random, the only way to predict the value at k+1 is to attend to the input at position k+1).
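For reference, the setup I'm describing is roughly this (a minimal sketch, not the actual repo code; dimensions and names are made up, and I'm assuming a non-causal `nn.TransformerEncoder` so position k can attend to position k+1):

```python
import torch
import torch.nn as nn

# Illustrative dimensions, not the real hyperparams.
SEQ_LEN, D_MODEL, BATCH = 16, 32, 8

class NextBitModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(2, D_MODEL)                    # binary tokens -> vectors
        self.pos = nn.Parameter(torch.randn(SEQ_LEN, D_MODEL))   # learnt positional embedding
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, 1)

    def forward(self, bits):                     # bits: (B, L) in {0, 1}
        h = self.embed(bits) + self.pos          # sum token and positional embeddings
        h = self.encoder(h)                      # no mask: position k may attend to k+1
        return self.head(h).squeeze(-1)          # logits: (B, L)

model = NextBitModel()
x = torch.randint(0, 2, (BATCH, SEQ_LEN))
logits = model(x)
# Output at position k is trained against the input at position k+1:
loss = nn.functional.binary_cross_entropy_with_logits(
    logits[:, :-1], x[:, 1:].float())
```

Note that if a causal mask were accidentally applied here, the task would become literally impossible on random data, since position k could no longer see position k+1.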
However, this does not work at all: the model learns nothing.
I tried hunting for bugs with no success.
I tried many hyperparams with no success.
I tried predicting the value at position k from the output at position k instead, and that works easily (probably thanks to the residual connection).
I tried replacing the Transformer with a feedforward net, and that also works easily.
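The feedforward baseline I mean is something like this (again a sketch with made-up dimensions, not the repo code). Since an MLP over the flattened sequence sees every position at once, it can trivially learn the "copy the next bit" mapping, which is why I'd expect it to succeed:

```python
import torch
import torch.nn as nn

# Illustrative dimensions, not the real hyperparams.
SEQ_LEN, HIDDEN, BATCH = 16, 64, 8

# MLP mapping the whole sequence to per-position logits.
ffn = nn.Sequential(
    nn.Linear(SEQ_LEN, HIDDEN),
    nn.ReLU(),
    nn.Linear(HIDDEN, SEQ_LEN),
)

x = torch.randint(0, 2, (BATCH, SEQ_LEN)).float()
logits = ffn(x)
# Same objective as before: output at position k vs. input at position k+1.
loss = nn.functional.binary_cross_entropy_with_logits(logits[:, :-1], x[:, 1:])
```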
I can't find my bug! Or is this genuinely a difficult task for a transformer? That would be surprising.
The code can be found here: TransformerBuffle/train.py at main · yotam-happy/TransformerBuffle · GitHub
I'd greatly appreciate it if anyone spots my bug or has an explanation.