Why do we train with the mask and then also apply it during inference? (The ladder idea makes sense during training.) Why can't we instead train the model and then build a mask that extends an initial period?
Ex:
# mask would have the same dimensions except a different length, and be filled with 0
torch.cat((previousTimeTensor, mask))  # torch.cat expects a tuple/list of tensors
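To make the idea above concrete, here is a minimal runnable sketch; the tensor shapes and the choice of `dim=1` as the time axis are my own assumptions, not anything fixed:

```python
import torch

# Hypothetical shapes: (batch, time, features)
previousTimeTensor = torch.randn(1, 8, 16)  # the initial period
mask = torch.zeros(1, 4, 16)                # same dims except a different time length, filled with 0

# Concatenate along the (assumed) time dimension
extended = torch.cat((previousTimeTensor, mask), dim=1)
print(extended.shape)  # torch.Size([1, 12, 16])
```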