LSTM train max length vs length at prediction time

Does the LSTM training max length restriction carry over to a restriction at inference as well? That is, can I expect an LSTM model trained over sequences of max length 100 to give good results for sequences of length 200?

It is possible, but not guaranteed. Simply speaking, if the distributions of your training vectors (both x and y) are stationary over time, extrapolation into the future should be good; otherwise it is dubious, but may still work for some number of extra time steps. A GRU may fare better in this regard, because it has no separate ‘forget’ and ‘update’ gates that can cause state drift over time.
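Note that the length limit is purely a training choice, not a hard constraint of the layer. A minimal PyTorch sketch (sizes and data are made up for illustration) showing that the same LSTM runs on longer sequences at prediction time; whether the forecasts stay good is the statistical question above:

```python
import torch
import torch.nn as nn

# The recurrent layer itself imposes no length limit, so a model trained on
# length-100 sequences can be *run* on length-200 ones without any change.
lstm = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)

train_batch = torch.randn(4, 100, 8)   # length used during training
long_batch  = torch.randn(4, 200, 8)   # longer length at prediction time

out_train, _ = lstm(train_batch)       # shape (4, 100, 32)
out_long,  _ = lstm(long_batch)        # shape (4, 200, 32) -- runs fine
print(out_train.shape, out_long.shape)
```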

Do you think it’s OK to have the timestamp as part of the features for every timestep, when the difference between the timestamps of two consecutive timesteps is not always the same?

Irregular time series are not handled that well by RNNs in general, if the disparity is big. Timestamps are not good features, as they’re absolute; even if you shift to a zero base, you’ll get position-dependent forecasts that are inferior beyond your training length. Time differences (like numpy.diff) should work better; another approach is adding “periodic” features sin(2πt/period) and cos(2πt/period).
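A rough sketch of both ideas (the timestamps, the 24-hour period and the feature layout here are just illustrative assumptions):

```python
import numpy as np

# Build per-step features from raw timestamps: time differences instead of
# absolute timestamps, plus sin/cos encodings of an assumed known period.
timestamps = np.array([0.0, 1.0, 2.5, 3.0, 5.5, 6.0])   # irregular, in hours
period = 24.0                                            # assumed daily period

deltas = np.diff(timestamps, prepend=timestamps[0])      # time since previous step
sin_feat = np.sin(2 * np.pi * timestamps / period)
cos_feat = np.cos(2 * np.pi * timestamps / period)

features = np.stack([deltas, sin_feat, cos_feat], axis=1)  # (T, 3) extra inputs
print(features)
```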

Do you think it makes sense to just move my setup to transformers?

It depends. Look at what approaches are used in your problem domain. The irregularity issue can be tackled to some extent with resampling, and transformers are more “lookup” than “forecast” models (i.e. with bigger internal memory), so their applicability scenarios are somewhat different.
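If you go the resampling route, something like this pandas sketch (the frequency, aggregation and interpolation method are arbitrary choices, not a recommendation) puts an irregular series onto a regular grid before it goes into either an RNN or a transformer:

```python
import pandas as pd

# Irregularly spaced observations.
ts = pd.Series(
    [1.0, 1.4, 2.1, 2.0],
    index=pd.to_datetime(["2024-01-01 00:00", "2024-01-01 00:07",
                          "2024-01-01 00:21", "2024-01-01 00:55"]),
)

# Aggregate onto a regular 15-minute grid and fill the gaps by interpolation.
regular = ts.resample("15min").mean().interpolate("linear")
print(regular)
```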