FCN + ConvLSTM

I have a fully convolutional network (FCN), composed of an encoder and a decoder, for object localization in a single image.
I now need to add a ConvLSTM layer so that it works on a sequence of frames from a video.
How should I combine the two, how is the input fed into the ConvLSTM layer, and what should the input to that layer be?
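
To make the shape question concrete, here is a shape-only sketch of one possible arrangement. It assumes the ConvLSTM sits at the bottleneck between encoder and decoder, which is a common choice but not the only one. The encoder runs on each frame independently (time folded into the batch axis), its feature maps are regrouped into a 5D tensor for the ConvLSTM, and the decoder again runs per frame. Every size, channel count and function name below is hypothetical, and the layout is channels-first `(batch, time, channels, height, width)`.

```python
# Shape-only sketch: traces tensor shapes, does no real computation.
# All sizes and names are hypothetical, chosen only for illustration.

def conv_out(size, kernel, stride, padding):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_shape(shape, out_channels=64, n_down=3):
    """Per-frame encoder: n_down stride-2 3x3 convs on a (N, C, H, W) input."""
    n, _, h, w = shape
    for _ in range(n_down):
        h = conv_out(h, kernel=3, stride=2, padding=1)
        w = conv_out(w, kernel=3, stride=2, padding=1)
    return (n, out_channels, h, w)


def conv_lstm_shape(shape, hidden_channels=64):
    """ConvLSTM with 'same' padding that returns every time step.

    The input must be 5D: (B, T, C, H, W). Spatial size is preserved.
    """
    b, t, _, h, w = shape
    return (b, t, hidden_channels, h, w)


def decoder_shape(shape, out_channels=1, n_up=3):
    """Per-frame decoder: n_up 2x upsampling stages on a (N, C, H, W) input."""
    n, _, h, w = shape
    return (n, out_channels, h * 2 ** n_up, w * 2 ** n_up)


def trace(batch=2, frames=8, channels=3, height=128, width=128):
    """Return the shape at each stage for a clip of `frames` frames."""
    shapes = {"video": (batch, frames, channels, height, width)}

    # Fold time into the batch axis so the 2D encoder sees ordinary images.
    enc = encoder_shape((batch * frames, channels, height, width))
    shapes["encoder_out"] = enc

    # Unfold back to (B, T, C', H', W'): this is what the ConvLSTM consumes.
    shapes["conv_lstm_in"] = (batch, frames) + enc[1:]
    lstm = conv_lstm_shape(shapes["conv_lstm_in"])
    shapes["conv_lstm_out"] = lstm

    # Fold time into the batch axis again for the 2D decoder.
    dec = decoder_shape((batch * frames,) + lstm[2:])
    shapes["output"] = (batch, frames) + dec[1:]
    return shapes


if __name__ == "__main__":
    for name, shape in trace().items():
        print(f"{name:14s} {shape}")
```

Under these assumptions, the input to the ConvLSTM is the encoder's feature maps for the whole clip, not the raw frames. In Keras, `ConvLSTM2D` expects the same 5D input but defaults to channels-last, `(batch, time, rows, cols, channels)`, and the per-frame encoder and decoder can be applied with `TimeDistributed` instead of reshaping by hand.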