Architecture suggestion for spatiotemporal time series forecasting

I have a spatiotemporal time series dataset (not images or video). The data is 3D: each (x, y, t) coordinate holds one numeric value, such as the sea surface elevation at that location and point in time. So we can think of it as a 2D grid (a matrix) that changes over time.

I need to forecast the values for the next few time steps over the whole region (i.e. all x, y coordinates in the dataset). I was thinking of using a ConvLSTM or a CNN-LSTM, but most of the posts I find online apply them to video frame prediction. Since my data is not video, I only have 1 channel per time step.
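To make the setup concrete, here is a minimal sketch of how I imagine framing the data for a sequence model. It is only an illustration under assumptions: the helper name `make_windows`, the window lengths `n_in` and `n_out`, the grid size, and the random placeholder data are all made up and not part of my real dataset. It slices a (T, H, W) array into input/target windows and adds a trailing channel axis of size 1, which as far as I understand is the (samples, time, rows, cols, channels) layout that Keras's `ConvLSTM2D` takes by default.

```python
import numpy as np

def make_windows(series, n_in, n_out):
    """Split a (T, H, W) array into (input, target) window pairs.

    Returns X with shape (N, n_in, H, W, 1) and y with shape (N, n_out, H, W, 1),
    where N = T - n_in - n_out + 1 and the last axis is the single channel.
    """
    series = np.asarray(series, dtype=np.float32)
    if series.ndim != 3:
        raise ValueError("expected an array of shape (T, H, W)")
    n_samples = series.shape[0] - n_in - n_out + 1
    if n_samples < 1:
        raise ValueError("series is too short for the requested window sizes")
    # Each sample is n_in consecutive grids; its target is the next n_out grids.
    X = np.stack([series[i:i + n_in] for i in range(n_samples)])
    y = np.stack([series[i + n_in:i + n_in + n_out] for i in range(n_samples)])
    # Add the channel axis (1 channel, since each grid cell holds one value).
    return X[..., np.newaxis], y[..., np.newaxis]

# Placeholder data standing in for the real grid (hypothetical sizes):
# 50 time steps on a 16 x 20 grid.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 16, 20))

X, y = make_windows(data, n_in=10, n_out=3)
print(X.shape, y.shape)  # (38, 10, 16, 20, 1) (38, 3, 16, 20, 1)
```

With this framing, the "video" in the frame-prediction posts would just be my sequence of grids with a single channel instead of three.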

Can you all suggest an architecture that would be a good fit for my purpose? Thanks!