What Sort Of NN Architecture Would Experts Recommend To Predict A Time-Based Video Layout?

I have a particularly interesting challenge I am trying to solve with a neural network.

My desired output is a “prediction” of a physical and time-based layout. Here is a screenshot to make this clearer:

You can see here something that looks like a video editor. Each track has an image, and each track has attributes like: entryTime, exitTime, yPosition, xPosition, borderColor, scaleX, scaleY, etc.

I have a lot of training data… basically full video editor layouts, with all this data, for thousands of videos.

I am trying to build a model that will automatically create these layouts, given n images as inputs (along with a lot of data about each image) and a total duration value. In effect, I’m asking the model: “Here are some images, please lay them out on a video editor timeline.”

My first attempt has been to build a big, multi-layer feed-forward NN in PyTorch, treating this as a multi-value regression problem, predicting one track at a time, with each track having about 18,000 inputs.

My PyTorch Lightning model looked something like this:

  import torch
  from torch import nn
  import torch.nn.functional as F
  import pytorch_lightning as pl

  class MyNet(pl.LightningModule):
      def __init__(self):
          super().__init__()
          # Funnel the ~18,000 per-track input features down to the 41 output values
          self.l1 = nn.Linear(18000, 7200)
          self.l2 = nn.Linear(7200, 5890)
          self.l3 = nn.Linear(5890, 3800)
          self.l4 = nn.Linear(3800, 2048)
          self.l5 = nn.Linear(2048, 1024)
          self.l6 = nn.Linear(1024, 41)
          self.loss = nn.MSELoss()

      def forward(self, x):
          h1 = F.relu(self.l1(x))
          h2 = F.relu(self.l2(h1))
          h3 = F.relu(self.l3(h2))
          h4 = F.relu(self.l4(h3))
          h5 = F.relu(self.l5(h4))
          out = self.l6(h5)  # raw regression outputs (compared against targets with MSE)
          return out
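
The rest of the module was just the usual Lightning boilerplate, roughly a training step computing the MSE loss plus an optimizer, something like this (a sketch; the exact optimizer and learning rate are not the point):

      # (continuing inside MyNet)
      def training_step(self, batch, batch_idx):
          x, y = batch
          preds = self(x)
          loss = self.loss(preds, y)
          self.log("train_loss", loss)
          return loss

      def configure_optimizers(self):
          return torch.optim.Adam(self.parameters(), lr=1e-4)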

As you can see, each track gets 18,000 values as inputs to the network.

Why so many?

Because I have been trying to have the neural network predict each track sequentially, i.e. ‘build’ each track, one by one, as a human might.

For each track, I provide information only about the previous tracks… so for the input data for track 3, I show the model only data about tracks 0, 1, and 2… and set all the values of all the other tracks to 0.
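
Concretely, the per-track input assembly looks roughly like this (a simplified sketch; MAX_TRACKS, FEATURES_PER_TRACK, and track_features are illustrative names, not my real ones):

    import numpy as np

    MAX_TRACKS = 10            # illustrative
    FEATURES_PER_TRACK = 1800  # illustrative slice of the ~18,000 total inputs

    def build_input_for_track(track_features, target_index):
        """Concatenate per-track feature vectors, zeroing out the track being
        predicted and everything after it, so the model only sees earlier tracks."""
        x = np.zeros(MAX_TRACKS * FEATURES_PER_TRACK, dtype=np.float32)
        for i in range(min(target_index, MAX_TRACKS)):
            start = i * FEATURES_PER_TRACK
            x[start:start + FEATURES_PER_TRACK] = track_features[i]
        return x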

This JSON file has both the “source” and “target” data for one track. Here I am trying to predict the track with index 4, so you can see that I provide data for the “previous” tracks, 0, 1, 2, and 3, but for index 4 and above, I provide 0.0 for all fields…

https://botdb-audio.s3.amazonaws.com/public/input_sample.json

You can see, at the bottom of the JSON file, that for my targets, I am trying to predict values like this:

 "target": {
    "timeoFEntry": 0.0576412841602547,
    "timeOfExit": 0.35904749270363484,
    "has-effect-Scroll": 1.0,
    "has-effect-Ken Burns": 0.0,
    "has-effect-Brightness": 0.0,
    "has-effect-Contrast": 0.0,
    "has-effect-Crop": 1.0,
    "has-effect-Filmgrain": 0.0,
    "has-effect-Film Noise": 0.0,
    "has-effect-Flip Flop": 0.0,
    "has-effect-Glitch": 0.0,
    "has-effect-Old Video": 0.0,
    "has-effect-Scanline": 0.0,
    "has-effect-Sepia": 0.0,
    "has-effect-Tv Noise": 0.0,
    "has-effect-Vhs": 0.0,
    "has-effect-Vignette": 0.0,
    "has-effect-Position": 1.0,
    "has-effect-Scale": 1.0,
    "effect-value-Scroll-effectAmount": 0.47,
    "effect-value-Ken Burns-effectAmount": 0.0,
    "effect-value-Brightness-effectAmount": 0.0,
    "effect-value-Contrast-effectAmount": 0.0,
    "effect-value-Crop-cropTop": -0.9933333333333333,
    "effect-value-Crop-cropLeft": -0.7697142857142857,
    "effect-value-Crop-cropRight": 0.7652857142857146,
    "effect-value-Crop-cropBottom": 0.8633333333333333,
    "effect-value-Filmgrain-effectAmount": 0.0,
    "effect-value-Film Noise-effectAmount": 0.0,
    "effect-value-Flip Flop-effectAmount": 0.0,
    "effect-value-Glitch-effectAmount": 0.0,
    "effect-value-Old Video-effectAmount": 0.0,
    "effect-value-Scanline-effectAmount": 0.0,
    "effect-value-Sepia-effectAmount": 0.0,
    "effect-value-Tv Noise-effectAmount": 0.0,
    "effect-value-Vhs-effectAmount": 0.0,
    "effect-value-Vignette-effectAmount": 0.0,
    "effect-value-Position-positionOffsetX": -0.4557735694472981,
    "effect-value-Position-positionOffsetY": -0.04835660377358461,
    "effect-value-Scale-scaleX": 0.389361145591566,
    "effect-value-Scale-scaleY": 0.9196339438734126
  }

These are values that could be used to reconstruct the full design/layout/timing of a particular track.

Note that I am trying to predict both the existence of particular things (e.g. has-effect-Filmgrain predicts whether the track has the Filmgrain effect applied) and the values of certain things. The most critical values are these, which determine where each track appears in time and space:

"effect-value-Position-positionOffsetX": -0.05540686473250389,
"effect-value-Position-positionOffsetY": 0.20454731583595276,
"effect-value-Scale-scaleX": 0.7361692190170288,
"effect-value-Scale-scaleY": 0.747444212436676,
"timeOfExit": 9.890814781188965,
"timeoFEntry": 0.34969842433929443,

I would actually be happy just to be able to predict these six variables!
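
To make that mix concrete: of the 41 outputs, 17 are binary “has-effect” flags and the other 24 are continuous values (the two times plus 22 effect parameters). A loss that treated the two kinds differently would look roughly like this (a sketch only; the index split assumes the flag columns are grouped together, which may not match my real column order):

    import torch
    from torch import nn

    FLAG_IDX = slice(2, 19)                              # the 17 binary "has-effect" columns
    CONT_IDX = list(range(0, 2)) + list(range(19, 41))   # the 24 continuous columns

    bce = nn.BCEWithLogitsLoss()  # flags: the model would output raw logits here
    mse = nn.MSELoss()            # continuous values

    def mixed_loss(pred, target):
        flag_loss = bce(pred[:, FLAG_IDX], target[:, FLAG_IDX])
        cont_loss = mse(pred[:, CONT_IDX], target[:, CONT_IDX])
        return flag_loss + cont_loss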

So I spent a ton of time on this, and in the end, I think I found that regression is probably not the right approach. The model was able to make SOME decent predictions. For example, it learned fairly well how to position tracks IN TIME, i.e. that track 1 might start at 0:00, run until 05:00, and so on. But it was unable to learn how to position the tracks spatially, in terms of scaleX, positionX, and so on.

I am now thinking that my “regression” approach is just wrong, and instead I need something more like a GAN, like the kind used for music generation: Music generation with Neural Networks — GAN of the week | by Alexander Osipenko | Cindicator | Medium

OR – My problem is time based, so maybe I need something more like a time series model?
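
If I went that route, I imagine it would mean feeding the tracks in order and predicting each track’s 41 values from the tracks before it, vaguely like this (a very rough sketch of the idea, not something I have built; the per-track feature size is made up):

    import torch
    from torch import nn

    class TrackSequenceModel(nn.Module):
        """Treat the tracks as a sequence: predict each track's 41 layout
        values from the tracks that came before it."""
        def __init__(self, track_feature_dim=1800, hidden_dim=512, out_dim=41):
            super().__init__()
            self.encoder = nn.Linear(track_feature_dim, hidden_dim)
            self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, out_dim)

        def forward(self, track_feats):
            # track_feats: (batch, num_tracks, track_feature_dim)
            h = torch.relu(self.encoder(track_feats))
            seq_out, _ = self.lstm(h)   # (batch, num_tracks, hidden_dim)
            return self.head(seq_out)   # (batch, num_tracks, 41)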

Also – to be clear, my input data also includes information about each image! Most of those 18,000 input values actually consist of per-image information. I showed the model a few different lists of floats for each image: each track’s input had ResNet-18 embedding vectors (GitHub - christiansafka/img2vec: 🔥 Use pre-trained models in PyTorch to extract vector embeddings for any image), plus other image information. For example, I created 24 x 24 lists of ‘edge detections’, using code like this:

import cv2

def make_edge_detection(path):
    # Load as grayscale, run Canny edge detection, downscale to 24x24,
    # then flatten and normalize into the 0..1 range.
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(img, 100, 200)
    resized = cv2.resize(edges, (24, 24))
    flattened = resized.flatten()
    return [x / 255 for x in flattened.tolist()]
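
The ResNet-18 vectors come from the img2vec package linked above, used more or less as in its README, roughly:

    from img2vec_pytorch import Img2Vec
    from PIL import Image

    img2vec = Img2Vec(cuda=False)  # ResNet-18 is the default model (512 floats per image)

    def make_embedding(path):
        img = Image.open(path).convert('RGB')
        return img2vec.get_vec(img).tolist()  # numpy array -> plain list of floats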

My theory is that the model needs to ‘understand’ the images in order to know where to place them, so I also provided as much metadata as I could about each image, for example aspect ratio.

Thanks in advance if any PyTorch wizards can point me toward an approach I should try to make this work!