I don’t know where else to ask, and I’m getting a bit desperate. If anyone knows a better place to ask this, please tell me. Anyway, I’ll try here first.
I have trained a transformer network (multiple times). The dataset consists of image sequences of a plant growing larger (from sprouting to roughly 10 cm). The background is very simple, and there are just two lights that change direction (sometimes coming from the left and sometimes from the right). The network’s task is to predict the future plant shape, i.e. what the next frame looks like, when it gets 4 frames as input. This works OK, but it is still not perfect.
Now I want the network to create new plants by starting with some original images of plants from the dataset and predicting from there. The input to the network is then replaced with the output it produced before, so it only predicts from already predicted frames.
When I do this, after about 30 images some fragments from earlier frames start to evolve into unwanted things (for example, a second plant). The plant that was supposed to be predicted also becomes strange.
Has anyone experienced something similar? Or with other convolutional neural networks that produced strange output when predicting already predicted things?
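For anyone reading along, the closed-loop generation described above can be sketched roughly as follows. This is only a minimal illustration and not the poster's actual code: `rollout` and `predict_next` are hypothetical names, and `predict_next` stands in for whatever maps a window of 4 frames to the next frame.

```python
from collections import deque


def rollout(predict_next, seed_frames, num_steps, context=4):
    """Generate frames by feeding each prediction back in as input.

    predict_next: hypothetical callable taking a list of `context` frames
                  and returning the next frame.
    seed_frames:  real frames used to start the rollout.
    """
    # Sliding window over the most recent frames; old ones drop out automatically.
    window = deque(seed_frames[-context:], maxlen=context)
    generated = []
    for _ in range(num_steps):
        frame = predict_next(list(window))
        generated.append(frame)
        # After `context` steps the window holds only predicted frames,
        # so any error in a prediction becomes part of every later input.
        window.append(frame)
    return generated
```

After the first 4 steps no real frame is left in the window, which is the regime where the artifacts show up.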
I think the effects you are seeing are expected: your model was trained on “real”, clean frames to predict the next frame, while you are now feeding its own predictions back in to create new frames. Each step might therefore diverge a bit further from the real distribution.
I haven’t tried this approach in image generation, but I think that teacher forcing during training might be a valid experiment to run. It is (or was?) used in language models, which were suffering from similar effects, so it could also work for your model.
To do so, you would also use the predicted frames of the model during training and feed a real image once in a while (I think you should play around with the frequency a bit, but refer to known implementations to check what worked before) to stabilize the training.
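A rough sketch of what such a mixed training step could look like in PyTorch is below (this kind of mixing is often called scheduled sampling in the sequence-model literature). Everything here is an assumption for illustration: the function name, the `p_real` parameter, the MSE loss, and the model interface (a window of shape `(1, context, C, H, W)` in, one frame of shape `(1, C, H, W)` out) are made up and would need to be adapted to the actual model.

```python
import random

import torch
import torch.nn as nn


def scheduled_sampling_step(model, optimizer, frames, context=4, p_real=0.5):
    """One training step on a single real sequence.

    frames: tensor of shape (T, C, H, W) with T > context.
    p_real: probability of feeding the real frame back in instead of
            the model's own prediction (hypothetical parameter).
    """
    criterion = nn.MSELoss()  # assumed loss; swap in whatever the model uses
    window = [frames[t] for t in range(context)]  # start from real frames
    loss = 0.0
    for t in range(context, frames.shape[0]):
        inp = torch.stack(window).unsqueeze(0)   # (1, context, C, H, W)
        pred = model(inp).squeeze(0)             # (C, H, W)
        loss = loss + criterion(pred, frames[t])  # target is always the real frame
        # Decide what the model sees next: real frame or its own prediction.
        # detach() keeps gradients from flowing through the fed-back frame,
        # which is one possible design choice, not the only one.
        next_frame = frames[t] if random.random() < p_real else pred.detach()
        window = window[1:] + [next_frame]
    loss = loss / (frames.shape[0] - context)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With `p_real=1.0` this reduces to training on real inputs only, and with `p_real=0.0` the model trains purely on its own rollouts after the seed frames. Lowering `p_real` gradually over training would be one way to experiment with the frequency.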
Thank you for your answer! Yes, I had thought of something similar, so I’ve already trained the model on predicted data. Not the way you describe, though: I put both predicted and original sequences into the training data. Why would it help to mix real and predicted frames within a sequence? That isn’t something that would happen when creating the dream, would it?
Chapter 10.2.1 of the Deep Learning book explains teacher forcing, and in particular the last part of that section might be interesting for you:
The disadvantage of strict teacher forcing arises if the network is going to be later used in a closed-loop mode, with the network outputs (or samples from the output distribution) fed back as input. In this case, the fed-back inputs that the network sees during training could be quite different from the kind of inputs that it will see at test time. One way to mitigate this problem is to train with both teacher-forced inputs and free-running inputs, for example by predicting the correct target a number of steps in the future through the unfolded recurrent output-to-input paths. […]
The section also mentions references to a curriculum learning strategy.
Generally, I would skim through the literature, check how (text-based) sequence models have dealt with these issues, and try to adapt those techniques to your image sequences.
I want to do exactly this, but for another use case. How did you do it? I don’t know how to make a transformer predict the next frame in a sequence. I’m a bit stuck, to be honest.