Neural Style Transfer on videos

I would like to implement an architecture similar to this:

Characterizing and Improving Stability in Neural Style Transfer, Gupta, A., Johnson, J., Alahi, A., and Fei-Fei, L.

It is a Recurrent Convolutional Neural Network. The light blue box is a simple convolutional neural network, and the rest of the structure makes the network recurrent. The authors use sequences of 10 frames that get unfolded over 10 steps. The network is fed the current frame and the previous stylized frame (the frame generated at the previous step).

I have a working implementation of the feedforward architecture (the light blue box in the picture) and I would like to transform it into a Recurrent Convolutional Neural Network. Unfortunately, I could not find much about the topic in the PyTorch community.
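
In pseudocode, the unrolled forward pass I would like to end up with is roughly the following sketch (DummyStyleNet is just a stand-in for my feedforward network; concatenating the two inputs on the channel dimension is only one possible choice for the sketch):

import torch
import torch.nn as nn

# Placeholder for the feedforward stylization network (the light blue box).
# It takes the current frame and the previous stylized frame.
class DummyStyleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(6, 3, kernel_size=3, padding=1)

    def forward(self, frame, prev_stylized):
        return self.conv(torch.cat((frame, prev_stylized), dim=1))

net = DummyStyleNet()
frames = torch.rand(10, 1, 3, 64, 64)          # sequence of 10 frames, batch size 1
prev_stylized = torch.zeros(1, 3, 64, 64)      # black image at t = 0
for frame in frames:                           # unfold the recurrence over the 10 steps
    prev_stylized = net(frame, prev_stylized)  # feed frame t and the stylized frame t-1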

I have two questions about transforming a Convolutional Neural Network into an RCNN:

  1. How can I prepare the dataset of frames in order to feed them to the RCNN? Should I make a Dataset class that returns a sequence of frames (a rough sketch of what I have in mind follows this list)?

  2. How can I unfold this sequence of frames in order to use backpropagation through time? I read the PyTorch documentation and saw that I cannot use an RNN, LSTM, or GRU layer in this particular case, but have to write the recursion myself.
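
For the first question, this is roughly the kind of Dataset I have in mind. It is only a sketch: it assumes the clips are already loaded as tensors of shape (num_frames, 3, H, W) and omits the targets and image transforms:

import torch
from torch.utils.data import Dataset

class FrameSequenceDataset(Dataset):
    # Sketch of a Dataset that returns fixed-length sequences of frames.
    # In practice the frames would be read from disk and transformed here.
    def __init__(self, clips, seq_len=10):
        self.samples = []
        for clip in clips:
            # Cut every clip into non-overlapping windows of seq_len frames
            for start in range(0, clip.size(0) - seq_len + 1, seq_len):
                self.samples.append(clip[start:start + seq_len])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        seq = self.samples[idx]                  # (seq_len, 3, H, W)
        # Move time to dimension 1 so a batch becomes (B, 3, seq_len, H, W)
        return {'input': seq.permute(1, 0, 2, 3)}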

I would much appreciate any suggestions, pointers, tutorials, or videos I can take a look at in order to understand this part.


Perhaps you could summarize the contents of the papers? I’m guessing people on the forum want to help, but don’t have the time to read 2 papers before answering!

Thanks for your suggestion; I edited my question. I hope it is now clear what I would like to achieve/implement.

At the moment I have something like this:

# Loop over the epochs; total_iter counts optimizer steps across epochs
total_iter = 0
for epoch in range(1, args.epochs + 1):
    model.train()
    epoch_loss = 0
    # Loop over the sequences of frames returned by the DataLoader
    for i, sequence in enumerate(training_data_loader):
        optimizer.zero_grad()
        loss = 0
        im_out = (i % args.image_freq == 0)

        # Initialize the previous prediction as a black image
        prev_est = torch.zeros(sequence['input'].size(0), 3,
                               sequence['input'].size(3),
                               sequence['input'].size(4)).cuda(args.gpu, non_blocking=True)
        # Loop over the sequence of 10 frames, passing two frames (t-1, t) to the network
        for j in range(sequence['input'].size(2) - 1):
            inputs = torch.index_select(sequence['input'], 2,
                                        torch.tensor([j, j + 1])).cuda(args.gpu, non_blocking=True)
            t = torch.squeeze(torch.index_select(sequence['target'], 2,
                                                 torch.tensor([j + 1])), 2).cuda(args.gpu, non_blocking=True)
            output, l = model(inputs, prev_est, t, i, writer, im_out)
            loss += l
            # Use the current output as the previous estimate for the next step
            prev_est = output

        # Average over the number of unrolled steps (size(2) - 1 of them)
        epoch_loss += loss.item() / (sequence['input'].size(2) - 1)
        # Here I try to backpropagate through time over the whole unrolled
        # sequence, but I am not sure it is done correctly
        loss.backward()
        optimizer.step()

        writer.add_scalar('learning_rate', args.lr, total_iter)
        writer.add_scalar('train_loss', loss.item(), total_iter)

        print("===> Epoch[{}]({}/{}): Loss: {:.4f}".format(
            epoch, i, len(training_data_loader), loss.item()))
        total_iter += 1

    print("===> Epoch {} Complete: Avg. Loss: {:.4f}".format(
        epoch, epoch_loss / len(training_data_loader)))

The training_data_loader provides a tensor containing a sequence of 10 frames. Is the way I am doing BPTT correct?

Yeah, tough question but an interesting one.

I guess you need to create an “unrolled” network with gradients flowing through it: sum your loss at each timestep, then call backward on the accumulated loss after the final timestep. As long as the graph stays attached, the gradients should theoretically flow back through “time”.

This post implies you can achieve that by using the same variable for input and output. Maybe give that a shot?

# non-truncated: backpropagate through all T timesteps
# (out starts as the initial input; model, T and K are defined elsewhere,
#  and out is assumed to be a scalar so that backward() can be called on it)
for t in range(T):
    out = model(out)
out.backward()

# truncated to the last K timesteps
for t in range(T):
    out = model(out)
    if T - t == K:
        # detach() is not in-place: reassign so the graph is actually cut here
        out = out.detach()
out.backward()

Thanks for your reply. I found the conversation you linked in an older forum post and tried to follow it.

I posted the code of my training procedure so far above. The network is training, but I have the feeling that the gradient does not get propagated through time.

Cool, you can use hooks to inspect your gradients.
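
For example, a tensor hook can print the gradient that reaches the stylized output of each timestep when backward() runs. Something like this rough sketch, with the names taken from your loop as placeholders:

# Sketch: hooks fire during backward() and receive the gradient of the tensor
def make_print_hook(name):
    def hook(grad):
        print("{}: grad norm = {:.6f}".format(name, grad.norm().item()))
    return hook

# inside your inner loop, right after output, l = model(...):
# output.register_hook(make_print_hook("output at step {}".format(j)))

If the norms for the early steps are (near) zero, or the hooks never fire, the graph is probably being cut somewhere.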

I have a question about the “warp” function in PyTorch for my experiments.
If I want to warp an image using optical flow, can I use grid_sample() and affine_grid() to implement the warp function?

Hey Qing, look at this post.

warp layer example

You can look at that code and see how they did it.


Thanks a lot. I will read this post immediately.

Got it. I will see how they did it. Thanks for your reply.

Hi @riccardosamperna, if I want to warp image1 to image2, should the input optical flow (acquired from FlowNet) be from image2 to image1 rather than from image1 to image2?

I don’t really understand your question, and you should probably open another discussion, but let’s see if I can help you.

If you have image1 and image2 and you want to warp image1 to image2, you calculate the optical flow between image1 and image2 and use it to warp image1. If you have the flow from image2 to image1, the flow from image1 to image2 is just its opposite.
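
To also answer your earlier grid_sample() question: one way to do the warp is to build a base grid of pixel coordinates, add the flow to it, normalize the result to [-1, 1] and pass it to grid_sample(). A rough, untested sketch, assuming the flow is a (N, 2, H, W) tensor in pixel units with the x displacement in channel 0 and the y displacement in channel 1:

import torch
import torch.nn.functional as F

def warp(img, flow):
    # img:  (N, C, H, W) frame to warp
    # flow: (N, 2, H, W) optical flow in pixels (channel 0 = x, channel 1 = y)
    n, _, h, w = img.shape
    # Base grid of pixel coordinates
    ys, xs = torch.meshgrid(torch.arange(h, dtype=img.dtype, device=img.device),
                            torch.arange(w, dtype=img.dtype, device=img.device))
    base = torch.stack((xs, ys), dim=0).unsqueeze(0)   # (1, 2, H, W)
    coords = base + flow                               # sampling locations in pixels
    # Normalize to [-1, 1], the coordinate range grid_sample() expects
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=3)        # (N, H, W, 2)
    return F.grid_sample(img, grid)

With zero flow this should return the input image essentially unchanged, which is an easy sanity check.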

Thanks for your reply. I implemented the warp function in PyTorch, but when I use the optical flow generated by FlowNet2.0, the warped image shows hardly any change. I suppose there may be some problem in the optical flow or in the warp; I will check the function carefully later.