I’m working on integrating an existing implementation of a recurrent visual attention model with a video dataset (EPIC Kitchens), and I’m struggling to get this model to run with more than 1 video in a batch at a time.
Background: the original model I’m trying to overhaul is a recurrent model that classifies MNIST images by attending to different parts of them. For each image the model takes a ‘glimpse’ of the input, which is essentially a cropped-out image patch, and combines it with the glimpse location to produce the RNN’s hidden state vector. The hidden state is then used either to choose the location of the next image patch or to make a prediction about the image class. Essentially, the network behaves more like a human by focusing on different parts of the image based on what it has seen so far.
I’ve successfully made the network run on videos from EPIC Kitchens, with each video frame receiving one glimpse, but training is incredibly slow because of very slow data loading. I expected training to be quite slow given the video format and the fact that I don’t have an SSD (will get one at some point), but it’s too slow. I need to run experiments fairly rapidly to adapt the architecture to the requirements of EPIC Kitchens. I figured the slowness is largely due to having set the batch size to 1, and that’s where my current problem comes in. By default PyTorch won’t accept batches of variable-length video clips, and it will throw an error at data loading. Luckily, there are utilities for this: pad_sequence pads variable-length sequences with 0s so they can be stacked into one tensor, and pack_padded_sequence turns such a padded batch into a PackedSequence, which all of PyTorch’s RNN modules accept so they can skip the padding and be more computationally efficient.
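To make the data-loading error concrete: the straightforward fix I’ve found is a custom collate_fn that pads the clips in a batch to a common length. A minimal sketch, assuming (hypothetically — my actual Dataset may differ) each sample is a (frames, label) pair with frames of shape (T, C, H, W):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_clips(batch):
    """Pad variable-length clips to the max length in the batch.

    Assumes each sample is (frames, label) with frames of shape
    (T, C, H, W), T varying per clip. Hypothetical sample format --
    adjust to whatever the Dataset actually returns.
    """
    clips, labels = zip(*batch)
    lengths = torch.tensor([c.shape[0] for c in clips])
    # pad_sequence stacks along a new batch dim, padding with 0s:
    # result shape (B, T_max, C, H, W) with batch_first=True.
    padded = pad_sequence(clips, batch_first=True)
    return padded, lengths, torch.tensor(labels)
```

This would then be passed to the loader as `DataLoader(dataset, batch_size=8, collate_fn=collate_clips)`, and the lengths tensor carries the information that padding would otherwise destroy.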
That’s all neat, but I don’t use PyTorch’s RNN modules since my model is custom, and I don’t understand how to integrate PackedSequence into my model. I’m looking at the source code for torch.nn.modules.rnn and it’s not helping at all (in the definition of forward() for the RNNBase class, what does hx even stand for?). I’m also worried about VRAM usage, since the longest video clip in the subset I’m working on is an order of magnitude longer than the average clip, and a batch containing it could crash training.
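From what I can tell, though, a custom recurrent loop may not need PackedSequence at all: since I unroll the time steps myself, I could feed padded batches and simply mask the hidden-state update once a clip runs out of real frames. A rough sketch of what I mean, where step_fn is a hypothetical stand-in for my glimpse + RNN update:

```python
import torch

def run_custom_rnn(step_fn, frames, lengths, hidden_size):
    """Unroll a custom recurrent cell over a padded batch, freezing
    each clip's hidden state once its real frames run out.

    step_fn(frame_batch, h) -> h_new stands in for the model's
    glimpse + RNN update (hypothetical interface).
    frames: (B, T_max, ...) padded batch; lengths: (B,) true lengths.
    """
    B, T_max = frames.shape[0], frames.shape[1]
    h = torch.zeros(B, hidden_size)
    for t in range(T_max):
        h_new = step_fn(frames[:, t], h)
        # mask is 1 while t < length, 0 afterwards, so padded steps
        # keep the old hidden state instead of updating on zeros.
        mask = (t < lengths).float().unsqueeze(1)
        h = mask * h_new + (1 - mask) * h
    return h
```

This wastes compute on the padded steps (unlike a real PackedSequence, which skips them), but it keeps the results correct, which might be an acceptable trade-off for a first batched version.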
Can anyone give me a hint on how to integrate PackedSequence into my model? If that’s not possible or viable, can anyone point me towards a more efficient data-loading paradigm or code for my problem? I’m beginning to doubt the batch-loading paradigm, as it seems very ill-suited to my problem.
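One idea I’m considering for the VRAM issue: bucketing clips by length, so that each batch only pads up to the length of similar clips and the one giant clip never inflates a whole batch of short ones. A hypothetical helper along these lines (the index lists it returns would go to a DataLoader via the batch_sampler argument):

```python
import random

def bucket_batches(lengths, batch_size, shuffle=True):
    """Group clip indices by similar length to minimize padding.

    lengths: list of clip lengths in frames (hypothetical input --
    in practice these would come from the dataset's annotations).
    Returns a list of index lists, one per batch.
    """
    # Sort indices by length so neighbours have similar lengths.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size]
               for i in range(0, len(order), batch_size)]
    if shuffle:
        # Shuffle batch order between epochs, but keep each bucket
        # intact so padding stays minimal.
        random.shuffle(batches)
    return batches
```

With this scheme the outlier clip ends up in a batch with the other longest clips (or alone, if nothing is close), which also bounds the worst-case padded batch size and hence VRAM.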