How to optimize a variable-input, variable-output network?

The optimization techniques I'm aware of are not applicable to the wide range of networks built for image-to-sequence or sequence-to-sequence problems, where both the image dimensions and the output sequence lengths can vary. Some applications that can use such models are:

  1. image captioning
  2. speech-to-text and vice versa
  3. OCR
  4. image-to-speech (e.g. describing photos to blind or almost blind people)

Such models can, for example, be built from a fully convolutional backbone (spatial or temporal, it doesn't really matter), followed by an attention mechanism and an RNN decoder. Variable-sized input adds further complexity because real user data cannot always be batched efficiently: large differences in input shapes force heavy padding, which wastes too much computation.
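One partial mitigation for the padding issue is size bucketing: group samples of similar length into the same batch so that padding stays small. Below is a rough sketch of what I mean; `BucketBatchSampler`, the `lengths` list, and `pad_collate` are placeholder names I'm introducing for illustration, not part of any library.

```python
import random
from torch.utils.data import Sampler

class BucketBatchSampler(Sampler):
    """Yields batches of indices whose samples have similar lengths,
    so that per-batch padding (and wasted computation) stays small."""

    def __init__(self, lengths, batch_size, bucket_size_multiplier=50):
        # `lengths` is the approximate size of each sample,
        # e.g. image width or number of audio frames.
        self.lengths = lengths
        self.batch_size = batch_size
        # Each bucket holds several batches worth of indices that get
        # sorted by length before being split into batches.
        self.bucket_size = batch_size * bucket_size_multiplier

    def __iter__(self):
        indices = list(range(len(self.lengths)))
        random.shuffle(indices)  # keep some randomness across epochs
        for start in range(0, len(indices), self.bucket_size):
            bucket = indices[start:start + self.bucket_size]
            # Sort within the bucket so neighbouring samples need little padding.
            bucket.sort(key=lambda i: self.lengths[i])
            for b in range(0, len(bucket), self.batch_size):
                yield bucket[b:b + self.batch_size]

    def __len__(self):
        return (len(self.lengths) + self.batch_size - 1) // self.batch_size
```

Such a sampler can be passed to a `DataLoader` via `batch_sampler=BucketBatchSampler(lengths, 16)` together with a padding `collate_fn` like `pad_collate`. It reduces wasted computation, but it doesn't remove shape variability, which is the part the optimizations below struggle with.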

As explained in "What does torch.backends.cudnn.benchmark do?", cuDNN benchmark mode, which does wonders for some classes of models, doesn't help here: it selects the fastest algorithm per input shape, so with constantly changing shapes it keeps re-benchmarking. Switching to float16 inference might give a minimal boost on Volta+ architectures, but Tensor Cores require fairly strict shape restrictions (dimensions that are multiples of 8) to be properly utilized, and that is hard to guarantee with a variable batch dimension. Exporting and converting to a TensorRT network is again problematic for variable input/output shape networks.
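To make the Tensor Core point concrete, here is roughly what I mean by the shape restrictions. This is only a sketch assuming an image-like N x C x H x W input and a generic `model`; `pad_to_multiple` and `infer` are made-up helper names.

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(x, multiple=8):
    """Pad the last two (spatial) dimensions up to a multiple of `multiple`,
    so FP16 convolutions/matmuls are more likely to hit Tensor Cores."""
    h, w = x.shape[-2:]
    pad_h = (multiple - h % multiple) % multiple
    pad_w = (multiple - w % multiple) % multiple
    # F.pad pads the last dim by (left, right) and the one before by (top, bottom).
    return F.pad(x, (0, pad_w, 0, pad_h)), (h, w)

@torch.no_grad()
def infer(model, image):
    # `model` and `image` (N x C x H x W) are placeholders for illustration.
    padded, _ = pad_to_multiple(image)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        return model(padded)
```

Even with this kind of padding, the batch dimension still varies from request to request, so it's unclear how much of the float16 speedup actually materializes in practice.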

Am I missing something, or are there no good general strategies for optimizing these and similar models?