For example, I’m training a model for joint video and voice recognition on mini-batches, using RNNs. The model looks like this:
                                      - Path 1 (Video RNN) -
                                     /                      \
    Data (each batch contains 3 videos)                      Fusing -> Result
                                     \                      /
                                      - Path 2 (Voice RNN) -
However, some of the videos have no voice information (no voice was recorded), so for those samples the input to the voice path is all 0s. I still have to feed the network both the video and the (all-zero) voice data. The all-zero input to Path 2 will clearly harm the final result when the two paths are fused.

Is there a way to block the voice path (Path 2) during training depending on whether voice information is available for a given sample? Note that when Path 2 is disabled, the voice feature used in fusion should be all 0s, not the arbitrary values that the layers in Path 2 produce in response to an all-zero input vector. The network also has to accept mini-batches as input, so a single batch may mix videos with and without voice.
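To make the desired behaviour concrete, here is a minimal NumPy sketch of one possible approach: multiply the output of Path 2 by a per-sample 0/1 mask just before fusion. Everything in it is an assumption made for illustration, not part of the actual model: `voice_path` is a hypothetical single dense layer standing in for the voice RNN, `has_voice` is a hypothetical per-sample availability flag, and fusion is assumed to be simple concatenation.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, voice_dim, video_feat_dim, voice_feat_dim = 3, 4, 6, 5

# Hypothetical stand-in for Path 2: a dense layer with a bias term, so an
# all-zero input still produces a nonzero feature (the problem described above).
W = rng.normal(size=(voice_dim, voice_feat_dim))
b = rng.normal(size=voice_feat_dim)

def voice_path(x):
    return np.tanh(x @ W + b)

def fuse(video_feat, voice_feat, has_voice):
    """Concatenate both features, zeroing the voice feature where it is missing."""
    # Shape (batch, 1) so the mask broadcasts over the feature dimension.
    mask = np.asarray(has_voice, dtype=voice_feat.dtype).reshape(-1, 1)
    return np.concatenate([video_feat, voice_feat * mask], axis=1)

# A mini-batch of 3 samples; the second one has no voice recording.
voice_in = rng.normal(size=(batch_size, voice_dim))
voice_in[1] = 0.0
has_voice = [1, 0, 1]

# Placeholder for the output of Path 1 (the video RNN is not modelled here).
video_feat = rng.normal(size=(batch_size, video_feat_dim))

voice_feat = voice_path(voice_in)   # row 1 is nonzero because of the bias
fused = fuse(video_feat, voice_feat, has_voice)

print(voice_feat[1])                 # arbitrary nonzero response to all-zero input
print(fused[1, video_feat_dim:])     # all zeros after masking
```

Because the mask is applied per sample, a single mini-batch can mix videos with and without voice. In a framework with automatic differentiation, the same elementwise multiplication should also pass a zero gradient back into Path 2 for the masked samples, so those samples would not update the voice layers; that part is not demonstrated by this forward-only sketch.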