Hi,
I’m trying to implement my own version of an MDLSTM-powered OCR as described here, but my training doesn’t work well (even though the loss is decreasing, I only ever predict blanks). I’ve narrowed it down to a problem in reshaping the output of my MDLSTM layer.
As an input I have x.shape = (batch_size, in_channels, height, width), and as the raw output of the layer I have a list of height × width torch.Tensor objects, each of shape (batch_size, out_channels).
What I did at first (and which turned out to be wrong) was a simple
torch.stack(hidden_states_direction).view((batch_size, self.out_channels, height, width))
which does not yield the result I expect, since it mixes the outputs together.
Let’s say that, having looped through my batch_size = 2 images of dimension (height=4, width=3) to produce 5 channels per pixel, I have the following list:
>>> r = [torch.rand(2, 5) for _ in range(12)]
>>> r
[tensor([[0.1348, 0.0710, 0.9240, 0.9735, 0.7842],
[0.5013, 0.7658, 0.1962, 0.3333, 0.5991]]), tensor([[0.7568, 0.1716, 0.7395, 0.6700, 0.9601],
[0.2582, 0.0923, 0.9973, 0.3343, 0.9009]]), tensor([[0.5934, 0.1546, 0.1858, 0.5376, 0.7067],
[0.8049, 0.6795, 0.3320, 0.4425, 0.7615]]), tensor([[0.7460, 0.8905, 0.6827, 0.4476, 0.0609],
[0.0105, 0.5343, 0.5208, 0.4784, 0.1590]]), tensor([[0.4839, 0.6299, 0.4922, 0.0787, 0.8225],
[0.5834, 0.1190, 0.2080, 0.4747, 0.3932]]), tensor([[0.0718, 0.3101, 0.3698, 0.9139, 0.9014],
[0.5892, 0.1495, 0.6075, 0.5335, 0.0016]]), tensor([[0.5517, 0.9652, 0.5177, 0.7239, 0.0961],
[0.0418, 0.6147, 0.5976, 0.5161, 0.6613]]), tensor([[0.2002, 0.9971, 0.1544, 0.4872, 0.4858],
[0.3859, 0.7651, 0.6934, 0.2272, 0.8044]]), tensor([[0.2976, 0.6103, 0.3540, 0.8902, 0.8026],
[0.2527, 0.7077, 0.4747, 0.6466, 0.1127]]), tensor([[0.1744, 0.9700, 0.2080, 0.1341, 0.7072],
[0.9541, 0.4466, 0.0883, 0.5382, 0.3992]]), tensor([[0.5413, 0.7202, 0.6576, 0.0709, 0.3593],
[0.9235, 0.5360, 0.0626, 0.8336, 0.3391]]), tensor([[0.8896, 0.2198, 0.6680, 0.3688, 0.0389],
[0.8154, 0.5059, 0.9234, 0.9816, 0.5019]])]
Now, when I want to put all that back into the shape (batch_size, self.out_channels, height, width), I get
>>> torch.stack(r).view(2, 5, 4, 3)
tensor([[[[0.1348, 0.0710, 0.9240],
[0.9735, 0.7842, 0.5013],
[0.7658, 0.1962, 0.3333],
[0.5991, 0.7568, 0.1716]],
[[0.7395, 0.6700, 0.9601],
[0.2582, 0.0923, 0.9973],
[0.3343, 0.9009, 0.5934],
[0.1546, 0.1858, 0.5376]],
...
whereas what I would actually want, for the same target shape (2, 5, 4, 3), is something like
tensor([[[[0.1348, 0.7568, 0.5934],
[0.7460, 0.4839, 0.0718],
[0.5517, 0.2002, 0.2976],
[0.1744, 0.5413, 0.8896]],
[[0.0710, 0.1716, 0.1546],
[0.8905, 0.6299, 0.3101],
[0.9652, 0.9971, 0.6103],
[0.9700, 0.7202, 0.2198]],
...
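For what it’s worth, here is my current guess at the fix (I’m not sure it’s idiomatic): since torch.stack gives (height*width, batch_size, out_channels) with the first axis enumerating pixels in row-major order, I think I should view that axis as (height, width) first and then permute the batch and channel axes to the front, rather than viewing directly into the target shape:

```python
import torch

batch_size, out_channels, height, width = 2, 5, 4, 3

# One tensor per pixel, in row-major order, each (batch_size, out_channels).
r = [torch.rand(batch_size, out_channels) for _ in range(height * width)]

# stack -> (height*width, batch_size, out_channels)
# view  -> (height, width, batch_size, out_channels)
# permute -> (batch_size, out_channels, height, width)
out = (torch.stack(r)
       .view(height, width, batch_size, out_channels)
       .permute(2, 3, 0, 1)
       .contiguous())

# Sanity check: channel c of pixel (h, w) for batch b should be r[h*width + w][b][c]
assert torch.equal(out[0, :, 0, 0], r[0][0])
```

This seems to give the ordering I described above, but I’d appreciate confirmation that this is the right way to do it.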
Any help would be greatly appreciated!