Hi,
I’m trying to implement my own version of an MDLSTM-powered OCR as described here, but my training doesn’t work well (even though the loss is decreasing, I only ever predict blanks). I’ve narrowed it down to a problem in reshaping the output of my MDLSTM layer.
As an input I have x.shape = (batch_size, in_channels, height, width), and as the raw output of the layer I have a list of height × width torch.Tensor objects, each of shape (batch_size, out_channels).
What I did at first (and which turned out to be wrong) was a simple
torch.stack(hidden_states_direction).view((batch_size, self.out_channels, height, width))
which does not yield the result I expect, since it mixes the outputs together.
Let’s say that, having looped through my batch_size = 2 images of dimension (height=4, width=3) to produce 5 channels per pixel, I have the following list:
>>> r = [torch.rand(2, 5) for _ in range(12)]
>>> r
[tensor([[0.1348, 0.0710, 0.9240, 0.9735, 0.7842],
[0.5013, 0.7658, 0.1962, 0.3333, 0.5991]]), tensor([[0.7568, 0.1716, 0.7395, 0.6700, 0.9601],
[0.2582, 0.0923, 0.9973, 0.3343, 0.9009]]), tensor([[0.5934, 0.1546, 0.1858, 0.5376, 0.7067],
[0.8049, 0.6795, 0.3320, 0.4425, 0.7615]]), tensor([[0.7460, 0.8905, 0.6827, 0.4476, 0.0609],
[0.0105, 0.5343, 0.5208, 0.4784, 0.1590]]), tensor([[0.4839, 0.6299, 0.4922, 0.0787, 0.8225],
[0.5834, 0.1190, 0.2080, 0.4747, 0.3932]]), tensor([[0.0718, 0.3101, 0.3698, 0.9139, 0.9014],
[0.5892, 0.1495, 0.6075, 0.5335, 0.0016]]), tensor([[0.5517, 0.9652, 0.5177, 0.7239, 0.0961],
[0.0418, 0.6147, 0.5976, 0.5161, 0.6613]]), tensor([[0.2002, 0.9971, 0.1544, 0.4872, 0.4858],
[0.3859, 0.7651, 0.6934, 0.2272, 0.8044]]), tensor([[0.2976, 0.6103, 0.3540, 0.8902, 0.8026],
[0.2527, 0.7077, 0.4747, 0.6466, 0.1127]]), tensor([[0.1744, 0.9700, 0.2080, 0.1341, 0.7072],
[0.9541, 0.4466, 0.0883, 0.5382, 0.3992]]), tensor([[0.5413, 0.7202, 0.6576, 0.0709, 0.3593],
[0.9235, 0.5360, 0.0626, 0.8336, 0.3391]]), tensor([[0.8896, 0.2198, 0.6680, 0.3688, 0.0389],
[0.8154, 0.5059, 0.9234, 0.9816, 0.5019]])]
Now, when I want to put all that back into the shape (batch_size, self.out_channels, height, width), I get
>>> torch.stack(r).view(2, 5, 4, 3)
tensor([[[[0.1348, 0.0710, 0.9240],
[0.9735, 0.7842, 0.5013],
[0.7658, 0.1962, 0.3333],
[0.5991, 0.7568, 0.1716]],
[[0.7395, 0.6700, 0.9601],
[0.2582, 0.0923, 0.9973],
[0.3343, 0.9009, 0.5934],
[0.1546, 0.1858, 0.5376]],
...
whereas what I would actually want, for the same target shape (2, 5, 4, 3), is something like
tensor([[[[0.1348, 0.7568, 0.5934],
[0.7460, 0.4839, 0.0718],
[0.5517, 0.2002, 0.2976],
[0.1744, 0.5413, 0.8896]],
[[0.0710, 0.1716, 0.1546],
[0.8905, 0.6299, 0.3101],
[0.9652, 0.9971, 0.6103],
[0.9700, 0.7202, 0.2198]],
...
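For what it’s worth, here is my current guess at the fix (I’m not sure it’s idiomatic): since torch.stack gives (height*width, batch_size, out_channels) with the first axis enumerating pixels in row-major order, I think I should view that axis as (height, width) first and then permute the batch and channel axes to the front, rather than viewing directly into the target shape:

```python
import torch

batch_size, out_channels, height, width = 2, 5, 4, 3

# One tensor per pixel, in row-major order, each (batch_size, out_channels).
r = [torch.rand(batch_size, out_channels) for _ in range(height * width)]

# stack -> (height*width, batch_size, out_channels)
# view  -> (height, width, batch_size, out_channels)
# permute -> (batch_size, out_channels, height, width)
out = (torch.stack(r)
       .view(height, width, batch_size, out_channels)
       .permute(2, 3, 0, 1)
       .contiguous())

# Sanity check: channel c of pixel (h, w) for batch b should be r[h*width + w][b][c]
assert torch.equal(out[0, :, 0, 0], r[0][0])
```

This seems to give the ordering I described above, but I’d appreciate confirmation that this is the right way to do it.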
Any help would be greatly appreciated!