Reshaping output to fit in CTC loss

Hi fellows,

I have a doubt. I am working on a 2D CNN network for OCR. After my 6th CNN layer, the output tensor shape is (B, C, H, W). I have to pass this output to a linear layer to map it to the number of classes (76) required for CTC loss. How should I reshape my CNN output tensor to pass it to the linear layer? Also, after the linear layer I have to pass the result to softmax and CTC, which require a 3D tensor. How should I reshape the linear layer output to pass it to CTC?
CNN output: (64, 512, 1, 28). I have to pass this tensor to the linear layer, and the final output should have 76 channels in my case. If I flatten it using view(B, -1), it becomes (64, 512*1*28) = (64, 14336), and the final output will look like (64, 76). But then how should I pass it to CTC, since I have no time step information? Please help.

There are multiple possible approaches, and it depends on how the activation shape is interpreted.
E.g. with [64, 512, 1, 28] you could squeeze the H dimension (which is 1) and use the W dimension as the “sequence” dimension (it’s one of the spatial dimensions).
In this case, you could permute the activation so that the linear layer is applied on each time step, and then permute it back to the shape expected by CTCLoss.
Something like this could work:

import torch
import torch.nn as nn
import torch.nn.functional as F

B, C, H, W = 64, 512, 1, 28
x = torch.randn(B, C, H, W)

x = x.squeeze(2) # remove H dim
x = x.permute(0, 2, 1).contiguous() # permute to [B, H, C=T]
T = x.size(1)

linear = nn.Linear(C, 76)
out = linear(x) # will be applied on each timestep (same as looping over dim1)
print(out.shape)
# > torch.Size([64, 28, 76]) # corresponds to [B, T, out_features]

criterion = nn.CTCLoss() # expects an input in [T, N=B, C]
out = out.permute(1, 0, 2).contiguous() # permute to [T, B, C]
out = F.log_softmax(out, dim=2) # create log probabilities

# create random target and lengths
target = torch.randint(1, 76, size=(B, 10)) # class indices in [1, 75]; 0 is reserved for the CTC blank
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.randint(1, 10, (B,))

# calculate loss
loss = criterion(out, target, input_lengths, target_lengths)
loss.backward()

Hello, thanks for your answer. I still have some doubts.
What was the motivation for this interpretation, i.e. choosing the width dimension as the sequence dimension?
In the permute line, the H dimension is still mentioned in the comment even though it was squeezed out in the previous line. Also, why C = T? Shouldn’t it be W = T?
Also, shouldn’t the input to a linear layer be a 2D tensor? Shouldn’t we flatten it? You have passed a 3D tensor to the linear layer. Will that work?

One idea could be that a spatial dimension is comparable to a temporal dimension, in that both encode sequential information (i.e. neighboring values might be correlated).

Yes, you are right, my comments are wrong; the permuted shape should read [B, W=T, C]. T = x.size(1) still works, as it assigns W to T.
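A quick way to see this (a tiny sketch using the same shapes as above):

import torch

x = torch.randn(64, 512, 28) # [B, C, W] after squeezing H
x = x.permute(0, 2, 1).contiguous() # -> [B, W, C]
print(x.size(1)) # > 28, i.e. T picks up W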

Yes, it will work as described in the docs. Linear layers accept an input of [batch_size, *, in_features] where * denotes additional dimensions. They will be applied on each sample in the additional dimensions as if you were using a for loop.
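Here is a small check illustrating this (a minimal sketch, assuming the 512-feature, 76-class setup from the example above): applying the linear layer to the 3D tensor gives the same result as looping over the time dimension manually.

import torch
import torch.nn as nn

linear = nn.Linear(512, 76)
x = torch.randn(64, 28, 512) # [batch_size, T, in_features]

out_batched = linear(x) # applied to every step along dim1 at once
out_looped = torch.stack([linear(x[:, t]) for t in range(x.size(1))], dim=1)

print(out_batched.shape) # > torch.Size([64, 28, 76])
print(torch.allclose(out_batched, out_looped, atol=1e-6)) # > True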


Hello! Recently I ran several experiments on OCR with CTC for my own research, and I would suggest trying the following:

Instead of compressing some of the CNN output dimensions to one (in your case H is squeezed to 1), I would recommend changing the CNN’s kernel_size to get a wider dimension, so that you can apply some pooling/attention operation over it:

# say the input image size is (b, c, h, w) = (16, 3, 64, 1024)
# and the CNN output ('out' below) has size (16, 2048, 128, 512)
# then:

pooling = nn.AdaptiveAvgPool2d((512, 512))

out = out.view(16, 2048, 128 * 512) # merge the spatial dims -> [16, 2048, 65536]
out = pooling(out) # a 3D input is treated as (C, H, W), so each of the 16 samples is pooled to (512, 512) -> [16, 512, 512]

Of course you can use classical max pooling or some custom self-attention instead (see the drop-in sketch below), but in my experiments with IAM/Bentham I found average pooling to work best.
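For reference, the max-pooling variant only changes the aggregation module; this is a minimal sketch with the same output size and the same 3D/4D input handling as the adaptive average pooling above:

import torch.nn as nn

# drop-in alternative to the adaptive average pooling used above
pooling = nn.AdaptiveMaxPool2d((512, 512))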

You can also try merging different dimensions, for example (16, 2048*128, 512), etc.
Experiment to find what works best for your data.

As for the background behind this: by keeping the dimension larger than one we get a much wider feature map, and the aggregation then reduces the overall size while still extracting valuable features.


Thank you, I will try this for sure. Do you apply this average pooling after the last layer, right before passing to CTC, and not between every CNN layer? My network right now doesn’t have any pooling layers; it’s a pure CNN followed by one FC layer before CTC.

Hi, I’d like some advice. My mentor told me to change the way I feed the CNN outputs to CTC. He said: for each of the 100 time steps of CTC, feed three spatial columns from the CNN output instead of one, so that each CTC time step sees three neighboring columns. I thought CTC accepts the whole CNN output. Can the CNN output be sliced like this before feeding it to CTC? What do you think?
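For what it’s worth, one way to implement that idea (a minimal sketch, not necessarily what your mentor has in mind; the shapes are assumptions, with 100 columns as in your description) is to build overlapping 3-column windows with Tensor.unfold and fold each window into the feature dimension:

import torch
import torch.nn.functional as F

# assumed shapes: batch 64, 512 channels, H already squeezed out, W = 100 columns
B, C, W = 64, 512, 100
feat = torch.randn(B, C, W) # CNN output after squeezing H

# pad W by one column on each side so every time step gets a full 3-column window
feat = F.pad(feat, (1, 1)) # [B, C, W + 2]
windows = feat.unfold(dimension=2, size=3, step=1) # [B, C, W, 3]

# merge the 3 neighboring columns into the feature dimension
windows = windows.permute(0, 2, 1, 3).reshape(B, W, C * 3) # [B, T=100, 3*C]
print(windows.shape) # > torch.Size([64, 100, 1536])

Each of the 100 time steps then carries 3*C features, and a linear layer mapping 3*C to the number of classes can be applied per time step exactly as in the example earlier in this thread.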

The aggregation was done once, after the CNN, before passing to the dense layer.

Hi Stannis,
My input size is 64, 1, 32, 100

Let’s say that by increasing the padding and decreasing the kernel size from 3 to 2, I enlarge my dimensions to (64, 1, 64, 128), and then do adaptive average pooling with output size (256, 256). Will this be alright?

Also, why did you multiply your spatial dimensions before feeding them to the pooling? Can’t it be done without multiplying?

Thanks

Hello there, the model flow could be presented as follows:

CNN → multiply → pooling → linear → log_softmax
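To make the flow concrete, here is a minimal sketch of that pipeline. The shapes are deliberately scaled down so the snippet runs quickly (the real feature map from the earlier snippet would be (16, 2048, 128, 512)), and 76 output classes are assumed as in the original question:

import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 76
cnn_out = torch.randn(4, 256, 16, 64) # CNN output [B, C, H, W] (scaled-down stand-in)

x = cnn_out.view(4, 256, 16 * 64) # "multiply": merge H and W -> [4, 256, 1024]
x = nn.AdaptiveAvgPool2d((128, 128))(x) # pooling (3D input treated as (C, H, W)) -> [4, 128, 128]
x = nn.Linear(128, num_classes)(x) # linear, applied per time step -> [4, 128, 76]
log_probs = F.log_softmax(x, dim=2) # [B, T=128, num_classes]

log_probs = log_probs.permute(1, 0, 2) # [T, B, C] as expected by nn.CTCLoss
print(log_probs.shape) # > torch.Size([128, 4, 76])

The result is the [T, N, C] log-probability tensor that nn.CTCLoss expects, with the pooled dimension acting as the time axis.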

Speaking about multiplying: as per the docs (link), the pooling input can be 4D or 3D, but since the CTC loss input has to be 3D, it’s better to first merge the dimensions and then apply pooling to them, so the relatively rough merging operation is ‘smoothed’ by the pooling.