Best way to preprocess image


I am trying to decide on the best way to preprocess an image I have. Imagine a photo of a table with 10 rows and 3 columns, where each cell contains a handwritten digit. I built an MNIST model and retrained it with my custom data, but my question is: what is the best way to pass in such input? I want to classify the digits in each row, for example:

| 5 | 6 | 0 |

| 0 | 1 | 2 |

Each row is a 3-digit number. Since the image size is fixed, I was thinking of dividing the image into rows, then each row into single digits, passing each digit to the model for inference, and finally reconstructing the results. I also looked into using a pretrained SVHN model, but that data differs from handwritten digits.

Thank you for any feedback!

This sounds like a reasonable approach, especially if your input images follow a specific pattern and you can easily isolate each digit.

What kind of model did you look at and how is the input different?


@ptrblck I appreciate you taking the time to reply! The model I looked into was trained on the SVHN dataset (Street View House Numbers). It can recognize up to 5 digits in one go, but the problem is that its data comes with bounding-box annotations and the digits are not handwritten. If I were to build something like that, I would have to do a lot of data preprocessing. I thought going with a good MNIST model would be far easier; the only thing I still needed to figure out was how to process the image, as I mentioned.

Thank you again,

Thanks for the information.
I would try to avoid the preprocessing with bounding boxes: while it would give you more flexibility, it would also add complexity to the code, which apparently isn't needed for your use case.
Since the digits are located in the same zones, you could write a slice_image method which uses a passed size to return a batch of digits from a single image.

While there are many ways to write this code, here is a small example using unfold:

import torch

# create a fake input tensor: a 4x4 grid of 32x32 blocks
x = torch.stack([torch.ones(32, 32) * idx for idx in range(16)])
x = x.view(4, 4, 32, 32).permute(0, 2, 1, 3).contiguous().view(4*32, 4*32)

def slice_image(image, size):
    batch = image.unfold(0, size, size).unfold(1, size, size) # [4, 4, 32, 32]
    # flatten to [16, 32, 32]
    batch = batch.contiguous().view(-1, size, size)
    # add channel dim
    batch = batch.unsqueeze(1)
    return batch
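As a quick sanity check, here is a small self-contained sketch (repeating the fake input and slice_image from above) showing that each sliced sample contains exactly one of the 16 constant block values, in row-major order:

```python
import torch

# fake 128x128 input: a 4x4 grid of constant 32x32 blocks with values 0..15
x = torch.stack([torch.ones(32, 32) * idx for idx in range(16)])
x = x.view(4, 4, 32, 32).permute(0, 2, 1, 3).contiguous().view(4 * 32, 4 * 32)

def slice_image(image, size):
    # unfold both spatial dims into (size x size) patches, then flatten
    batch = image.unfold(0, size, size).unfold(1, size, size)
    return batch.contiguous().view(-1, size, size).unsqueeze(1)

batch = slice_image(x, 32)
print(batch.shape)        # torch.Size([16, 1, 32, 32])
print(batch[3].unique())  # tensor([3.])
```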

Note that you would create a batch of 16 samples in this way, so the batch_size in the DataLoader would be a multiplicative factor and you would need to flatten the batch dimensions in your training loop:

for data in loader:
    # data.shape is [batch_size, 16, 1, 32, 32]
    data = data.view(-1, 1, 32, 32)
    # data.shape is [batch_size*16, 1, 32, 32]
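To get the per-row numbers back afterwards, a minimal sketch (assuming the model's per-cell digit predictions come back flattened in row-major order, as in the slicing above) could look like:

```python
import torch

# hypothetical predictions: one digit per table cell, row-major order
preds = torch.tensor([5, 6, 0,   # first table row
                      0, 1, 2])  # second table row

# reshape to one tensor row per table row (3 digits each)
rows = preds.view(-1, 3)

# concatenate the digits of each row into a single integer
numbers = [int("".join(str(d.item()) for d in row)) for row in rows]
print(numbers)  # [560, 12]
```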

Sorry, correct me if I misunderstood this: you created a random “image” and then batched each square as an image ready to load for inference?

I created a fake image input, which contains 16 smaller blocks, each in the shape 32x32.
This would correspond to the input you are already dealing with.
The slice_image method uses unfold to create a tensor of shape [16, 1, 32, 32], which could be done e.g. in Dataset.__getitem__.
Since the Dataset then already returns a batch of the “small” 32x32 images, you would have to reshape the final data tensor if you are using a DataLoader, since it adds another batch dimension on top.
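A minimal sketch of how this could be wired into a Dataset (the TableDataset class and the random images here are hypothetical, just to illustrate the shapes):

```python
import torch
from torch.utils.data import Dataset, DataLoader

def slice_image(image, size):
    # same slicing as above: grid of (size x size) blocks -> [num_blocks, 1, size, size]
    batch = image.unfold(0, size, size).unfold(1, size, size)
    return batch.contiguous().view(-1, size, size).unsqueeze(1)

class TableDataset(Dataset):
    # hypothetical dataset wrapping a list of full table images
    def __init__(self, images, size=32):
        self.images = images
        self.size = size

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # each sample is already a batch of digit crops: [16, 1, 32, 32]
        return slice_image(self.images[idx], self.size)

images = [torch.rand(128, 128) for _ in range(8)]
loader = DataLoader(TableDataset(images), batch_size=2)
for data in loader:
    # DataLoader adds its own batch dim: [2, 16, 1, 32, 32]
    data = data.view(-1, 1, 32, 32)  # flatten to [32, 1, 32, 32]
```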
