I’ve been putting together a small PyPI package that takes a text dataset and an image dataset and squishes them together to make a faux OCR dataset. Initially this was done for some folks at LAION, but I realized I should probably be showing it off to a wider group because (a) people here are more familiar with PyTorch best practices and (b) it’s probably useful to more than just two or three individuals.
It’s pip-installable right now as version 0.0.10. I expect there will be changes before 0.1.0.
```python
# %pip install text-overlay-dataset
from text_overlay_dataset import TextOverlayDataset
from PIL import Image

ds = TextOverlayDataset(
    image_dataset=[Image.new("RGB", (256, 256))],
    text_dataset=["Hello", "World"],
    fonts="<path to ttf dir>",
)

composite_image, text, more = ds[0]
```
The “simple thing” is made to be easy. The “hard thing” is made to be possible. Generating plain text overlays is trivial, but the library also supports random translations, rotations, quad distortion, and text blurring.
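None of this is the library’s internal code, but the general idea behind these augmentations can be sketched in plain PIL: render the text onto its own transparent layer, transform that layer (rotate, translate, blur), then composite it back over the base image.

```python
from PIL import Image, ImageDraw, ImageFont, ImageFilter

# Illustrative sketch only (not the library's implementation).
base = Image.new("RGB", (256, 256), "white")

# Render the text onto a transparent layer so it can be transformed
# independently of the background.
layer = Image.new("RGBA", base.size, (0, 0, 0, 0))
draw = ImageDraw.Draw(layer)
draw.text((64, 112), "Hello", fill=(0, 0, 0, 255), font=ImageFont.load_default())

# Rotate the text layer about the image center; a translation would just
# be a different paste offset.
layer = layer.rotate(15, center=(128, 128))

# Blur only the text layer, then composite it over the base image.
layer = layer.filter(ImageFilter.GaussianBlur(radius=1))
composite = Image.alpha_composite(base.convert("RGBA"), layer).convert("RGB")
```

Quad distortion follows the same pattern, just with a perspective-style warp of the text layer instead of a rotation.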
The font sizes are configurable and it’s possible to tell the library to favor larger fonts first.
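“Favor larger fonts first” can be understood as walking candidate sizes from largest to smallest and keeping the first one that fits. The helper below is a hypothetical sketch of that idea, using a crude width-per-character heuristic (`char_aspect`) in place of real glyph metrics; an actual implementation would measure with PIL’s `font.getbbox()`.

```python
def largest_fitting_size(text, max_width, sizes=range(72, 7, -1), char_aspect=0.6):
    """Hypothetical sketch: return the first (largest) font size whose
    estimated rendered width fits within max_width, or None if none fit.
    char_aspect approximates average glyph width / font size."""
    for size in sizes:
        if len(text) * size * char_aspect <= max_width:
            return size
    return None
```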
All of the different configurations for the generator are in the docstring of the TextOverlayDataset class constructor.
My two biggest focuses for the immediate future are automatic text wrapping and support for streaming datasets. Right now the library assumes both datasets support random access, but that’s not always feasible for big image datasets, and most of the examples in torchtext are streaming, too.
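One way the streaming case might work (this is a sketch of the idea, not the library’s planned API; the function name is made up) is to pair two iterators directly instead of indexing, which is the same shape `torch.utils.data.IterableDataset.__iter__` would take:

```python
from itertools import cycle

def stream_overlay_pairs(image_stream, text_stream):
    """Sketch: pair streaming sources without random access by zipping
    the text stream against a repeating image stream. Note that cycle()
    buffers the images it has seen in memory; a real streaming adapter
    would re-open or shard the underlying source instead."""
    for image, text in zip(cycle(image_stream), text_stream):
        yield image, text
```

Usage: iterating `stream_overlay_pairs(images, texts)` yields `(image, text)` tuples until the text stream is exhausted.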
Here’s a screenshot from a notebook:
It’s MIT licensed and available on GitHub at JosephCatrambone/PyTorchTextOverlayDataset (“A PyTorch Dataset Adapter to Composite Text and Images”).