AnyGenOCRData

Generate multi-line text images for OCR or PDF parsing.

What is it for?

Are you struggling with a lack of data for OCR and PDF parsing?

This tool generates multi-line text images for OCR and PDF parsing.

It supports multiple themes, multiple code formats, and multiple languages.

How to use?

See the sample code below, or the examples.ipynb notebook in the repository.

How to install?

1. Install wkhtmltopdf; see https://wkhtmltopdf.org/ for instructions.

2. Install the Python dependencies: pip install pygments imgkit pillow
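
Before running the samples, it can help to confirm that both install steps took effect. The following is a minimal sketch, not part of AnyGenOCRData: the helper name `missing_prerequisites` is hypothetical, and it assumes that imgkit needs the `wkhtmltoimage` executable (shipped with wkhtmltopdf) to be on the PATH.

```python
import importlib.util
import shutil

# Import names of the Python dependencies (pillow is imported as PIL).
REQUIRED_MODULES = ("pygments", "imgkit", "PIL")


def missing_prerequisites(modules=REQUIRED_MODULES, binary="wkhtmltoimage"):
    """Return the names of prerequisites that could not be found."""
    missing = [name for name in modules if importlib.util.find_spec(name) is None]
    # Assumption: imgkit shells out to the wkhtmltoimage executable.
    if shutil.which(binary) is None:
        missing.append(binary)
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    if problems:
        print("Missing: " + ", ".join(problems))
    else:
        print("All prerequisites found.")
```

An empty list means the modules can be located and the executable is on the PATH; it does not prove that rendering itself will succeed.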

Samples

from PIL import Image
from anygenocrdata import AnyGenOCRData

model = AnyGenOCRData()


# Generate an image from Chinese text

text1 = """
欢迎来到 gpt-oss 系列，OpenAI 的开放权重模型 旨在提供强大的推理能力、代理任务和多样的开发者使用场景。

我们发布了这两种开放模型：

gpt-oss-120b — 适用于生产环境、通用用途和高推理需求的场景，可以放入单个 80GB GPU（如 NVIDIA H100 或 AMD MI300X）中（117B 参数，5.1B 活动参数）
gpt-oss-20b — 适用于低延迟和本地或特定用途的场景（21B 参数，3.6B 活动参数）
这两种模型都经过了我们的 和谐响应格式 训练，仅应与和谐格式一起使用，否则将无法正常工作。
"""

model.invoke(
    content=text1,
    htmlfile='./assets/1.html',
    imgfile='./assets/1.png',
    file_suffix='txt',
)
Image.open('./assets/1.png')  # open the rendered image to inspect it


# Generate an image from English text

text2 = """
Welcome to the gpt-oss series, OpenAI’s open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases.

We’re releasing two flavors of these open models:

gpt-oss-120b — for production, general purpose, high reasoning use cases that fit into a single 80GB GPU (like NVIDIA H100 or AMD MI300X) (117B parameters with 5.1B active parameters)
gpt-oss-20b — for lower latency, and local or specialized use cases (21B parameters with 3.6B active parameters)
Both models were trained on our harmony response format and should only be used with the harmony format as it will not work correctly otherwise.
"""

model.invoke(
    content=text2,
    htmlfile='./assets/2.html',
    imgfile='./assets/2.png',
    theme=None,
    file_suffix='txt',
)
Image.open('./assets/2.png')  # open the rendered image to inspect it


# Generate an image from source code (SQL)

text3 = """
CREATE TABLE Beds (State VARCHAR(50), Beds INT); 
INSERT INTO Beds (State, Beds) VALUES ('California', 100000), ('Texas', 85000), ('New York', 70000);
"""

model.invoke(
    content=text3,
    htmlfile='./assets/3.html',
    imgfile='./assets/3.png',
    theme=None,
    file_suffix='sql',
)
Image.open('./assets/3.png')  # open the rendered image to inspect it
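
The samples above render one image at a time. For OCR training you usually want many images paired with their ground-truth text, so here is a rough sketch of a batch loop. It is not part of AnyGenOCRData: `generate_dataset` and the `labels.jsonl` layout are hypothetical, and the only assumption about the tool is that `invoke` accepts the keyword arguments used in the samples (`content`, `htmlfile`, `imgfile`, `file_suffix`).

```python
import json
from pathlib import Path


def generate_dataset(invoke, samples, out_dir, labels_name="labels.jsonl"):
    """Render each (text, file_suffix) sample and record image/text pairs.

    `invoke` is any callable that accepts the keyword arguments shown in
    the samples: content, htmlfile, imgfile, file_suffix.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    records = []
    for index, (text, suffix) in enumerate(samples, start=1):
        html_path = out / f"{index}.html"
        img_path = out / f"{index}.png"
        invoke(
            content=text,
            htmlfile=str(html_path),
            imgfile=str(img_path),
            file_suffix=suffix,
        )
        records.append({"image": img_path.name, "text": text, "file_suffix": suffix})
    # One JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
    with open(out / labels_name, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return records
```

With the objects from the samples, a call might look like `generate_dataset(model.invoke, [(text1, 'txt'), (text3, 'sql')], './assets/batch')`. Passing the callable in, rather than constructing the model inside the function, keeps the loop easy to test with a stub.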
