Add preprocessing pipeline for NavOCR dataset#26

Draft
exceldra5 wants to merge 1 commit into main from preprocessor_v2


Conversation

@exceldra5
Collaborator

Goal

NavOCR preprocessing code

  • CLIP-based exterior filtering
  • OCR of all text in the filtered images
  • Text matching via Levenshtein distance (with a DeepL API translation fallback)
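The matching step above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `similarity` is a normalized Levenshtein ratio, the 0.5 threshold matches the value reported in the paper feedback, and `translate` is a stand-in for the DeepL API call used as a fallback.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Levenshtein ratio in [0, 1]; 1.0 means identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def match(query, ocr_texts, translate=None, threshold=0.5):
    """Return the OCR string best matching `query`, or None.

    `translate` is a placeholder for the DeepL API call, tried only
    when the untranslated query matches nothing above the threshold.
    """
    def best(q):
        cand = max(ocr_texts, key=lambda t: similarity(q, t), default=None)
        return cand if cand is not None and similarity(q, cand) >= threshold else None

    found = best(query)
    if found is None and translate is not None:
        found = best(translate(query))
    return found
```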

Environment

Branch / Commit: preprocessor_v2
Language / Framework: Python
Related Branches: crawl_v2, backup-private-ver

Code Update

  • Code updated based on the backup-private-ver branch
  • Uses the CLIP prompts from the backup-private-ver branch
  • Executable entry points written using the console_scripts approach from the main branch
  • Additional dependencies configured via extras_require in setup.py:
# 1. Install extras for preprocessing and translation
pip install -e ".[preprocess,translate]"

# 2. Install PaddleOCR (manual step — difficult to automate due to CUDA variations)
pip install paddlepaddle paddleocr        # Apple Silicon / CPU
# or
pip install paddlepaddle-gpu paddleocr    # NVIDIA GPU
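For context, the extras_require declaration behind step 1 might look roughly like the sketch below. The project name and package lists are illustrative assumptions, not the repository's actual dependency list.

```python
# setup.py (sketch) -- names below are illustrative assumptions
from setuptools import find_packages, setup

setup(
    name="navocr-preprocess",  # hypothetical project name
    packages=find_packages(),
    extras_require={
        # `pip install -e ".[preprocess,translate]"` pulls in both groups.
        # paddlepaddle/paddleocr are installed manually (CUDA-dependent),
        # so they are deliberately not listed here.
        "preprocess": ["pillow", "python-Levenshtein"],
        "translate": ["deepl"],
    },
)
```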
  • Environment check script:
python scripts/check_preprocess_env.py
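A minimal sketch of what such an environment check might do: verify that the required packages are importable and the needed environment variables are set. The module list and variable names here are assumptions, not the actual contents of check_preprocess_env.py.

```python
import importlib.util
import os

# Illustrative assumptions, not the script's actual checklist
PREPROCESS_MODULES = ["paddleocr", "paddle", "Levenshtein"]
TRANSLATE_ENV_VARS = ["DEEPL_API_KEY"]

def check_env(modules, env_vars):
    """Report which modules are not importable and which env vars are unset."""
    report = {"missing_modules": [], "missing_env": []}
    for name in modules:
        if importlib.util.find_spec(name) is None:
            report["missing_modules"].append(name)
    for var in env_vars:
        if not os.environ.get(var):
            report["missing_env"].append(var)
    return report

if __name__ == "__main__":
    print(check_env(PREPROCESS_MODULES, TRANSLATE_ENV_VARS))
```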

Paper Feedback

| No. | Item | Paper §3.1 | Code |
|---|---|---|---|
| 1 | CLIP prompt | Single prompt: "an exterior photo of the place or a photo that includes a signboard" | 3-pass cascade using 3 different prompt sets |
| 2 | CLIP threshold semantics | Single threshold: 0.7 | 3 passes × 0.7 (AND semantics) |
| 3 | Levenshtein threshold | 0.5 | 0.5 |
| 4 | DeepL translation fallback | None | Re-matching after Korean↔English translation |
| 5 | Number of images/queries | 20 images | 30 images |
| 6 | Label format | Simple tuple: ("file_name", (x₁,y₁,w₁,h₁), …, (xₙ,yₙ,wₙ,hₙ)) | COCO JSON (images, annotations, categories) |
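The label-format change in row 6 can be sketched as a conversion from the paper's simple tuples to a COCO-style JSON dict. Field choices beyond the images/annotations/categories skeleton, and the category name, are assumptions for illustration.

```python
def tuples_to_coco(samples, category_name="signboard_text"):
    """Convert ("file_name", (x, y, w, h), ...) tuples to a COCO-style dict.

    `category_name` is a hypothetical single category; the real dataset
    may define different categories.
    """
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": 1, "name": category_name}],
    }
    ann_id = 1
    for img_id, sample in enumerate(samples, 1):
        file_name, *boxes = sample
        coco["images"].append({"id": img_id, "file_name": file_name})
        for (x, y, w, h) in boxes:
            coco["annotations"].append({
                "id": ann_id,
                "image_id": img_id,
                "category_id": 1,
                "bbox": [x, y, w, h],  # COCO uses [x, y, width, height]
                "area": w * h,
                "iscrowd": 0,
            })
            ann_id += 1
    return coco
```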
  • The CLIP prompt was applied in a 3-stage cascade as described below, using a (negative, positive) prompt pair for classification at each stage.
# each entry is a (negative_prompt, positive_prompt) pair
PROMPT_SETS: list[tuple[str, str]] = [
	(
		"a photo that are not contain signboard or a photo of a signboard against a plain background",
		"a photo of a store exterior",
	),
	(
		"a photo that are not contain signboard or a photo of a signboard against a plain background",
		"a photo of a store exterior with a signboard and surrounding environment",
	),
	(
		"a online logo of a brand",
		"a photo of a store exterior with a signboard and surrounding environment",
	),
]
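A minimal sketch of the 3-pass AND-semantics filter built on such prompt pairs. Assumptions: the first prompt in each pair acts as the negative, the per-pass probability comes from a two-way softmax over the pair (as in CLIP zero-shot classification), and `score` is a stand-in for the CLIP image-text similarity that the real pipeline would compute.

```python
import math

def passes_cascade(image, prompt_sets, score, threshold=0.7):
    """Keep an image only if it clears every pass (AND semantics).

    `score(image, prompt)` is a placeholder for a CLIP image-text
    similarity logit; the real pipeline would call the CLIP model here.
    """
    for negative, positive in prompt_sets:
        s_neg = score(image, negative)
        s_pos = score(image, positive)
        # Two-way softmax over the (negative, positive) pair
        p_pos = math.exp(s_pos) / (math.exp(s_pos) + math.exp(s_neg))
        if p_pos < threshold:
            return False  # fails this pass, so the image is filtered out
    return True
```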
  • Considering whether DeepL-based translation is necessary.
  • In Figure 2, changed the 1st filter from CLIP to OCR, and the 2nd filter to OCR accordingly.
  • Changed the label format from a simple tuple (as in the paper) to the provided COCO JSON-based format.

ToDo

  • Enhance code annotations.
  • Conduct additional testing.
  • Implement web crawling code.
  • Incorporate licensing considerations during crawling.
