Add preprocessing pipeline for NavOCR dataset#26

Draft
exceldra5 wants to merge 1 commit into main from preprocessor_v2


Conversation

@exceldra5
Collaborator

Goal

NavOCR preprocessing code

  • CLIP-based exterior filtering
  • OCR of all text in the filtered images
  • Text matching via Levenshtein distance (with a DeepL API translation fallback)
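The matching step above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `similarity` is a normalized Levenshtein ratio, the 0.5 threshold matches the value reported in the paper feedback, and `translate` is a stand-in for the DeepL API call used as a fallback.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Levenshtein ratio in [0, 1]; 1.0 means identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def match(query, ocr_texts, translate=None, threshold=0.5):
    """Return the OCR string best matching `query`, or None.

    `translate` is a placeholder for the DeepL API call, tried only
    when the untranslated query matches nothing above the threshold.
    """
    def best(q):
        cand = max(ocr_texts, key=lambda t: similarity(q, t), default=None)
        return cand if cand is not None and similarity(q, cand) >= threshold else None

    found = best(query)
    if found is None and translate is not None:
        found = best(translate(query))
    return found
```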

Environment

Branch / Commit: preprocessor_v2
Language / Framework: Python
Related Branches: crawl_v2, backup-private-ver

Code Update

  • Code updated based on the backup-private-ver branch
  • Uses the CLIP prompts from the backup-private-ver branch
  • Executable entry points written using the console_scripts approach from the main branch
  • Additional dependencies configured via extras_require in setup.py:
# 1. Install extras for preprocessing and translation
pip install -e ".[preprocess,translate]"

# 2. Install PaddleOCR (manual step — difficult to automate due to CUDA variations)
pip install paddlepaddle paddleocr        # Apple Silicon / CPU
# or
pip install paddlepaddle-gpu paddleocr    # NVIDIA GPU
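For context, the extras_require declaration behind step 1 might look roughly like the sketch below. The project name and package lists are illustrative assumptions, not the repository's actual dependency list.

```python
# setup.py (sketch) -- names below are illustrative assumptions
from setuptools import find_packages, setup

setup(
    name="navocr-preprocess",  # hypothetical project name
    packages=find_packages(),
    extras_require={
        # `pip install -e ".[preprocess,translate]"` pulls in both groups.
        # paddlepaddle/paddleocr are installed manually (CUDA-dependent),
        # so they are deliberately not listed here.
        "preprocess": ["pillow", "python-Levenshtein"],
        "translate": ["deepl"],
    },
)
```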
  • Environment check script:
python scripts/check_preprocess_env.py
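A minimal sketch of what such an environment check might do: verify that the required packages are importable and the needed environment variables are set. The module list and variable names here are assumptions, not the actual contents of check_preprocess_env.py.

```python
import importlib.util
import os

# Illustrative assumptions, not the script's actual checklist
PREPROCESS_MODULES = ["paddleocr", "paddle", "Levenshtein"]
TRANSLATE_ENV_VARS = ["DEEPL_API_KEY"]

def check_env(modules, env_vars):
    """Report which modules are not importable and which env vars are unset."""
    report = {"missing_modules": [], "missing_env": []}
    for name in modules:
        if importlib.util.find_spec(name) is None:
            report["missing_modules"].append(name)
    for var in env_vars:
        if not os.environ.get(var):
            report["missing_env"].append(var)
    return report

if __name__ == "__main__":
    print(check_env(PREPROCESS_MODULES, TRANSLATE_ENV_VARS))
```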

Paper Feedback

| No. | Item | Paper §3.1 | Code |
|---|---|---|---|
| 1 | CLIP prompt | Single prompt: "an exterior photo of the place or a photo that includes a signboard" | 3-pass cascade using 3 different prompt sets |
| 2 | CLIP threshold semantics | Single threshold: 0.7 | 3 passes × 0.7 (AND semantics) |
| 3 | Levenshtein threshold | 0.5 | 0.5 |
| 4 | DeepL translation fallback | None | Re-matching after Korean↔English translation |
| 5 | Number of images/queries | 20 images | 30 images |
| 6 | Label format | Simple tuple: ("file_name", (x₁,y₁,w₁,h₁), …, (xₙ,yₙ,wₙ,hₙ)) | COCO JSON (images, annotations, categories) |
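The label-format change in row 6 can be sketched as a conversion from the paper's simple tuples to a COCO-style JSON dict. Field choices beyond the images/annotations/categories skeleton, and the category name, are assumptions for illustration.

```python
def tuples_to_coco(samples, category_name="signboard_text"):
    """Convert ("file_name", (x, y, w, h), ...) tuples to a COCO-style dict.

    `category_name` is a hypothetical single category; the real dataset
    may define different categories.
    """
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": 1, "name": category_name}],
    }
    ann_id = 1
    for img_id, sample in enumerate(samples, 1):
        file_name, *boxes = sample
        coco["images"].append({"id": img_id, "file_name": file_name})
        for (x, y, w, h) in boxes:
            coco["annotations"].append({
                "id": ann_id,
                "image_id": img_id,
                "category_id": 1,
                "bbox": [x, y, w, h],  # COCO uses [x, y, width, height]
                "area": w * h,
                "iscrowd": 0,
            })
            ann_id += 1
    return coco
```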
  • The CLIP prompt was applied in a 3-stage cascade as described below, using a (negative, positive) prompt pair for classification at each stage.
# each entry is a (negative_prompt, positive_prompt) pair
PROMPT_SETS: list[tuple[str, str]] = [
	(
		"a photo that are not contain signboard or a photo of a signboard against a plain background",
		"a photo of a store exterior",
	),
	(
		"a photo that are not contain signboard or a photo of a signboard against a plain background",
		"a photo of a store exterior with a signboard and surrounding environment",
	),
	(
		"a online logo of a brand",
		"a photo of a store exterior with a signboard and surrounding environment",
	),
]
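A minimal sketch of the 3-pass AND-semantics filter built on such prompt pairs. Assumptions: the first prompt in each pair acts as the negative, the per-pass probability comes from a two-way softmax over the pair (as in CLIP zero-shot classification), and `score` is a stand-in for the CLIP image-text similarity that the real pipeline would compute.

```python
import math

def passes_cascade(image, prompt_sets, score, threshold=0.7):
    """Keep an image only if it clears every pass (AND semantics).

    `score(image, prompt)` is a placeholder for a CLIP image-text
    similarity logit; the real pipeline would call the CLIP model here.
    """
    for negative, positive in prompt_sets:
        s_neg = score(image, negative)
        s_pos = score(image, positive)
        # Two-way softmax over the (negative, positive) pair
        p_pos = math.exp(s_pos) / (math.exp(s_pos) + math.exp(s_neg))
        if p_pos < threshold:
            return False  # fails this pass, so the image is filtered out
    return True
```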
  • Considering whether DeepL-based translation is necessary.
  • In Figure 2, changed the 1st filter from CLIP to OCR, and the 2nd filter to OCR accordingly.
  • Changed the label format from a simple tuple (as in the paper) to the provided COCO JSON-based format.

ToDo

  • Enhance code annotations.
  • Conduct additional testing.
  • Implement web crawling code.
  • Incorporate licensing considerations during crawling.
