[FEAT] marker PDF layout extraction job by priyankeshh · Pull Request #155 · Extralit/extralit

priyankeshh · 2025-09-14T14:06:49Z

Summary

This PR adds a Marker-based layout extraction job to extralit-server. It detects and extracts tables, figures, and text blocks from PDFs, with bounding boxes, without requiring full OCR processing.

✅ What Was Implemented

Core Features:

Async/Sync Job Support - Full RQ job queue integration with both async and sync wrappers
Multi-Element Extraction - Robust extraction of tables, figures, and text blocks with bounding box coordinates

Files Created/Modified:

1. Main Job Implementation (ocr_jobs.py)

async_marker_layout_job() - Main async job function with full RQ integration
marker_layout_job() - Synchronous wrapper for RQ compatibility
_call_marker_layout_detection() - Core Marker API integration using ConfigParser and PdfConverter
Comprehensive error handling and job metadata tracking
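
A minimal sketch of the async/sync pairing described above, assuming the synchronous wrapper simply drives the coroutine to completion. The names `async_layout_job` and `layout_job` and the returned fields are hypothetical stand-ins, not the functions in ocr_jobs.py, and the RQ decorators are left out.

```python
import asyncio
from typing import Any, Optional


async def async_layout_job(pdf_path: str, pages: Optional[str] = None) -> dict[str, Any]:
    """Hypothetical stand-in for the async job; returns an empty layout result."""
    await asyncio.sleep(0)  # placeholder for awaited I/O in a real job
    return {
        "tables": [],
        "figures": [],
        "text_blocks": [],
        "metadata": {"source": "marker", "pdf_path": pdf_path, "pages": pages},
    }


def layout_job(pdf_path: str, pages: Optional[str] = None) -> dict[str, Any]:
    """Synchronous entry point that a queue worker can call directly."""
    # asyncio.run creates a fresh event loop, runs the coroutine, and closes the loop.
    return asyncio.run(async_layout_job(pdf_path, pages))
```

Keeping the logic in the coroutine and the wrapper thin means both entry points share one code path.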

2. Enhanced Utility Functions

tables.py - Robust table extraction with multiple format support
figures.py - Comprehensive figure extraction with enhanced validation
text.py - Advanced text block extraction with subtype detection

Key Technical Features:

Marker Integration:

Uses PdfConverter with ConfigParser for optimal configuration
Supports create_model_dict() for proper model initialization
Handles page-range processing for large documents
Markdown output format for reliable block detection

Robust Data Processing:

Multiple naming convention support (bbox, coordinates, bounding_box, etc.)
Various data format handling (list, dict, nested structures)
Block type detection with extensive keyword matching
Confidence score extraction with fallbacks
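
As an illustration of the naming-convention and fallback handling listed above, here is a small sketch. The key order, the `x1`/`y1`/`x2`/`y2` dict layout, and the helper names `normalize_bbox` and `get_score` are assumptions for this example, not the code in the utility modules.

```python
from typing import Any, Optional

# Candidate keys for a bounding box, tried in order.
BBOX_KEYS = ("bbox", "coordinates", "bounding_box")


def normalize_bbox(block: dict[str, Any]) -> Optional[list[float]]:
    """Return [x1, y1, x2, y2] from whichever supported key and shape is present."""
    for key in BBOX_KEYS:
        value = block.get(key)
        if value is None:
            continue
        try:
            if isinstance(value, dict):
                # Assumed dict layout: {"x1": ..., "y1": ..., "x2": ..., "y2": ...}
                return [float(value[k]) for k in ("x1", "y1", "x2", "y2")]
            if isinstance(value, (list, tuple)) and len(value) == 4:
                return [float(v) for v in value]
        except (KeyError, TypeError, ValueError):
            continue  # malformed value under this key; try the next one
    return None


def get_score(block: dict[str, Any], default: float = 1.0) -> float:
    """Read a confidence score from 'score' or 'confidence', else use the default."""
    for key in ("score", "confidence"):
        value = block.get(key)
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return float(value)
    return default
```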

🚀 Usage Example

# Extract layout from entire document
result = marker_layout_job("document.pdf", extract_text=True)

# Extract layout from specific pages
result = marker_layout_job("document.pdf", pages=[0, 1, 2])

# Result structure
{
    "tables": [{"page": 0, "bbox": [x1, y1, x2, y2], "score": 0.95, ...}],
    "figures": [{"page": 1, "bbox": [x1, y1, x2, y2], "caption": "...", ...}],
    "text_blocks": [{"page": 0, "bbox": [x1, y1, x2, y2], "content": "...", ...}],
    "metadata": {"source": "marker", "total_elements": 15, ...}
}

🔄 Future Enhancement Opportunities

API endpoint exposure for HTTP access
Batch processing support for multiple documents
Performance metrics and monitoring integration
Advanced configuration options for model tuning

- Completely replaced _call_marker_layout_detection with real marker-pdf integration
- Removed _get_mock_marker_output function in favor of actual API calls
- Updated utility functions to handle real Marker block types:
  - Tables: now handles 'table' and 'tablegroup' types
  - Figures: expanded to include 'picturegroup', 'figuregroup' types
  - Text: added support for 'sectionheader', 'textinlinemath', 'listitem', etc.
- Enhanced metadata extraction with polygon support
- Added marker-pdf as optional dependency in pyproject.toml
- Implements proper JSON output format with hierarchical block structure
- Maintains backward compatibility with existing job system and RQ decorators
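
The block-type routing named in this commit could be sketched roughly as follows. The type names 'table', 'tablegroup', 'picturegroup', 'figuregroup', 'sectionheader', 'textinlinemath', and 'listitem' come from the commit message; the base types 'figure', 'picture', and 'text', the "other" bucket, and the function names are assumptions.

```python
from typing import Any

TABLE_TYPES = {"table", "tablegroup"}
# 'figure' and 'picture' are assumed base types alongside the group variants.
FIGURE_TYPES = {"figure", "picture", "figuregroup", "picturegroup"}
# 'text' is an assumed base type; the commit message ends its list with "etc."
TEXT_TYPES = {"text", "sectionheader", "textinlinemath", "listitem"}


def categorize(block_type: str) -> str:
    """Map a block type string to the result key it belongs under."""
    key = block_type.strip().lower()
    if key in TABLE_TYPES:
        return "tables"
    if key in FIGURE_TYPES:
        return "figures"
    if key in TEXT_TYPES:
        return "text_blocks"
    return "other"


def group_blocks(blocks: list[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Split a flat list of blocks into tables, figures, text blocks, and the rest."""
    grouped: dict[str, list[dict[str, Any]]] = {
        "tables": [], "figures": [], "text_blocks": [], "other": [],
    }
    for block in blocks:
        grouped[categorize(str(block.get("type", "")))].append(block)
    return grouped
```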

priyankeshh · 2025-09-16T13:07:44Z

I get this in the pdm run worker terminal when I try to queue an RQ worker with marker layout:

"D:\Extralit-gsoc\extralit\extralit-server\src\extralit_server\jobs\ocr_jobs.py", line 87, in async_marker_layout_job
    layout_result = _call_marker_layout_detection(str(pdf_path), pages)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Extralit-gsoc\extralit\extralit-server\src\extralit_server\jobs\ocr_jobs.py", line 213, in _call_marker_layout_detection
    raise ImportError("Marker not installed. Install with: pip install marker-pdf") from e
ImportError: Marker not installed. Install with: pip install marker-pdf
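
The traceback shows the job re-raising an ImportError with a fixed message. One possible way to make such a guard easier to debug is to carry the underlying error text in the message, so it is clearer whether the package is missing from the worker's environment or one of its own imports failed. This is only a sketch; `require_optional` is a hypothetical helper, not part of ocr_jobs.py.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, pip_name: str) -> ModuleType:
    """Import an optional dependency, keeping the original failure reason visible."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Include the original message and chain the exception with `from`.
        raise ImportError(
            f"Could not import {module_name!r} ({exc}). "
            f"If it is missing, install with: pip install {pip_name}"
        ) from exc
```

Running the same import by hand inside the environment the worker uses is a quick way to check which case applies.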

- Updated existing dependencies: pillow (to >=10.1.0), huggingface-hub (to >=0.30.0), ocrmypdf (to >=16.11.0), marker-pdf (to >=1.9.2), torch (to >=2.5.0), torchvision (to >=0.20.0), and transformers (to >=4.51.0).

- Updated `huggingface-hub` to version 0.34.0 in `pyproject.toml`.
- Downgraded `opencv-python-headless` to version 4.11.0.86 in `pyproject.toml`.
- Updated `marker-pdf` dependency to version 1.9.3 in `pyproject.toml`.

- Changed the type of `pages` parameter in `async_marker_layout_job` and `_call_marker_layout_detection` from `Optional[list[int]]` to `Optional[str]` to accept comma-separated page numbers.
- Updated the documentation to reflect the new format for the `pages` parameter.
- Enhanced error logging in the `async_marker_layout_job` to include exception traceback.
- Removed commented-out code related to job metadata updates for clarity.
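
For callers that still hold a list of page numbers, or that want to validate the new string form before queuing a job, a small parser along these lines may help. It is a sketch only: `parse_pages` is a hypothetical helper, and it assumes plain comma-separated non-negative integers with no range syntax.

```python
from typing import Optional


def parse_pages(pages: Optional[str]) -> Optional[list[int]]:
    """Turn a string such as "0, 1,2" into a sorted list of unique page numbers."""
    if pages is None or not pages.strip():
        return None  # no restriction: process the whole document
    numbers: set[int] = set()
    for part in pages.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,2"
        if not part.isdecimal():
            raise ValueError(f"Invalid page number: {part!r}")
        numbers.add(int(part))
    return sorted(numbers)


def format_pages(pages: list[int]) -> str:
    """Inverse helper: join page numbers back into the comma-separated form."""
    return ",".join(str(p) for p in pages)
```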

- Changed the output format in `_call_marker_layout_detection` from "markdown" to "json".
- Updated type hint for the result variable to include `JSONOutput`.
- Imported `JSONOutput` from the marker renderers for compatibility.

JonnyTran · 2025-09-16T20:53:52Z

+        # Create optimized configuration for layout detection
+        config_dict = {
+            "output_format": "json",
+            "parallel_factor": 1,
+        }
+
+        if pages is not None:
+            config_dict["page_range"] = pages
+
+        config = ConfigParser(config_dict)
+        model_dict = create_model_dict()
+
+        converter = PdfConverter(
+            config=config.generate_config_dict(),
+            artifact_dict=model_dict,
+        )
+
+        # Convert PDF - this will return a Document object with detected layout
+        result: "MarkdownOutput | JSONOutput" = converter(pdf_path)  # noqa: UP037
+        print(type(result))
+        pprint(result.model_dump())
+


@priyankeshh To keep marker from using the OCR models, you will have to edit the model_dict and config_dict here.

JonnyTran · 2025-09-16T20:55:23Z

+        layout_data = {"pages": []}
+
+        if hasattr(result, "pages") and result.pages:
+            for page_idx, page in enumerate(result.pages):
+                page_data = {"page": page_idx, "blocks": []}
+
+                # Extract blocks from the page
+                if hasattr(page, "blocks") and page.blocks:
+                    for block in page.blocks:
+                        if hasattr(block, "block_type") and hasattr(block, "bbox"):
+                            block_data = {
+                                "type": str(block.block_type).lower(),
+                                "bbox": list(block.bbox) if block.bbox else [],
+                                "id": str(getattr(block, "id", "")),
+                                "score": getattr(block, "confidence", 1.0),
+                            }
+
+                            # Add content based on block type
+                            if hasattr(block, "content"):
+                                block_data["content"] = str(block.content)
+                            elif hasattr(block, "text"):
+                                block_data["content"] = str(block.text)
+
+                            page_data["blocks"].append(block_data)
+
+                layout_data["pages"].append(page_data)
+
+        return layout_data
+


Parsing the layout JSON from the marker output is currently broken, since the result is being returned as a MarkdownOutput, not a JSONOutput. You may have to dig a bit to see how to do this, since marker doesn't really have documentation for it.
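
Once the converter does return JSON, one way to parse it is to walk the block tree instead of reading `result.pages` directly. The sketch below works on a plain dict (for example, what `result.model_dump()` might produce). The field names `block_type`, `bbox`, `id`, and `children`, and the Document > Page > block nesting, are assumptions about that output and should be checked against a real dump.

```python
from typing import Any


def flatten_layout(root: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten a nested block tree into a list of blocks tagged with a page index."""
    blocks: list[dict[str, Any]] = []
    page_idx = -1
    stack = [root]
    while stack:
        node = stack.pop()
        block_type = str(node.get("block_type", "")).lower()
        if block_type == "page":
            page_idx += 1  # pages are visited in document order
        elif block_type not in ("", "document"):
            blocks.append({
                "page": page_idx,
                "type": block_type,
                "bbox": list(node.get("bbox") or []),
                "id": str(node.get("id", "")),
            })
        # Push children reversed so the first child is popped first (pre-order walk).
        stack.extend(reversed(node.get("children") or []))
    return blocks
```

An explicit stack avoids recursion limits on deeply nested documents.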

JonnyTran · 2025-09-16T20:57:58Z

+        pdf_path = Path(pdf_path)
+        if not pdf_path.exists():
+            raise FileNotFoundError(f"PDF file not found: {pdf_path}")
+
+        _LOGGER.info(f"Starting Marker layout extraction for: {pdf_path}")
+
+        # Call Marker's layout detection
+        # Use the real PdfConverter API for layout detection
+        try:
+            # Process the PDF and get layout information
+            layout_result = _call_marker_layout_detection(str(pdf_path), pages)
+        except Exception as e:


Passing in pdf_path is fine for now, as it makes testing and development easier, but later on it will need to use the s3_url to download the PDF as bytes and pass those bytes to marker instead of a local file path.
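
A rough sketch of that later step, under stated assumptions: `fetch` is a hypothetical callable that returns the object's bytes for a URL (the real download client is not shown), and whether the converter accepts an in-memory stream rather than a path needs to be confirmed against marker itself.

```python
import io
from typing import Callable

PDF_MAGIC = b"%PDF-"


def load_pdf_stream(s3_url: str, fetch: Callable[[str], bytes]) -> io.BytesIO:
    """Download a PDF via `fetch` and wrap it in an in-memory binary stream."""
    data = fetch(s3_url)
    # Simple sanity check on the header before handing the bytes on.
    if not data.startswith(PDF_MAGIC):
        raise ValueError(f"Object at {s3_url} does not look like a PDF")
    return io.BytesIO(data)
```

Injecting `fetch` keeps the function testable without network access.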

… detection
- Implement three-function structure: create_marker_config(), run_marker(), parse_marker_output()
- Fix marker to return JSONOutput instead of MarkdownOutput using ConfigParser
- Optimize configuration for layout detection workflow
- Successfully extract layout data with proper JSON structure

priyankeshh and others added 4 commits September 14, 2025 18:12

cleaned up some code

83cf490

minor issues

047b292

Merge branch 'develop' into feat/marker-layout-extraction

7f3181d

JonnyTran added 2 commits September 16, 2025 11:40

Update dependencies in pdm.lock and pyproject.toml

6474ecf

Merge branch 'develop' into feat/marker-layout-extraction

bae8c2e

JonnyTran marked this pull request as ready for review September 16, 2025 18:48

JonnyTran requested a review from a team as a code owner September 16, 2025 18:48

JonnyTran added 4 commits September 16, 2025 12:25

fixes

d0af822

Update dependencies

1bdbf67

Update OCR job output format to JSON

4ad6358

JonnyTran approved these changes Sep 16, 2025

Comment thread extralit-server/src/extralit_server/jobs/ocr_jobs.py Outdated

fix tests

a7c614d

JonnyTran reviewed Sep 16, 2025

priyankeshh changed the title ~~Added marker layout extraction job~~ [WIP] Added marker layout extraction job Sep 18, 2025

JonnyTran marked this pull request as draft September 19, 2025 19:22

JonnyTran added 2 commits September 19, 2025 13:36

refactoring

ef78863

refactoring

2e36b8c

JonnyTran marked this pull request as ready for review September 21, 2025 20:01

Merge branch 'develop' into feat/marker-layout-extraction

756a2b8

JonnyTran changed the title ~~[WIP] Added marker layout extraction job~~ [FEAT] marker PDF layout extraction job Sep 21, 2025

updated typer>=0.19.0

48ae7fa

JonnyTran merged commit 77c5213 into develop Sep 23, 2025
3 checks passed

JonnyTran mentioned this pull request Mar 27, 2026

Sparshr04/GitHub copilot auth UI #201

Open

12 tasks

JonnyTran merged 16 commits into develop from feat/marker-layout-extraction


2 participants
