
[FEAT] marker PDF layout extraction job#155

Merged: JonnyTran merged 16 commits into develop from feat/marker-layout-extraction on Sep 23, 2025

@priyankeshh
Contributor

Summary

I have implemented a Marker layout extraction job for the extralit-server. It provides PDF layout detection that identifies and extracts tables, figures, and text blocks without requiring full OCR processing.

What Was Implemented

Core Features:

  1. Async/Sync Job Support - Full RQ job queue integration with both async and sync wrappers
  2. Multi-Element Extraction - Robust extraction of tables, figures, and text blocks with bounding box coordinates

Files Created/Modified:

1. Main Job Implementation (ocr_jobs.py)

  • async_marker_layout_job() - Main async job function with full RQ integration
  • marker_layout_job() - Synchronous wrapper for RQ compatibility
  • _call_marker_layout_detection() - Core Marker API integration using ConfigParser and PdfConverter
  • Comprehensive error handling and job metadata tracking

2. Enhanced Utility Functions

  • tables.py - Robust table extraction with multiple format support
  • figures.py - Comprehensive figure extraction with enhanced validation
  • text.py - Advanced text block extraction with subtype detection

Key Technical Features:

Marker Integration:

  • Uses PdfConverter with ConfigParser for optimal configuration
  • Supports create_model_dict() for proper model initialization
  • Handles page-range processing for large documents
  • Markdown output format for reliable block detection

Robust Data Processing:

  • Multiple naming convention support (bbox, coordinates, bounding_box, etc.)
  • Various data format handling (list, dict, nested structures)
  • Block type detection with extensive keyword matching
  • Confidence score extraction with fallbacks
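The "Robust Data Processing" points above can be illustrated with a minimal sketch. It is not the code in tables.py, figures.py, or text.py; the helper names `extract_bbox` and `extract_score`, the dict-style keys `x1`/`y1`/`x2`/`y2`, and the `confidence` fallback key are assumptions made for illustration.

```python
from typing import Any, Optional

# Key names taken from the list above ("bbox, coordinates, bounding_box, etc.").
BBOX_KEYS = ("bbox", "coordinates", "bounding_box")


def extract_bbox(block: dict[str, Any]) -> Optional[list[float]]:
    """Return [x1, y1, x2, y2] from the first usable key, or None."""
    for key in BBOX_KEYS:
        value = block.get(key)
        if value is None:
            continue
        if isinstance(value, dict):
            # Assumed dict form: {"x1": ..., "y1": ..., "x2": ..., "y2": ...}
            try:
                return [float(value[k]) for k in ("x1", "y1", "x2", "y2")]
            except KeyError:
                continue
        if isinstance(value, (list, tuple)) and len(value) == 4:
            return [float(v) for v in value]
    return None


def extract_score(block: dict[str, Any], default: float = 1.0) -> float:
    """Return a confidence score, falling back to `default` when absent."""
    for key in ("score", "confidence"):
        if block.get(key) is not None:
            return float(block[key])
    return default
```

Returning `None` for an unrecognised shape lets the caller decide whether to skip the block or keep it without coordinates.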

🚀 Usage Example

# Extract layout from entire document
result = marker_layout_job("document.pdf", extract_text=True)

# Extract layout from specific pages
result = marker_layout_job("document.pdf", pages=[0, 1, 2])

# Result structure
{
    "tables": [{"page": 0, "bbox": [x1, y1, x2, y2], "score": 0.95, ...}],
    "figures": [{"page": 1, "bbox": [x1, y1, x2, y2], "caption": "...", ...}],
    "text_blocks": [{"page": 0, "bbox": [x1, y1, x2, y2], "content": "...", ...}],
    "metadata": {"source": "marker", "total_elements": 15, ...}
}

🔄 Future Enhancement Opportunities

  • API endpoint exposure for HTTP access
  • Batch processing support for multiple documents
  • Performance metrics and monitoring integration
  • Advanced configuration options for model tuning

priyankeshh and others added 4 commits September 14, 2025 18:12
- Completely replaced _call_marker_layout_detection with real marker-pdf integration
- Removed _get_mock_marker_output function in favor of actual API calls
- Updated utility functions to handle real Marker block types:
  - Tables: now handles 'table' and 'tablegroup' types
  - Figures: expanded to include 'picturegroup', 'figuregroup' types
  - Text: added support for 'sectionheader', 'textinlinemath', 'listitem', etc.
- Enhanced metadata extraction with polygon support
- Added marker-pdf as optional dependency in pyproject.toml
- Implements proper JSON output format with hierarchical block structure
- Maintains backward compatibility with existing job system and RQ decorators
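As a rough illustration of the block-type handling described in this commit, the sketch below groups blocks into the three result lists. The type names `table`, `tablegroup`, `picturegroup`, `figuregroup`, `sectionheader`, `textinlinemath`, and `listitem` come from the commit message; the base names `figure`, `picture`, and `text`, and the helper names `categorize` and `group_blocks`, are assumptions.

```python
from typing import Any, Optional

TABLE_TYPES = {"table", "tablegroup"}
# "figure" and "picture" are assumed base types, not listed in the commit.
FIGURE_TYPES = {"figure", "picture", "picturegroup", "figuregroup"}
# "text" is an assumed base type; the commit lists the others followed by "etc.".
TEXT_TYPES = {"text", "sectionheader", "textinlinemath", "listitem"}


def categorize(block_type: str) -> Optional[str]:
    """Map a block type to a result key, or None if it is not handled."""
    name = block_type.lower()
    if name in TABLE_TYPES:
        return "tables"
    if name in FIGURE_TYPES:
        return "figures"
    if name in TEXT_TYPES:
        return "text_blocks"
    return None


def group_blocks(blocks: list[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Split a flat block list into tables, figures, and text blocks."""
    grouped: dict[str, list[dict[str, Any]]] = {
        "tables": [],
        "figures": [],
        "text_blocks": [],
    }
    for block in blocks:
        key = categorize(str(block.get("type", "")))
        if key is not None:
            grouped[key].append(block)
    return grouped
```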
@priyankeshh
Contributor Author

I am getting this in the pdm run worker terminal when I try to queue an RQ worker with marker layout:

"D:\Extralit-gsoc\extralit\extralit-server\src\extralit_server\jobs\ocr_jobs.py", line 87, in async_marker_layout_job
    layout_result = _call_marker_layout_detection(str(pdf_path), pages)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Extralit-gsoc\extralit\extralit-server\src\extralit_server\jobs\ocr_jobs.py", line 213, in _call_marker_layout_detection
    raise ImportError("Marker not installed. Install with: pip install marker-pdf") from e
ImportError: Marker not installed. Install with: pip install marker-pdf
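The traceback shows the job re-raising an ImportError with an install hint when the import fails inside the worker process. A minimal sketch of that guard pattern follows; `require_optional` is a hypothetical helper, not a function from ocr_jobs.py. One possible cause of this error is that the worker runs in an environment where the optional dependency was never installed, which the same helper makes easy to check.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, pip_name: str) -> ModuleType:
    """Import an optional dependency or raise ImportError with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        # Chain the original error so the real import failure stays visible.
        raise ImportError(
            f"{module_name} not installed. Install with: pip install {pip_name}"
        ) from e
```

Calling `require_optional("marker", "marker-pdf")` from the same interpreter the worker uses would show whether the package is importable there; the module name `marker` is assumed from the `marker-pdf` package name.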

- Updated existing dependencies: pillow (to >=10.1.0), huggingface-hub (to >=0.30.0), ocrmypdf (to >=16.11.0), marker-pdf (to >=1.9.2), torch (to >=2.5.0), torchvision (to >=0.20.0), and transformers (to >=4.51.0).
@JonnyTran JonnyTran marked this pull request as ready for review September 16, 2025 18:48
@JonnyTran JonnyTran requested a review from a team as a code owner September 16, 2025 18:48
- Updated `huggingface-hub` to version 0.34.0 in `pyproject.toml`.
- Downgraded `opencv-python-headless` to version 4.11.0.86 in `pyproject.toml`.
- Updated `marker-pdf` dependency to version 1.9.3 in `pyproject.toml`.
- Changed the type of `pages` parameter in `async_marker_layout_job` and `_call_marker_layout_detection` from `Optional[list[int]]` to `Optional[str]` to accept comma-separated page numbers.
- Updated the documentation to reflect the new format for the `pages` parameter.
- Enhanced error logging in the `async_marker_layout_job` to include exception traceback.
- Removed commented-out code related to job metadata updates for clarity.
- Changed the output format in `_call_marker_layout_detection` from "markdown" to "json".
- Updated type hint for the result variable to include `JSONOutput`.
- Imported `JSONOutput` from the marker renderers for compatibility.
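Since `pages` is now a comma-separated string, it can be useful to validate it before handing it to the converter. The sketch below is a hypothetical helper (`parse_pages` is not part of this PR); support for ranges such as `5-7` is an assumption and may not match what the job accepts.

```python
from typing import Optional


def parse_pages(pages: Optional[str]) -> Optional[list[int]]:
    """Parse "0,1,2" (or, by assumption, "0,5-7") into sorted page indices."""
    if pages is None or not pages.strip():
        return None
    result: set[int] = set()
    for part in pages.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # Assumed inclusive range syntax, e.g. "5-7" -> 5, 6, 7
            start, end = part.split("-", 1)
            result.update(range(int(start), int(end) + 1))
        else:
            result.add(int(part))
    return sorted(result)
```

Malformed input such as `"a,b"` raises ValueError from `int()`, so a bad value fails before any model is loaded.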
Comment thread extralit-server/src/extralit_server/jobs/ocr_jobs.py Outdated
Comment on lines +164 to +185
# Create optimized configuration for layout detection
config_dict = {
    "output_format": "json",
    "parallel_factor": 1,
}

if pages is not None:
    config_dict["page_range"] = pages

config = ConfigParser(config_dict)
model_dict = create_model_dict()

converter = PdfConverter(
    config=config.generate_config_dict(),
    artifact_dict=model_dict,
)

# Convert PDF - this will return a Document object with detected layout
result: "MarkdownOutput | JSONOutput" = converter(pdf_path)  # noqa: UP037
print(type(result))
pprint(result.model_dump())

Member


@priyankeshh To keep marker from using the OCR models, you will have to edit the model_dict and config_dict here.

Comment on lines +188 to +216
layout_data = {"pages": []}

if hasattr(result, "pages") and result.pages:
    for page_idx, page in enumerate(result.pages):
        page_data = {"page": page_idx, "blocks": []}

        # Extract blocks from the page
        if hasattr(page, "blocks") and page.blocks:
            for block in page.blocks:
                if hasattr(block, "block_type") and hasattr(block, "bbox"):
                    block_data = {
                        "type": str(block.block_type).lower(),
                        "bbox": list(block.bbox) if block.bbox else [],
                        "id": str(getattr(block, "id", "")),
                        "score": getattr(block, "confidence", 1.0),
                    }

                    # Add content based on block type
                    if hasattr(block, "content"):
                        block_data["content"] = str(block.content)
                    elif hasattr(block, "text"):
                        block_data["content"] = str(block.text)

                    page_data["blocks"].append(block_data)

        layout_data["pages"].append(page_data)

return layout_data

Member


Parsing the layout JSON from the marker output is currently broken, since the result is being returned as a MarkdownOutput, not a JSONOutput. You may have to dig a bit to see how to do this, since marker doesn't really have documentation for it.
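For reference, a minimal sketch of walking a hierarchical JSON block tree is shown below. It works on a plain dict of the kind `result.model_dump()` might return and does not call marker. The keys `block_type`, `bbox`, `id`, and `children`, the `page` block type, and the helper name `flatten_blocks` are assumptions about the output shape, not confirmed in this thread.

```python
from typing import Any, Optional


def flatten_blocks(
    node: dict[str, Any],
    page: Optional[int] = None,
    out: Optional[list[dict[str, Any]]] = None,
) -> list[dict[str, Any]]:
    """Depth-first walk that flattens nested blocks into one list."""
    if out is None:
        out = []
    for index, child in enumerate(node.get("children") or []):
        block_type = str(child.get("block_type", "")).lower()
        if block_type == "page":
            # Assumed: pages are siblings in document order, so the sibling
            # index stands in for the page number (not true with a page range).
            flatten_blocks(child, page=index, out=out)
            continue
        out.append(
            {
                "page": page,
                "type": block_type,
                "bbox": list(child.get("bbox") or []),
                "id": str(child.get("id", "")),
            }
        )
        # Group blocks (e.g. a table group) may contain further blocks.
        flatten_blocks(child, page=page, out=out)
    return out
```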

Comment on lines +86 to +97
pdf_path = Path(pdf_path)
if not pdf_path.exists():
    raise FileNotFoundError(f"PDF file not found: {pdf_path}")

_LOGGER.info(f"Starting Marker layout extraction for: {pdf_path}")

# Call Marker's layout detection
# Use the real PdfConverter API for layout detection
try:
    # Process the PDF and get layout information
    layout_result = _call_marker_layout_detection(str(pdf_path), pages)
except Exception as e:
Member


Passing in the pdf_path is fine for now, as it makes the job easier to test and develop, but later on it will need to use an s3_url to download the PDF as bytes, then pass those bytes to marker instead of a local file path.
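One way to bridge downloaded bytes to a path-based call in the meantime is a temporary file. The sketch below is an assumption about how this could be done, not the planned implementation: `pdf_bytes_as_path` is a hypothetical helper, the download step is left out, and whether marker can take bytes directly is not settled in this thread.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def pdf_bytes_as_path(data: bytes, suffix: str = ".pdf") -> Iterator[str]:
    """Write PDF bytes to a temporary file and yield its path.

    The file is deleted when the context exits, even if the body raises.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
        yield path
    finally:
        os.remove(path)
```

Inside the `with` block, the yielded path could be passed wherever the job currently expects `pdf_path`.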

@priyankeshh priyankeshh changed the title from "Added marker layout extraction job" to "[WIP] Added marker layout extraction job" Sep 18, 2025
… detection

- Implement three-function structure: create_marker_config(), run_marker(), parse_marker_output()
- Fix marker to return JSONOutput instead of MarkdownOutput using ConfigParser
- Optimize configuration for layout detection workflow
- Successfully extract layout data with proper JSON structure
@JonnyTran JonnyTran marked this pull request as draft September 19, 2025 19:22
@JonnyTran JonnyTran marked this pull request as ready for review September 21, 2025 20:01
@JonnyTran JonnyTran changed the title from "[WIP] Added marker layout extraction job" to "[FEAT] marker PDF layout extraction job" Sep 21, 2025
@JonnyTran JonnyTran merged commit 77c5213 into develop Sep 23, 2025
3 checks passed
@JonnyTran JonnyTran mentioned this pull request Mar 27, 2026
12 tasks
2 participants