[FEAT] marker PDF layout extraction job #155
- Completely replaced _call_marker_layout_detection with real marker-pdf integration
- Removed _get_mock_marker_output function in favor of actual API calls
- Updated utility functions to handle real Marker block types:
  - Tables: now handles 'table' and 'tablegroup' types
  - Figures: expanded to include 'picturegroup', 'figuregroup' types
  - Text: added support for 'sectionheader', 'textinlinemath', 'listitem', etc.
- Enhanced metadata extraction with polygon support
- Added marker-pdf as optional dependency in pyproject.toml
- Implements proper JSON output format with hierarchical block structure
- Maintains backward compatibility with existing job system and RQ decorators
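The block-type grouping described in this commit can be sketched as a small lookup. This is only an illustration, not the PR's utility code: `classify_block` and the three set names are hypothetical, the base types `figure`, `picture`, and `text` are assumed (the commit names only the group variants and ends its text list with "etc."), and the handling of an enum-style prefix such as `BlockTypes.Table` is a defensive assumption.

```python
from typing import Literal

Category = Literal["table", "figure", "text", "other"]

# Type names taken from the commit description.
TABLE_TYPES = {"table", "tablegroup"}
# ASSUMPTION: "figure" and "picture" are included as base types;
# the description only names the two group variants.
FIGURE_TYPES = {"figure", "picture", "figuregroup", "picturegroup"}
# ASSUMPTION: "text" is included as a base type; the description's
# list ends with "etc.", so this set is likely incomplete.
TEXT_TYPES = {"text", "sectionheader", "textinlinemath", "listitem"}


def classify_block(block_type: str) -> Category:
    """Map a block type name to a coarse category (hypothetical helper)."""
    # Drop an enum-style prefix ("BlockTypes.Table" -> "Table"), then normalize case.
    key = block_type.rsplit(".", 1)[-1].strip().lower()
    if key in TABLE_TYPES:
        return "table"
    if key in FIGURE_TYPES:
        return "figure"
    if key in TEXT_TYPES:
        return "text"
    return "other"
```

Keeping the sets at module level makes it easy to extend them as more block types turn up in real output.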
I am getting this in |
- Updated existing dependencies: pillow (to >=10.1.0), huggingface-hub (to >=0.30.0), ocrmypdf (to >=16.11.0), marker-pdf (to >=1.9.2), torch (to >=2.5.0), torchvision (to >=0.20.0), and transformers (to >=4.51.0).
- Updated `huggingface-hub` to version 0.34.0 in `pyproject.toml`.
- Downgraded `opencv-python-headless` to version 4.11.0.86 in `pyproject.toml`.
- Updated `marker-pdf` dependency to version 1.9.3 in `pyproject.toml`.
- Changed the type of `pages` parameter in `async_marker_layout_job` and `_call_marker_layout_detection` from `Optional[list[int]]` to `Optional[str]` to accept comma-separated page numbers.
- Updated the documentation to reflect the new format for the `pages` parameter.
- Enhanced error logging in the `async_marker_layout_job` to include exception traceback.
- Removed commented-out code related to job metadata updates for clarity.
- Changed the output format in `_call_marker_layout_detection` from "markdown" to "json".
- Updated type hint for the result variable to include `JSONOutput`.
- Imported `JSONOutput` from the marker renderers for compatibility.
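Callers that still hold page numbers as a list can bridge to the comma-separated `pages` string introduced in the commits above with a small adapter. The helper below is a hypothetical sketch, not part of the PR; whether page numbers are zero- or one-based is not stated here, so the function only rejects negative values.

```python
from collections.abc import Iterable
from typing import Optional


def pages_to_page_range(pages: Optional[Iterable[int]]) -> Optional[str]:
    """Turn page numbers into a comma-separated string (hypothetical helper).

    Returns None when no pages are given, meaning "process every page".
    """
    if pages is None:
        return None
    # Deduplicate and sort so the resulting string is stable.
    unique_sorted = sorted(set(pages))
    if not unique_sorted:
        return None
    if unique_sorted[0] < 0:
        raise ValueError("page numbers must be non-negative")
    return ",".join(str(page) for page in unique_sorted)
```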
# Create optimized configuration for layout detection
config_dict = {
    "output_format": "json",
    "parallel_factor": 1,
}

if pages is not None:
    config_dict["page_range"] = pages

config = ConfigParser(config_dict)
model_dict = create_model_dict()

converter = PdfConverter(
    config=config.generate_config_dict(),
    artifact_dict=model_dict,
)

# Convert PDF - this will return a Document object with detected layout
result: "MarkdownOutput | JSONOutput" = converter(pdf_path)  # noqa: UP037
print(type(result))
pprint(result.model_dump())

@priyankeshh In order to make marker not use the OCR models, you will have to edit the model_dict and config_dict here
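One way to make the configuration easy to adjust, as this comment asks, is to build `config_dict` in a single helper that accepts overrides. This is a minimal sketch under assumptions: `build_layout_config` is a hypothetical name, and the `disable_ocr` key in the usage line is an unverified guess at an option name that must be checked against the installed Marker version. Trimming `model_dict` is not shown, since which entries the converter requires is not covered in this thread.

```python
from typing import Any, Optional


def build_layout_config(
    pages: Optional[str] = None,
    overrides: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Build the config dict used for layout detection (hypothetical helper)."""
    # Base values copied from the diff under review.
    config: dict[str, Any] = {"output_format": "json", "parallel_factor": 1}
    if pages is not None:
        config["page_range"] = pages
    # Caller-supplied keys win over the defaults.
    if overrides:
        config.update(overrides)
    return config


# ASSUMPTION: "disable_ocr" is not confirmed by this PR; verify the option
# name against the Marker version in use before relying on it.
config_dict = build_layout_config(pages="0,2", overrides={"disable_ocr": True})
```

The resulting dict would be passed to `ConfigParser` exactly as in the diff, so OCR-related switches can be changed without editing the conversion code.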
layout_data = {"pages": []}

if hasattr(result, "pages") and result.pages:
    for page_idx, page in enumerate(result.pages):
        page_data = {"page": page_idx, "blocks": []}

        # Extract blocks from the page
        if hasattr(page, "blocks") and page.blocks:
            for block in page.blocks:
                if hasattr(block, "block_type") and hasattr(block, "bbox"):
                    block_data = {
                        "type": str(block.block_type).lower(),
                        "bbox": list(block.bbox) if block.bbox else [],
                        "id": str(getattr(block, "id", "")),
                        "score": getattr(block, "confidence", 1.0),
                    }

                    # Add content based on block type
                    if hasattr(block, "content"):
                        block_data["content"] = str(block.content)
                    elif hasattr(block, "text"):
                        block_data["content"] = str(block.text)

                    page_data["blocks"].append(block_data)

        layout_data["pages"].append(page_data)

return layout_data

Parsing the layout JSON from Marker's output is currently broken because `result` is returned as a `MarkdownOutput`, not a `JSONOutput`. You may have to dig a bit to work out how to do this, since Marker doesn't really have documentation for it.
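Once the converter does return JSON output, one possible parsing approach is to walk the dictionary produced by `result.model_dump()` rather than probing attributes with `hasattr`. The sketch below assumes a nested shape in which the document has a `children` list of pages and every block carries `block_type`, `bbox`, `polygon`, `id`, and its own `children`; that shape is an assumption to verify against real output, and `parse_layout` and `_collect` are hypothetical names.

```python
from typing import Any


def _collect(children: list[dict[str, Any]], out: list[dict[str, Any]]) -> None:
    """Append each block, then its descendants, to `out` (depth-first)."""
    for child in children:
        out.append(
            {
                "type": str(child.get("block_type", "")).lower(),
                "bbox": list(child.get("bbox") or []),
                "polygon": child.get("polygon") or [],
                "id": str(child.get("id", "")),
            }
        )
        _collect(child.get("children") or [], out)


def parse_layout(dumped: dict[str, Any]) -> dict[str, list]:
    """Flatten an assumed nested JSON dump into per-page block lists."""
    layout: dict[str, list] = {"pages": []}
    for page_idx, page in enumerate(dumped.get("children") or []):
        blocks: list[dict[str, Any]] = []
        _collect(page.get("children") or [], blocks)
        # page_idx is the position in the output, which may differ from the
        # PDF page number when only a page range was converted.
        layout["pages"].append({"page": page_idx, "blocks": blocks})
    return layout
```

Using `.get` with fallbacks keeps the walker tolerant of blocks that omit a field, which helps while the real output shape is still being explored.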
pdf_path = Path(pdf_path)
if not pdf_path.exists():
    raise FileNotFoundError(f"PDF file not found: {pdf_path}")

_LOGGER.info(f"Starting Marker layout extraction for: {pdf_path}")

# Call Marker's layout detection
# Use the real PdfConverter API for layout detection
try:
    # Process the PDF and get layout information
    layout_result = _call_marker_layout_detection(str(pdf_path), pages)
except Exception as e:

Passing in the pdf_path is fine for now, as it makes testing and development easier, but later it will need to be converted to use s3_url: download the PDF as bytes, then pass those bytes to marker instead of a local file path.
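Until it is confirmed whether the converter accepts bytes directly, one interim option is to bridge through a temporary file. This is a sketch of that swapped-in technique, not the approach the comment ultimately asks for: `convert_pdf_bytes` is a hypothetical helper, the S3 download itself is out of scope, and `convert` stands in for whatever callable takes a file path (for example the `PdfConverter` instance).

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def convert_pdf_bytes(pdf_bytes: bytes, convert: Callable[[str], T]) -> T:
    """Write PDF bytes to a temp file, run `convert` on its path, then clean up."""
    # Cheap sanity check: PDF files start with the "%PDF" magic bytes.
    if not pdf_bytes.startswith(b"%PDF"):
        raise ValueError("payload does not look like a PDF")
    fd, tmp_path = tempfile.mkstemp(suffix=".pdf")
    try:
        # Close the handle before converting so the file is fully flushed.
        with os.fdopen(fd, "wb") as handle:
            handle.write(pdf_bytes)
        return convert(tmp_path)
    finally:
        # Remove the temp file whether or not conversion succeeded.
        os.remove(tmp_path)


def read_back(path: str) -> bytes:
    """Stand-in converter for demonstration: returns the file's contents."""
    with open(path, "rb") as handle:
        return handle.read()


echoed = convert_pdf_bytes(b"%PDF-1.4 demo", read_back)
```

Injecting the converter as a parameter keeps the helper testable without loading any models.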
… detection
- Implement three-function structure: create_marker_config(), run_marker(), parse_marker_output()
- Fix marker to return JSONOutput instead of MarkdownOutput using ConfigParser
- Optimize configuration for layout detection workflow
- Successfully extract layout data with proper JSON structure
Summary
I have successfully implemented a comprehensive Marker layout extraction system for the extralit-server. This implementation provides optimized PDF layout detection capabilities that can identify and extract tables, figures, and text blocks without requiring full OCR processing.
✅ What Was Implemented
Core Features:
Files Created/Modified:
1. Main Job Implementation (ocr_jobs.py)
- async_marker_layout_job() - Main async job function with full RQ integration
- marker_layout_job() - Synchronous wrapper for RQ compatibility
- _call_marker_layout_detection() - Core Marker API integration using ConfigParser and PdfConverter
2. Enhanced Utility Functions
Key Technical Features:
Marker Integration:
- PdfConverter with ConfigParser for optimal configuration
- create_model_dict() for proper model initialization
Robust Data Processing:
🚀 Usage Example
🔄 Future Enhancement Opportunities