Add Kreuzberg document intelligence MCP shared workflow #28392

Merged. pelikhan merged 1 commit into main from copilot/configure-mcp-workflow, Apr 25, 2026

Copilot AI commented Apr 25, 2026

Adds a shared MCP component that wraps the Kreuzberg document intelligence engine, which extracts content from 97+ formats (PDF, DOCX, images via Tesseract OCR, legacy Office via LibreOffice) and is served over stdio JSON-RPC 2.0.

Changes

  • New file: .github/workflows/shared/mcp/kreuzberg.md
    • Container: ghcr.io/kreuzberg-dev/kreuzberg:latest, entrypoint arg mcp
    • Workspace mounted read-only (${GITHUB_WORKSPACE}:${GITHUB_WORKSPACE}:ro) so extract_file / batch_extract_files resolve paths without remapping
    • Allows 10 read-only tools; excludes the mutating cache operations and one feature-flag-gated tool:
allowed:
  - "extract_file"        # local file → text + metadata
  - "extract_bytes"       # base64 content → text + metadata
  - "batch_extract_files" # multi-file in one call
  - "detect_mime_type"
  - "list_formats"
  - "get_version"
  - "embed_text"
  - "chunk_text"
  - "cache_stats"
  - "cache_manifest"
  # excluded: cache_clear, cache_warm (write), extract_structured (requires liter-llm build flag)
  • No secrets required.
  • Use via imports: [shared/mcp/kreuzberg.md].
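
For illustration, a call to one of the allowed tools over the stdio JSON-RPC 2.0 transport could be framed as below. This is a minimal sketch, not taken from the PR: the `tools/call` method and the `name`/`arguments` envelope follow the Model Context Protocol, the `path` argument comes from the tool table in the new file, and the helper name `build_tool_call` and the file path are hypothetical. A real client would also need to complete the MCP `initialize` handshake before sending this message.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a single line of JSON-RPC 2.0."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # json.dumps emits no raw newlines by default, so the result can be
    # written to the server's stdin as one newline-terminated message.
    return json.dumps(request)


# Hypothetical path: any file under the read-only workspace mount would do.
line = build_tool_call(1, "extract_file", {"path": "/workspace/docs/report.pdf"})
print(line)
```

Because the workspace is mounted at the same path inside the container, the `path` value can be the same absolute path the workflow sees on the runner.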

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/5c34a633-2709-4016-858d-9b24db09c75d

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
pelikhan marked this pull request as ready for review April 25, 2026 03:57
Copilot AI review requested due to automatic review settings April 25, 2026 03:57
pelikhan merged commit f187aa4 into main Apr 25, 2026
19 checks passed
pelikhan deleted the copilot/configure-mcp-workflow branch April 25, 2026 03:57
github-actions Bot mentioned this pull request Apr 25, 2026

Copilot AI left a comment

Pull request overview

Adds a shared MCP server configuration for the Kreuzberg document intelligence engine, enabling read-only document extraction and related utilities via a containerized stdio JSON-RPC 2.0 server.

Changes:

  • Introduces a new shared MCP import file defining the Kreuzberg container server and a curated read-only tool allowlist.
  • Mounts the GitHub workspace read-only to allow extract_file / batch_extract_files to resolve repository paths.
  • Includes inline documentation describing available/excluded tools, container image options, and usage.

| File | Description |
|---|---|
| `.github/workflows/shared/mcp/kreuzberg.md` | Defines the Kreuzberg MCP server (container + mounts + allowed tools) for reuse via imports. |
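
The effect of the curated allowlist can be sketched as a simple membership check. This is only an illustration of the policy listed in the PR description, not how gh-aw enforces it. The tool names are copied from the `allowed:` list and its `# excluded:` comment, while the constant and function names are hypothetical.

```python
# Tool names copied from the PR's `allowed:` list.
ALLOWED_TOOLS = frozenset({
    "extract_file",
    "extract_bytes",
    "batch_extract_files",
    "detect_mime_type",
    "list_formats",
    "get_version",
    "embed_text",
    "chunk_text",
    "cache_stats",
    "cache_manifest",
})

# Tools the PR names as deliberately excluded.
EXCLUDED_TOOLS = frozenset({"cache_clear", "cache_warm", "extract_structured"})


def is_allowed(tool: str) -> bool:
    """Return True only for tools on the read-only allowlist."""
    return tool in ALLOWED_TOOLS


print(is_allowed("extract_file"))
print(is_allowed("cache_clear"))
```

Anything not on the allowlist is rejected by default, so a tool added to the server in a later image would stay unavailable until it is listed explicitly.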

Copilot's findings

  • Files reviewed: 1/1 changed files
  • Comments generated: 1

Comment on lines +57 to +68
| Tool | Params | Description |
|---|---|---|
| `extract_file` | `path` | Extract text and metadata from a local file |
| `extract_bytes` | `data` (base64) | Extract from base64-encoded file content |
| `batch_extract_files` | `paths` | Extract multiple files in one call |
| `detect_mime_type` | `path` | Identify a file's MIME type |
| `list_formats` | — | List all supported file formats |
| `get_version` | — | Return the library version string |
| `embed_text` | `texts` | Generate embedding vectors for text chunks |
| `chunk_text` | `text` | Split text into overlapping chunks |
| `cache_stats` | — | Report how much content is cached |
| `cache_manifest` | — | Return model checksums |

Copilot AI Apr 25, 2026

The markdown table under "Available tools (read-only)" uses double leading pipes (|| ...) on each row, which is not valid table syntax and makes the table hard to read/parse in raw form. Use standard single-pipe table rows (e.g., | Tool | Params | Description |, |---|---|---|, etc.).
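
One way to apply the suggested fix mechanically is to collapse a run of leading pipes on each row into a single pipe. The helper below is a hedged sketch with a hypothetical name (`normalize_table_row`), not code from the PR. It assumes no row intentionally starts with an empty first cell, since that would also begin with two pipes.

```python
import re


def normalize_table_row(line: str) -> str:
    """Collapse two or more leading pipes into one; leave other lines unchanged."""
    return re.sub(r"^(\s*)\|{2,}", r"\1|", line)


rows = [
    "|| Tool | Params | Description |",
    "||---|---|---|",
    "|| `get_version` | — | Return the library version string |",
]
for row in rows:
    print(normalize_table_row(row))
```

Rows that already use a single leading pipe, and lines that are not table rows, pass through untouched.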

github-actions Bot added a commit that referenced this pull request Apr 25, 2026
- Add Kreuzberg document intelligence MCP server as a dedicated row in
  the shared MCP configurations table in mcps.md (#28392)
- Update the observability glossary entry to mention github.workflow_ref
  as a resource attribute on all OTel spans, enabling workflow filtering
  in multi-workflow repositories (#28358)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>