Phagocyte

End-to-end pipeline: Research → Parse References → Acquire Documents → Ingest → RAG Vector Store

An automated workflow that conducts AI-powered research, extracts the references it cites, acquires the corresponding academic papers, converts them into structured markdown, and builds a searchable vector database for RAG applications.

Pipeline Architecture

Modules

| Module | Description | Key Features |
|--------|-------------|--------------|
| researcher | AI-powered deep research | Gemini Deep Research, citation extraction |
| parser | Reference extraction & acquisition | Regex + AI parsing, multi-source downloads, DOI→BibTeX |
| ingestor | Document → Markdown conversion | PDF, Web, GitHub, YouTube, Audio support |
| processor | RAG document processing | AST-aware chunking, embeddings, LanceDB vector store |
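
To illustrate what "AST-aware chunking" means in the processor row, here is a minimal sketch that splits Python source into one chunk per top-level function or class using the standard-library `ast` module. It is an illustration of the general idea only: the function name `chunk_python_source` and the chunk dictionary layout are assumptions, and the processor's actual chunker (its parser, supported languages, and chunk sizes) may work differently.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level function or class.

    Hypothetical helper for illustration; not the processor's real API.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Include decorators, which sit above the def/class line.
            start = min([d.lineno for d in node.decorator_list] + [node.lineno])
            end = node.end_lineno  # inclusive, 1-based
            chunks.append(
                {
                    "name": node.name,
                    "start": start,
                    "end": end,
                    "text": "\n".join(lines[start - 1 : end]),
                }
            )
    return chunks


SAMPLE = '''import os

def load(path):
    return open(path).read()

class Store:
    def add(self, item):
        pass
'''

if __name__ == "__main__":
    for chunk in chunk_python_source(SAMPLE):
        print(chunk["name"], chunk["start"], chunk["end"])
```

Splitting on syntax-tree boundaries keeps each function or class intact, whereas a fixed-size character window can cut a definition in half.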

Quick Start

# Install
git clone https://github.com/SIslamMun/Phagocyte.git
cd Phagocyte && uv sync

# Run pipeline
uv run phagocyte research "HDF5 best practices" -o ./output
uv run phagocyte parse refs ./output/research_report.md --export-batch
uv run phagocyte parse batch ./output/batch.json -o ./papers
uv run phagocyte ingest batch ./papers -o ./markdown
uv run phagocyte process run ./markdown -o ./lancedb
uv run phagocyte process search ./lancedb "chunking strategies"

Commands

Research

uv run phagocyte research <topic>         # Deep research with Gemini

Options:

Option	Short	Description
`--output`	`-o`	Output directory for research results
`--mode`	`-m`	Research mode: `undirected`, `directed`, `no-research`
`--artifact`	`-a`	Include specific artifacts in output
`--api-key`		Google API key (or set `GOOGLE_API_KEY`)
`--verbose`	`-v`	Enable verbose logging

Parse (10 commands)

uv run phagocyte parse refs <file>        # Extract references from document
uv run phagocyte parse retrieve <id>      # Download single paper (DOI/arXiv/title)
uv run phagocyte parse batch <file>       # Batch download papers
uv run phagocyte parse doi2bib <doi>      # Convert DOI to BibTeX/JSON
uv run phagocyte parse verify <bib>       # Verify citations against CrossRef
uv run phagocyte parse citations <id>     # Get citation graph
uv run phagocyte parse sources            # List available paper sources
uv run phagocyte parse auth               # Authenticate with institution
uv run phagocyte parse init               # Initialize config file
uv run phagocyte parse config push/pull   # Sync config via GitHub gist
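
For context on what a DOI-to-BibTeX conversion involves, one common approach is DOI content negotiation: requesting `https://doi.org/<doi>` with an `Accept: application/x-bibtex` header. The sketch below shows that approach with the standard library only. It is not necessarily how `parse doi2bib` is implemented, and the function names are hypothetical.

```python
import urllib.parse
import urllib.request


def build_bibtex_request(doi: str) -> urllib.request.Request:
    """Build (but do not send) a content-negotiation request for a DOI."""
    # Keep "/" unescaped: it separates the DOI prefix from the suffix.
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": "application/x-bibtex"})


def fetch_bibtex(doi: str, timeout: float = 10.0) -> str:
    """Send the request and return the BibTeX entry as text (needs network)."""
    with urllib.request.urlopen(build_bibtex_request(doi), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # "10.1000/example.123" is a placeholder DOI, not a real paper.
    req = build_bibtex_request("10.1000/example.123")
    print(req.full_url, req.get_header("Accept"))
```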

Options:

| Command | Option | Description |
|---------|--------|-------------|
| `refs` | `--agent` | AI agent: `none`, `claude`, `gemini` |
| `refs` | `--export-batch` | Export batch JSON for downloads |
| `refs` | `--export-dois` | Export DOI list only |
| `refs` | `--compare` | Compare AI vs regex extraction |
| `retrieve` | `--email` | Email for API rate limits |
| `batch` | `--concurrent` | Max concurrent downloads |
| `doi2bib` | `--format` | Output format: `bibtex`, `json`, `markdown` |
| `verify` | `--dry-run` | Preview without making changes |
| `citations` | `--direction` | Citation direction: `refs`, `cited-by`, `both` |
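
The regex side of reference extraction (the baseline that `--compare` sets against the AI agent) can be pictured with a sketch like the one below, which pulls DOIs and arXiv IDs out of free text. The patterns and the function name `extract_identifiers` are assumptions chosen for illustration; the parser's real patterns are not documented here and are likely more thorough.

```python
import re

# Illustrative patterns only; not the parser module's actual expressions.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+", re.IGNORECASE)
ARXIV_RE = re.compile(r"\barXiv:\s*(\d{4}\.\d{4,5})(v\d+)?", re.IGNORECASE)


def extract_identifiers(text: str) -> dict[str, list[str]]:
    """Return unique DOIs and arXiv IDs in order of first appearance."""
    dois: list[str] = []
    for match in DOI_RE.finditer(text):
        # Trailing punctuation usually belongs to the sentence, not the DOI.
        doi = match.group(0).rstrip(".,;)]")
        if doi not in dois:
            dois.append(doi)
    arxiv: list[str] = []
    for match in ARXIV_RE.finditer(text):
        if match.group(1) not in arxiv:
            arxiv.append(match.group(1))  # version suffix (v2) is dropped
    return {"dois": dois, "arxiv": arxiv}


if __name__ == "__main__":
    # Placeholder identifiers, not real papers.
    sample = (
        "Prior work (doi:10.1000/example.123). See also arXiv:2301.01234v2 "
        "and https://doi.org/10.1000/example.123."
    )
    print(extract_identifiers(sample))
```

A regex pass like this is fast and deterministic but misses references that carry no identifier, which is presumably where an AI agent adds value.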

Ingest (5 commands)

uv run phagocyte ingest file <source>     # Single file/URL to markdown
uv run phagocyte ingest batch <dir>       # Batch process folder
uv run phagocyte ingest crawl <url>       # Deep crawl website
uv run phagocyte ingest clone <repo>      # Clone and ingest git repo
uv run phagocyte ingest describe <path>   # Generate VLM image descriptions

Options:

| Command | Option | Description |
|---------|--------|-------------|
| `file` | `--describe-images` | Generate VLM descriptions for images |
| `file` | `--img-format` | Image format: `png`, `jpg`, `webp` |
| `batch` | `--recursive` | Process subdirectories |
| `batch` | `--concurrency` | Max parallel workers |
| `crawl` | `--max-pages` | Maximum pages to crawl |
| `crawl` | `--max-depth` | Maximum link depth |
| `crawl` | `--strategy` | Crawl strategy: `bfs`, `dfs`, `bestfirst` |
| `clone` | `--branch` | Git branch to clone |
| `clone` | `--shallow` | Shallow clone (faster) |
| `clone` | `--max-files` | Maximum files to process |
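
To show how `--max-pages` and `--max-depth` interact under a `bfs` strategy, here is a small breadth-first traversal over an in-memory link graph. It is a generic sketch of the algorithm, not the ingestor's crawler: `bfs_crawl`, `get_links`, and the `SITE` graph are all made up for the example, and no network access is involved.

```python
from collections import deque
from typing import Callable, Iterable


def bfs_crawl(
    start: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: int,
    max_depth: int,
) -> list[str]:
    """Visit pages breadth-first, bounded by page count and link depth."""
    order: list[str] = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)  # stands in for "fetch and convert this page"
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# Hypothetical site structure used only for this example.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api", "/"],
    "/blog": ["/blog/post-1"],
    "/docs/api": ["/docs/api/v2"],
}

if __name__ == "__main__":
    print(bfs_crawl("/", lambda u: SITE.get(u, []), max_pages=10, max_depth=1))
```

In general terms, a depth-first variant would pop from the same end it pushes to, and a best-first variant would replace the queue with a priority queue ordered by a relevance score.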

Process (11 commands)

uv run phagocyte process run <input>      # Process into LanceDB
uv run phagocyte process search <db> <q>  # Search vector database
uv run phagocyte process stats <db>       # Show database statistics
uv run phagocyte process setup            # Download embedding models
uv run phagocyte process check            # Check service availability
uv run phagocyte process export <db> <out> # Export database
uv run phagocyte process import <in> <db> # Import database
uv run phagocyte process visualize <db>   # Browse with lance-data-viewer
uv run phagocyte process deploy <db>      # Start web interface
uv run phagocyte process server <db>      # Deploy REST API via Docker
uv run phagocyte process test-e2e         # Run end-to-end validation

Options:

| Command | Option | Description |
|---------|--------|-------------|
| `run` | `--text-profile` | Chunking profile: `low`, `medium`, `high` |
| `run` | `--code-profile` | Code chunking: `low`, `medium`, `high` |
| `run` | `--table-mode` | Table handling: `separate`, `unified`, `both` |
| `run` | `--incremental` | Only process new/changed files |
| `run` | `--batch-size` | Documents per batch |
| `search` | `--limit` | Maximum results to return |
| `search` | `--table` | Table to search: `text_chunks`, `code_chunks` |
| `search` | `--hybrid` | Enable hybrid search (vector + FTS) |
| `search` | `--rerank` | Rerank results with cross-encoder |
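
Hybrid search has to merge two ranked lists, one from vector similarity and one from full-text search. One widely used way to do that is reciprocal rank fusion (RRF), sketched below. This README does not say which fusion method `--hybrid` uses, so treat RRF, the constant `k=60`, and the function name as assumptions chosen for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into a single ranking.

    Each list contributes 1 / (k + rank) per item, with rank starting at 1.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken by ID so the order is stable.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


if __name__ == "__main__":
    vector_hits = ["c1", "c2", "c3"]  # hypothetical vector-search result order
    fts_hits = ["c3", "c1", "c4"]     # hypothetical full-text result order
    print(reciprocal_rank_fusion([vector_hits, fts_hits]))
```

RRF only needs rank positions, so it sidesteps the problem that vector distances and full-text scores live on different scales.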

Installation

git clone https://github.com/SIslamMun/Phagocyte.git
cd Phagocyte
uv sync

# Install module extras as needed
cd src/ingestor && uv sync --extra all && cd ../..
cd src/processor && uv sync && cd ../..

Requirements

Python 3.11+
uv package manager
Ollama for embeddings (processor)
API Keys: GOOGLE_API_KEY (researcher), ANTHROPIC_API_KEY (optional)
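
A quick preflight script can confirm these requirements before the first run. The sketch below is an assumption-laden convenience, not part of Phagocyte: `check_requirements` is a hypothetical name, and looking for an `ollama` executable on `PATH` is only a rough proxy, since Ollama may be running elsewhere (the `process check` command listed above is the tool's own way to check service availability).

```python
import os
import shutil
import sys
from typing import Mapping, Optional


def check_requirements(env: Optional[Mapping[str, str]] = None) -> dict[str, bool]:
    """Report which prerequisites from the Requirements list look satisfied."""
    env = os.environ if env is None else env
    return {
        "python>=3.11": sys.version_info >= (3, 11),
        "uv": shutil.which("uv") is not None,
        # Rough proxy: Ollama could also be reachable as a remote service.
        "ollama": shutil.which("ollama") is not None,
        "GOOGLE_API_KEY": bool(env.get("GOOGLE_API_KEY")),
        "ANTHROPIC_API_KEY (optional)": bool(env.get("ANTHROPIC_API_KEY")),
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{'ok     ' if ok else 'missing'}  {name}")
```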

Documentation

src/researcher/README.md - Research module docs
src/parser/README.md - Parser module docs
src/ingestor/README.md - Ingestor module docs
src/processor/README.md - Processor module docs

MCP Servers

MCP servers for AI agent integration (Claude Desktop, Claude Code, Cursor, VS Code Copilot, Windsurf, Zed):

| Server | Tools | Description |
|--------|-------|-------------|
| `researcher-mcp` | 3 | Deep research, list topics, get report |
| `parser-mcp` | 7 | Retrieve paper, batch download, parse refs, doi2bib, verify, citations, sources |
| `ingestor-mcp` | 10 | Ingest file/URL, batch, crawl, clone repo, describe images, transcribe audio |
| `processor-mcp` | 6 | Process files, search, stats, setup, check, export |
| `rag-mcp` | 4 | Semantic search, hybrid search, list tables, get schema |

# Run from src/ directories
cd src/researcher && uv run researcher-mcp
cd src/parser && uv run parser-mcp
cd src/ingestor && uv run ingestor-mcp
cd src/processor && uv run processor-mcp
cd src/processor && uv run rag-mcp
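
Most MCP clients register servers through a JSON file with an `mcpServers` map of `command` and `args` entries. The sketch below generates such a block for the five servers, using `uv --directory <path> run <server>` in place of the `cd ... && uv run ...` form shown above. The exact config file location and any extra fields differ per client, and `build_mcp_config` is a hypothetical helper, so check your client's documentation before relying on the output.

```python
import json
from pathlib import Path

# Server name -> module directory, taken from the run commands above.
SERVERS = {
    "researcher-mcp": "src/researcher",
    "parser-mcp": "src/parser",
    "ingestor-mcp": "src/ingestor",
    "processor-mcp": "src/processor",
    "rag-mcp": "src/processor",
}


def build_mcp_config(repo_root: Path) -> dict:
    """Build an mcpServers block that launches each server through uv."""
    return {
        "mcpServers": {
            name: {
                "command": "uv",
                # --directory makes uv switch to the module folder first.
                "args": ["--directory", (repo_root / subdir).as_posix(), "run", name],
            }
            for name, subdir in SERVERS.items()
        }
    }


if __name__ == "__main__":
    # Replace the path with the absolute location of your clone.
    print(json.dumps(build_mcp_config(Path("/path/to/Phagocyte")), indent=2))
```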

License

MIT
