# PaperGraph

> GraphRAG over arXiv — explore, ingest, and query scientific research through a structured knowledge graph.

PaperGraph is a high-performance GraphRAG (Retrieval-Augmented Generation over Knowledge Graphs) system. It transforms unstructured research papers into a knowledge graph of entities and relationships, enabling multi-hop reasoning that standard vector-only RAG cannot achieve.
## 🏗️ Architecture

PaperGraph follows a two-stage architecture: **Automated Ingestion** and **Hybrid Retrieval**.
```
+-------------+      +-------------------+      +-----------------------+
|  arXiv API  |----->| Python Ingestion  |----->| LLM Extraction (GPT)  |
+-------------+      +---------+---------+      +-----------+-----------+
                               |                            |
                               v                            v
                     +---------+---------+      +-----------+-----------+
                     |   Neo4j AuraDB    |<-----|    Knowledge Graph    |
                     +---------+---------+      +-----------------------+
                               |
                               v
                     +---------+---------+
                     |   Vector Index    |
                     +-------------------+
```
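The ingestion stage fetches paper metadata from the public arXiv Atom API and hands each abstract to the LLM extractor. A minimal sketch of the fetch-and-parse step, assuming the standard Atom namespace (the function and field names here are illustrative, not PaperGraph's actual module API):

```python
# Sketch of the arXiv ingestion step: fetch Atom XML from the export API
# and parse it into plain dicts ready for LLM entity extraction.
# Function and field names are illustrative, not PaperGraph's real API.
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by arXiv

def parse_atom(xml_text: str) -> list[dict]:
    """Extract id, title, abstract, and authors from an arXiv Atom feed."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.iter(f"{ATOM}entry"):
        papers.append({
            "id": entry.findtext(f"{ATOM}id", "").strip(),
            # arXiv titles often contain hard line breaks; normalize whitespace
            "title": " ".join(entry.findtext(f"{ATOM}title", "").split()),
            "abstract": entry.findtext(f"{ATOM}summary", "").strip(),
            "authors": [a.findtext(f"{ATOM}name", "").strip()
                        for a in entry.iter(f"{ATOM}author")],
        })
    return papers

def fetch_papers(category: str = "cs.AI", max_results: int = 10) -> list[dict]:
    """Query the arXiv export API (rate-limited; be polite in real use)."""
    url = ("http://export.arxiv.org/api/query"
           f"?search_query=cat:{category}&max_results={max_results}")
    with urllib.request.urlopen(url) as resp:
        return parse_atom(resp.read().decode("utf-8"))
```

Each parsed dict then becomes the input to one extraction call, which is what keeps the per-document LLM cost predictable.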
```
                 [ User Question ]
                         |
                         v
                +-----------------+
                |  Query Router   |
                +--------+--------+
                         |
                 /-------+-------\
                 |               |
            [ GLOBAL ]       [ LOCAL ]
                 |               |
            [ Cypher ]   [ Hybrid Search ]
                 |               |
            [ Stats ]    [ Graph Expand ]
                 |               |
                 \-------+-------/
                         |
                         v
                +-----------------+
                |  GPT-4o Answer  |
                +--------+--------+
                         |
                         v
                    [ UI App ]
```
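The router's GLOBAL/LOCAL split can be approximated with a simple heuristic: aggregate questions ("how many", "most cited") go down the Cypher statistics branch, everything else to hybrid vector + graph search. A toy sketch, where the keyword list and function names are assumptions (the production router could equally be an LLM classifier):

```python
# Toy sketch of the GLOBAL/LOCAL query router shown in the diagram above.
# Keywords and names are illustrative assumptions, not the real router.
GLOBAL_HINTS = ("how many", "count", "most cited", "top", "trend",
                "average", "distribution", "overall")

def route(question: str) -> str:
    """Return 'GLOBAL' for aggregate/statistical questions, else 'LOCAL'."""
    q = question.lower()
    return "GLOBAL" if any(hint in q for hint in GLOBAL_HINTS) else "LOCAL"

def plan(question: str) -> str:
    """Map the routing decision to the retrieval strategy from the diagram."""
    if route(question) == "GLOBAL":
        return "run aggregate Cypher, summarize stats with GPT-4o"
    return "hybrid vector search, expand graph neighborhood, answer with GPT-4o"
```

A keyword heuristic is cheap and deterministic; the trade-off is brittleness, which is why routers often fall back to an LLM call for ambiguous questions.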
## 📁 Project Structure

The project is split into a Python (FastAPI) backend and a React (Next.js) frontend.

```
papergraph/
├── backend/             # FastAPI application
│   ├── main.py          # API entry point
│   ├── ingestion/       # arXiv fetching & processing
│   ├── graph/           # Neo4j connections & LLM extraction
│   └── retrieval/       # Hybrid GraphRAG logic
├── frontend/            # Next.js application
│   ├── src/app/         # Pages & UI components
│   ├── package.json     # Node.js dependencies
│   └── public/          # Static assets
├── .env                 # Environment variables (Azure, Neo4j)
├── requirements.txt     # Python dependencies
└── README.md
```
---
## 🚀 Getting Started
### 1. Backend Setup
1. Create a `.env` file (see `.env.example`).
2. Install dependencies: `pip install -r requirements.txt`
3. Start the API: `uvicorn backend.main:app --reload`
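The `.env` file from step 1 typically holds the Azure OpenAI and Neo4j credentials. An illustrative layout — the variable names below are assumptions, and `.env.example` in the repo is the authoritative reference:

```
# Illustrative .env layout; see .env.example for the real variable names
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT=gpt-4o
NEO4J_URI=neo4j+s://<your-instance>.databases.neo4j.io
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=...
```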
### 2. Frontend Setup
1. Navigate to `/frontend`.
2. Install dependencies: `npm install`
3. Start the UI: `npm run dev`
### 3. Data Ingestion
PaperGraph is built for scale.
* **Current State:** Successfully ingested **50 documents** as a baseline.
* **Scalability:** You can easily ingest **500, 5,000, or more** documents by adjusting the `--max` parameter in the ingestion script.
```bash
python -m backend.ingestion.pipeline --max 100 --category cs.AI
```

- **Knowledge Density:** GraphRAG performs best when papers are within the same domain (e.g., `cs.AI`), allowing for rich link extraction between authors and methods.
- **Cost Efficiency:** Entity extraction uses one LLM call per abstract. Ingesting 50 documents costs ~$0.10 with GPT-4o.
- **Neo4j Aura:** The free tier of Neo4j Aura provides plenty of space for ~1,000 research papers and their associated entities. Note: free-tier instances may expire after 14 days of inactivity.
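The ~$0.10 figure can be sanity-checked with back-of-the-envelope arithmetic. The token counts and per-million-token prices below are illustrative assumptions, not measured values — check current GPT-4o pricing before budgeting a large run:

```python
# Back-of-the-envelope ingestion cost estimate (one LLM call per abstract).
# Token counts and prices are illustrative assumptions, not measured values.
PRICE_IN_PER_M = 2.50    # USD per 1M input tokens (example GPT-4o rate)
PRICE_OUT_PER_M = 10.00  # USD per 1M output tokens (example GPT-4o rate)

def estimate_cost(num_docs: int,
                  in_tokens: int = 400,   # prompt + abstract, per document
                  out_tokens: int = 100   # extracted entities JSON, per document
                  ) -> float:
    """Estimated USD cost of entity extraction for num_docs abstracts."""
    per_doc = (in_tokens * PRICE_IN_PER_M
               + out_tokens * PRICE_OUT_PER_M) / 1_000_000
    return num_docs * per_doc

print(f"50 docs   ~ ${estimate_cost(50):.2f}")
print(f"5000 docs ~ ${estimate_cost(5000):.2f}")
```

Because cost scales linearly with document count, scaling from 50 to 5,000 papers multiplies the bill by exactly 100.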
## 🧠 Graph Schema

| Node | Description | Relationships |
|---|---|---|
| `Paper` | Research articles | `AUTHORED_BY`, `CITES`, `PROPOSES` |
| `Author` | Researchers | `AFFILIATED_WITH` |
| `Method` | Algorithms/models | `USED_BY`, `PROPOSED_IN` |
| `Institution` | Universities/labs | `AFFILIATES` |
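Extracted entities map onto this schema via Cypher `MERGE` statements, so re-ingesting the same paper is idempotent. A minimal sketch of how an ingestion step might build a parameterized upsert for one paper and its authors (the query text and function name are illustrative, not the project's actual code):

```python
# Sketch: turn one extracted paper into a parameterized Cypher upsert
# following the schema table above. Names are illustrative assumptions.
def paper_upsert(paper: dict) -> tuple[str, dict]:
    """Build a (query, params) pair; MERGE keeps re-ingestion idempotent."""
    query = (
        "MERGE (p:Paper {id: $id}) "
        "SET p.title = $title "
        "WITH p UNWIND $authors AS name "
        "MERGE (a:Author {name: name}) "
        "MERGE (p)-[:AUTHORED_BY]->(a)"
    )
    params = {
        "id": paper["id"],
        "title": paper["title"],
        "authors": paper.get("authors", []),
    }
    return query, params

# With the official neo4j Python driver this would run roughly as:
#   with driver.session() as session:
#       query, params = paper_upsert(paper)
#       session.run(query, params)
```

Parameterized queries (rather than string interpolation) are what Neo4j recommends for both safety and query-plan caching.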
---

Built with **Neo4j Aura** · **OpenAI GPT-4o** · **FastAPI** · **Next.js**
🔗 Demo Video: You can see the application walkthrough here