# PaperGraph

> GraphRAG over arXiv — explore, ingest, and query scientific research through a structured knowledge graph.

PaperGraph is a high-performance GraphRAG (Retrieval-Augmented Generation over Knowledge Graphs) system. It transforms unstructured research papers into a knowledge graph of entities and relationships, enabling multi-hop reasoning that standard vector-only RAG cannot achieve.
## 🏗️ Architecture

PaperGraph follows a two-stage architecture: **Automated Ingestion** and **Hybrid Retrieval**.
```
+-------------+      +-------------------+      +-----------------------+
|  arXiv API  |----->| Python Ingestion  |----->| LLM Extraction (GPT)  |
+-------------+      +---------+---------+      +-----------+-----------+
                               |                            |
                               v                            v
                     +---------+---------+      +-----------+-----------+
                     |   Neo4j AuraDB    |<-----|    Knowledge Graph    |
                     +---------+---------+      +-----------------------+
                               |
                               v
                     +---------+---------+
                     |   Vector Index    |
                     +-------------------+
```
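The ingestion stage fetches paper metadata from the public arXiv Atom API and hands each abstract to the LLM extractor. A minimal sketch of the fetch-and-parse step, assuming the standard Atom namespace (the function and field names here are illustrative, not PaperGraph's actual module API):

```python
# Sketch of the arXiv ingestion step: fetch Atom XML from the export API
# and parse it into plain dicts ready for LLM entity extraction.
# Function and field names are illustrative, not PaperGraph's real API.
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by arXiv

def parse_atom(xml_text: str) -> list[dict]:
    """Extract id, title, abstract, and authors from an arXiv Atom feed."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.iter(f"{ATOM}entry"):
        papers.append({
            "id": entry.findtext(f"{ATOM}id", "").strip(),
            # arXiv titles often contain hard line breaks; normalize whitespace
            "title": " ".join(entry.findtext(f"{ATOM}title", "").split()),
            "abstract": entry.findtext(f"{ATOM}summary", "").strip(),
            "authors": [a.findtext(f"{ATOM}name", "").strip()
                        for a in entry.iter(f"{ATOM}author")],
        })
    return papers

def fetch_papers(category: str = "cs.AI", max_results: int = 10) -> list[dict]:
    """Query the arXiv export API (rate-limited; be polite in real use)."""
    url = ("http://export.arxiv.org/api/query"
           f"?search_query=cat:{category}&max_results={max_results}")
    with urllib.request.urlopen(url) as resp:
        return parse_atom(resp.read().decode("utf-8"))
```

Each parsed dict then becomes the input to one extraction call, which is what keeps the per-document LLM cost predictable.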
```
                 [ User Question ]
                         |
                         v
                +-----------------+
                |  Query Router   |
                +--------+--------+
                         |
                 /-------+-------\
                 |               |
            [ GLOBAL ]       [ LOCAL ]
                 |               |
            [ Cypher ]   [ Hybrid Search ]
                 |               |
            [ Stats ]    [ Graph Expand ]
                 |               |
                 \-------+-------/
                         |
                         v
                +-----------------+
                |  GPT-4o Answer  |
                +--------+--------+
                         |
                         v
                    [ UI App ]
```
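The router's GLOBAL/LOCAL split can be approximated with a simple heuristic: aggregate questions ("how many", "most cited") go down the Cypher statistics branch, everything else to hybrid vector + graph search. A toy sketch, where the keyword list and function names are assumptions (the production router could equally be an LLM classifier):

```python
# Toy sketch of the GLOBAL/LOCAL query router shown in the diagram above.
# Keywords and names are illustrative assumptions, not the real router.
GLOBAL_HINTS = ("how many", "count", "most cited", "top", "trend",
                "average", "distribution", "overall")

def route(question: str) -> str:
    """Return 'GLOBAL' for aggregate/statistical questions, else 'LOCAL'."""
    q = question.lower()
    return "GLOBAL" if any(hint in q for hint in GLOBAL_HINTS) else "LOCAL"

def plan(question: str) -> str:
    """Map the routing decision to the retrieval strategy from the diagram."""
    if route(question) == "GLOBAL":
        return "run aggregate Cypher, summarize stats with GPT-4o"
    return "hybrid vector search, expand graph neighborhood, answer with GPT-4o"
```

A keyword heuristic is cheap and deterministic; the trade-off is brittleness, which is why routers often fall back to an LLM call for ambiguous questions.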
## 📁 Project Structure

The project is split into a Python (FastAPI) backend and a React (Next.js) frontend.

```
papergraph/
├── backend/             # FastAPI application
│   ├── main.py          # API entry point
│   ├── ingestion/       # arXiv fetching & processing
│   ├── graph/           # Neo4j connections & LLM extraction
│   └── retrieval/       # Hybrid GraphRAG logic
├── frontend/            # Next.js application
│   ├── src/app/         # Pages & UI components
│   ├── package.json     # Node.js dependencies
│   └── public/          # Static assets
├── .env                 # Environment variables (Azure, Neo4j)
├── requirements.txt     # Python dependencies
└── README.md
```
---
## 🚀 Getting Started
### 1. Backend Setup
1. Create a `.env` file (see `.env.example`).
2. Install dependencies: `pip install -r requirements.txt`
3. Start the API: `uvicorn backend.main:app --reload`
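The `.env` file from step 1 typically holds the Azure OpenAI and Neo4j credentials. An illustrative layout — the variable names below are assumptions, and `.env.example` in the repo is the authoritative reference:

```
# Illustrative .env layout; see .env.example for the real variable names
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT=gpt-4o
NEO4J_URI=neo4j+s://<your-instance>.databases.neo4j.io
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=...
```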
### 2. Frontend Setup
1. Navigate to `/frontend`.
2. Install dependencies: `npm install`
3. Start the UI: `npm run dev`
### 3. Data Ingestion
PaperGraph is built for scale.
* **Current State:** Successfully ingested **50 documents** as a baseline.
* **Scalability:** You can easily ingest **500, 5,000, or more** documents by adjusting the `--max` parameter in the ingestion script.
```bash
python -m backend.ingestion.pipeline --max 100 --category cs.AI
```

- **Knowledge Density:** GraphRAG performs best when papers are within the same domain (e.g., `cs.AI`), allowing for rich link extraction between authors and methods.
- **Cost Efficiency:** Entity extraction uses one LLM call per abstract. Ingesting 50 documents costs ~$0.10 with GPT-4o.
- **Neo4j Aura:** The free tier of Neo4j Aura provides plenty of space for ~1,000 research papers and their associated entities. Note: free-tier instances may expire after 14 days of inactivity.
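The ~$0.10 figure can be sanity-checked with back-of-the-envelope arithmetic. The token counts and per-million-token prices below are illustrative assumptions, not measured values — check current GPT-4o pricing before budgeting a large run:

```python
# Back-of-the-envelope ingestion cost estimate (one LLM call per abstract).
# Token counts and prices are illustrative assumptions, not measured values.
PRICE_IN_PER_M = 2.50    # USD per 1M input tokens (example GPT-4o rate)
PRICE_OUT_PER_M = 10.00  # USD per 1M output tokens (example GPT-4o rate)

def estimate_cost(num_docs: int,
                  in_tokens: int = 400,   # prompt + abstract, per document
                  out_tokens: int = 100   # extracted entities JSON, per document
                  ) -> float:
    """Estimated USD cost of entity extraction for num_docs abstracts."""
    per_doc = (in_tokens * PRICE_IN_PER_M
               + out_tokens * PRICE_OUT_PER_M) / 1_000_000
    return num_docs * per_doc

print(f"50 docs   ~ ${estimate_cost(50):.2f}")
print(f"5000 docs ~ ${estimate_cost(5000):.2f}")
```

Because cost scales linearly with document count, scaling from 50 to 5,000 papers multiplies the bill by exactly 100.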
## 🧠 Graph Schema

| Node | Description | Relationships |
|---|---|---|
| `Paper` | Research articles | `AUTHORED_BY`, `CITES`, `PROPOSES` |
| `Author` | Researchers | `AFFILIATED_WITH` |
| `Method` | Algorithms/models | `USED_BY`, `PROPOSED_IN` |
| `Institution` | Universities/labs | `AFFILIATES` |
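Extracted entities map onto this schema via Cypher `MERGE` statements, so re-ingesting the same paper is idempotent. A minimal sketch of how an ingestion step might build a parameterized upsert for one paper and its authors (the query text and function name are illustrative, not the project's actual code):

```python
# Sketch: turn one extracted paper into a parameterized Cypher upsert
# following the schema table above. Names are illustrative assumptions.
def paper_upsert(paper: dict) -> tuple[str, dict]:
    """Build a (query, params) pair; MERGE keeps re-ingestion idempotent."""
    query = (
        "MERGE (p:Paper {id: $id}) "
        "SET p.title = $title "
        "WITH p UNWIND $authors AS name "
        "MERGE (a:Author {name: name}) "
        "MERGE (p)-[:AUTHORED_BY]->(a)"
    )
    params = {
        "id": paper["id"],
        "title": paper["title"],
        "authors": paper.get("authors", []),
    }
    return query, params

# With the official neo4j Python driver this would run roughly as:
#   with driver.session() as session:
#       query, params = paper_upsert(paper)
#       session.run(query, params)
```

Parameterized queries (rather than string interpolation) are what Neo4j recommends for both safety and query-plan caching.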
---

Built with **Neo4j Aura** · **OpenAI GPT-4o** · **FastAPI** · **Next.js**
🔗 Demo Video: You can see the application walkthrough here