[Summary] GraphGen Roadmap

# Background
To establish GraphGen as an essential tool for synthesizing training and evaluation data, this roadmap focuses on two core pillars: (1) a robust, multi-dimensional **data quality assessment** and filtering system that ensures the reliability of generated knowledge graphs, and (2) an architecture extended to support **multi-modal** and **multi-omics** data inputs.

> If you'd like to work on one of these tasks, please comment below to claim it and create an issue for the feature you'll be implementing.

# Features
## 1 GraphGen Framework
- [x] ♻️ Refactor the pipeline around base classes: `Reader` → `KG_Builder` → `Partitioner` → `Generator`. Data flow: Raw corpus → Reader → Splitter → KG_Builder → Partitioner → Generator → training / evaluation data: https://github.com/open-sciencelab/GraphGen/pull/59, https://github.com/open-sciencelab/GraphGen/pull/58, https://github.com/open-sciencelab/GraphGen/pull/52
- [x] 🔧 Operator-orchestration backbone: https://github.com/open-sciencelab/GraphGen/pull/85
- [ ] 🔍 Data provenance: ensure every record in the final training/evaluation set can be traced back to its original raw corpus through the full pipeline.
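To make the staged pipeline and the provenance goal above concrete, here is a minimal sketch. It is not GraphGen's actual code: `Record`, `Stage`, `run_pipeline`, and the toy `Splitter` and `Generator` bodies are hypothetical names and logic chosen only for illustration. The idea is that every record carries a `source_id` and a `trace` of the stages it passed through, so any output can be followed back to its raw document.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Record:
    """A unit of data plus the provenance needed to trace it back."""
    content: str
    source_id: str                                   # id of the raw document
    trace: list[str] = field(default_factory=list)   # stages applied so far


class Stage(ABC):
    """Hypothetical common base class for pipeline stages."""

    @abstractmethod
    def process(self, records: list[Record]) -> list[Record]:
        ...

    def __call__(self, records: list[Record]) -> list[Record]:
        out = self.process(records)
        for record in out:
            record.trace.append(type(self).__name__)  # log this stage
        return out


class Splitter(Stage):
    """Toy splitter: one record per sentence, provenance copied over."""

    def process(self, records):
        return [
            Record(chunk.strip(), r.source_id, list(r.trace))
            for r in records
            for chunk in r.content.split(".")
            if chunk.strip()
        ]


class Generator(Stage):
    """Toy generator: turns each chunk into a question string."""

    def process(self, records):
        return [
            Record(f"Q: What is stated about '{r.content}'?", r.source_id, list(r.trace))
            for r in records
        ]


def run_pipeline(stages: list[Stage], records: list[Record]) -> list[Record]:
    for stage in stages:
        records = stage(records)
    return records


raw = [Record("A binds B. B inhibits C.", source_id="doc-1")]
for item in run_pipeline([Splitter(), Generator()], raw):
    print(item.source_id, item.trace, item.content)
```

A real implementation would also need to persist the trace alongside the exported dataset; this sketch only shows the in-memory bookkeeping.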
## 2 Multi-Modal & Multi-Omics
- [x] 🧬 Define ImageNode, AudioNode, ProteinNode, etc.
- [x] 👁️‍🗨️ Vision–language fusion extraction: use open VLMs to generate "image–caption–entity" triples and write them into the graph: https://github.com/open-sciencelab/GraphGen/pull/69
- [ ] 🧪 Multi-omics extraction: process genomics/transcriptomics/proteomics with automatic node-property alignment
## 3 Data Quality & Curation
- [ ] 📊 Multi-dimensional quality metrics with a unified scoring API
- [x] 💓 Graph-quality assessment similar to KGHeartBeat: https://github.com/InternScience/GraphGen/pull/135
- [ ] 🎯 One-click export of high-quality sub-graphs and high-quality data
- [ ] ⚙️ Configurable pipeline: entity disambiguation, fact verification, redundancy removal, schema validation
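One possible shape for the unified scoring API mentioned above is sketched below. All names here (`completeness`, `uniqueness`, `score`) and the two toy metrics are assumptions made for illustration, not an existing GraphGen interface. Each metric maps a list of triples to a value in [0, 1], and `score` combines them into a weighted overall score.

```python
from typing import Callable

Triple = tuple[str, str, str]
Metric = Callable[[list[Triple]], float]


def completeness(triples: list[Triple]) -> float:
    """Fraction of triples whose head, relation, and tail are all non-empty."""
    if not triples:
        return 0.0
    return sum(all(part.strip() for part in t) for t in triples) / len(triples)


def uniqueness(triples: list[Triple]) -> float:
    """Fraction of distinct triples; lower values indicate redundancy."""
    if not triples:
        return 0.0
    return len(set(triples)) / len(triples)


def score(triples: list[Triple], metrics: dict[str, tuple[Metric, float]]) -> dict[str, float]:
    """Run every registered metric and add a weighted 'overall' score."""
    report = {name: fn(triples) for name, (fn, _) in metrics.items()}
    total_weight = sum(weight for _, weight in metrics.values())
    report["overall"] = sum(
        report[name] * weight for name, (_, weight) in metrics.items()
    ) / total_weight
    return report


triples = [("a", "r", "b"), ("a", "r", "b"), ("c", "", "d"), ("e", "r", "f")]
metrics = {"completeness": (completeness, 1.0), "uniqueness": (uniqueness, 1.0)}
print(score(triples, metrics))
```

Keeping metrics as plain callables in a dictionary means new dimensions can be registered without changing the aggregation code.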
## 4 Graph Construction
- [x] 🚀 Incremental & resumable construction: https://github.com/InternScience/GraphGen/pull/168
- [ ] ⚖️ Automatic data ratio optimization: dynamically adjust the mixing ratio of different data based on quality scores and training feedback to optimize model performance
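As a rough illustration of the quality-score half of the ratio optimization above, the sketch below converts per-source quality scores into mixing ratios with a softmax. This is an assumed technique, not the planned design: `mixing_ratios`, the `temperature` parameter, and the example source names are hypothetical, and training feedback is not modeled at all.

```python
import math


def mixing_ratios(quality: dict[str, float], temperature: float = 1.0) -> dict[str, float]:
    """Softmax over quality scores: higher-quality sources get a larger share.

    A lower temperature concentrates the mix on the best sources;
    a higher temperature pushes it toward a uniform mix.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(quality.values())  # subtract the max for numerical stability
    weights = {name: math.exp((q - top) / temperature) for name, q in quality.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical data sources with quality scores in [0, 1].
ratios = mixing_ratios({"atomic_qa": 0.9, "multi_hop_qa": 0.7, "cot": 0.5}, temperature=0.2)
print(ratios)
```

In a feedback loop, the quality scores could be updated after each training round and the ratios recomputed, but how that feedback is measured is left open here.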
## 5 Engineering
- [x] 📂 Support csv, json, jsonl, txt, pickle, parquet, pdf, and various triple input formats: https://github.com/open-sciencelab/GraphGen/pull/50, https://github.com/open-sciencelab/GraphGen/pull/65, https://github.com/open-sciencelab/GraphGen/pull/82, https://github.com/open-sciencelab/GraphGen/pull/81
- [x] 🗄️ Switchable unified storage layer: `kv_storage` (object store, rocksdb), `graph_storage` (networkx, kuzudb)
- [ ] ✅ Data validation powered by Pydantic
- [x] 👓 More inference servers and clients such as Azure and Ollama: https://github.com/open-sciencelab/GraphGen/pull/74
- [x] 🔍 Test cases
## 6 Community Detection & Data Synthesis
- [ ] 🔎 Apply multiple community-detection algorithms; generate data from communities and provide typical samples plus visualizations
- [ ] 🧠 Community summary → CoT data: use community summaries as few-shot examples to synthesize high-quality chain-of-thought data
- [ ] 💬 Multi-turn dialogue synthesis: random-walk sampling → multi-turn Q&A while maintaining context consistency
- [ ] 📈 Complexity grading for curriculum learning
- [ ] 🕵️‍♂️ Support comparison with baselines
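The random-walk idea for multi-turn dialogue synthesis listed above can be sketched as follows. This is a simplified illustration under stated assumptions: the graph is a plain adjacency dictionary, and `random_walk` and `walk_to_turns` are hypothetical helpers, not GraphGen functions. Each step of the walk becomes one dialogue turn, and the entities visited so far are carried along as context so later questions stay consistent with earlier ones.

```python
import random


def random_walk(adjacency: dict[str, list[str]], start: str, length: int,
                rng: random.Random) -> list[str]:
    """Walk up to `length` nodes from `start` without revisiting a node."""
    path = [start]
    while len(path) < length:
        candidates = [n for n in adjacency.get(path[-1], []) if n not in path]
        if not candidates:  # dead end: stop early
            break
        path.append(rng.choice(candidates))
    return path


def walk_to_turns(path: list[str]) -> list[dict]:
    """One turn per edge; `context` lists entities already under discussion."""
    return [
        {"turn": i + 1, "context": path[: i + 1], "ask_about": (path[i], path[i + 1])}
        for i in range(len(path) - 1)
    ]


graph = {
    "TP53": ["MDM2", "apoptosis"],
    "MDM2": ["TP53", "ubiquitination"],
    "apoptosis": ["TP53"],
    "ubiquitination": ["MDM2"],
}
path = random_walk(graph, "TP53", length=3, rng=random.Random(0))
for turn in walk_to_turns(path):
    print(turn)
```

Each turn dictionary would then be handed to an LLM prompt that asks about the `ask_about` edge while referring back to the `context` entities.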
## 7 UX, Docs & Community
- [ ] 📦 Streamlined pip install and usage
- [ ] 📓 Jupyter tutorial suite
- [ ] 📚 Comprehensive documentation
- [ ] 🗃️ Data & use case library
- [x] 🤝 Contributor guide & roadmap: clear labels, branching strategy, PR template, code of conduct
- [ ] 🌐 More user-friendly web interface
## 8 Others
- [ ] 📝 More standardized prompt & post-processing management; post-processing should be bound to prompts
- [ ] 🌐 Improve online connectivity
- [ ] 🔗 Enhanced coreference resolution during chunking

Further feature ideas are welcome. Feel free to suggest them and join the plan!
