Conversation
a3dd470 to
ab68b9f
Compare
There was a problem hiding this comment.
Pull request overview
This PR introduces the GraphRAG LLM package, which consolidates and refactors the language model infrastructure. The changes replace the existing graphrag.language_model module with a new graphrag-llm package, standardizing the interface for completion and embedding models across the codebase.
Changes:
- Removed the existing language model implementation and replaced it with the new
graphrag-llmpackage - Updated configuration models to use
completion_model_idandembedding_model_idinstead of genericmodel_id - Refactored all workflows and operations to use the new LLM interfaces (
LLMCompletionandLLMEmbedding)
Reviewed changes
Copilot reviewed 257 out of 260 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/verbs/util.py | Removed deprecated model configuration constants |
| tests/verbs/test_*.py | Updated test files to use new configuration utility functions |
| tests/unit/config/utils.py | Added new model config assertion helpers and updated default configs |
| tests/unit/config/test_config.py | Removed validation tests that are now handled by the LLM package |
| packages/graphrag/graphrag/config/models/*_config.py | Updated field names from model_id to completion_model_id/embedding_model_id |
| packages/graphrag/graphrag/index/workflows/*.py | Refactored to use new LLM creation functions and interfaces |
| packages/graphrag/graphrag/query/**/*.py | Updated query components to use new LLM interfaces |
| packages/graphrag/graphrag/language_model/** | Removed old language model implementation |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 251 out of 265 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
* Remove graph embedding and UMAP (#2048) * Remove umap/layout operation * Remove graph embedding * Bump unified-search to GR 2.5.0 * Remove graph vis from unified-search * Remove file filtering (#2050) * Remove document filtering * Semver * Fix integ tests * Fix file find tuple * Fix another dangling find tuple * Remove text unit grouping (#2052) * Remove text unit group_by_columns * Semver * Fix default token split test * Fix models in config test samples * Fix token length in context sort test * Fix document sort * Re-implement hierarchical Leiden (#2049) * Use graspologic-native hierarchical leiden * Re-implement largest_connected_component * Copy in modularity * Use graspologic-native directly in pyproject * Remove directed graph tests (we don't use this) * Semver * Remove graspologic dep * Use 4.1 and text-embedding-3-large as defaults * Update comment * Clean vector store (#2077) * clean vector store code * fix * fix launch.json --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Update v3/main missing config + functions (#2082) * reduce schema fields (#2089) * reduce schema fields * fix launch.json --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Remove strategy dicts (#2090) * Remove "strategy" from community reports config/workflow * Remove extraction strategy from extract_graph * Remove summarization strategy from extract_graph * Remove strategy from claim extraction * Strongly type prompt templates * Remove strategy from embed_text * Push hydrated params into community report workflows * Push hyrdated params into extract covariates * Push hydrated params into extract graph NLP * Push hydrated params into extract graph * Push hydrated params into text embeddings * Remove a few more low-level defaults * Semver * Remove configurable prompt delimiters * Update smoke tests * Remove fnllm (#2095) * Sort deps alpha * Remove multi search (#2093) * Remove multi-search from CLI * Remove multi-search from API * Flatten vector_store config * Push hydrated vector store down to embed_text * Remove outputs from config * Remove multi-search notebook/docs * Add missing response_type in basic search API * Fix basic search context and id mapping * Fix v1 migration notebook * Fix query entity search tests * V3 docs and cleanup (#2100) * Remove community contrib notebooks * Add migration notebook and breaking changes page edits * Update/polish docs * Make model instance name configurable * Add vector schema updates to v3 migration notebook * Spellcheck * Bump smoke test runtimes * Remove document overwrite (#2101) * remove document overwrite from vector store configuration * remove document overwrite and refactor load documents method * fix test * fix test * fix test --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Unified factory (#2105) * Simplify Factory interface * Migrate CacheFactory to standard base class * Migrate LoggerFactory to standard base class * Migrate StorageFactory to standard base class * Migrate VectorStoreFactory to standard base class * Update vector store example notebook * Delete notebook outputs * Move default providers into factories * Move retry/limit tests into integ * Split language model factories * Set smoke test tpm/rpm * Fix factory integ tests * Add method to smoke test, switch text to 'fast' * Fix text smoke config for fast workflow * Add new workflows to text smoke test * Convert input readers to a proper factory * Remove covariates from fast smoke test * Update docs for input factory * Bump smoke runtime * Even longer runtime * min-csv timeout * Remove unnecessary lambdas * Prefix vector store (#2106) * add prefix to vector store configuration and removal of container name * docs updated * change prefix property name * change prefix property name * feedback implemented --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * fix for container name * Restructure project as monorepo. (#2111) * Restructure project as monorepo. * Fix formatting * Storage fixes and cleanup (#2118) * Fix pipeline recursion * Remove base_dir from storage.find * Remove max_count from storage.find * Remove prefix on storage integ test * Add base_dir in creation_date test * Wrap base_dir in Path * Use constants for input/update directories * Nov 2025 housekeeping (#2120) * Remove gensim sideload * Split CI build/type checks from unit tests * Thorough review of docs to align with v3 * Format * Fix version * Fix type * Graphrag config (#2119) * Add load_config to graphrag-common package. * Empty graph guards (#2126) * Remove networkx from graph_extractor and clean out redundancy * Bubble pipeline error to console * Remove embeddings optional new (#2128) * remove optional embeddings * fix test * fix tests * fix pipeline * fix test * fix test * fix test * fix tests --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Format * Add empty checks for NLP graphs (#2133) * Init command asks for models (#2137) * Add init prompting for models * Remove hard-coded model config validation * Switch to typer option prompt for full CLI use with models * Update getting started for init model input * Bump request timeout and overall smoke test timeout * Add graphrag-storage. (#2127) * Add graphrag-storage. * Python update (3.13) (#2149) * Update to python 3.14 as default, with range down to 3.10 * Fix enum value in query cli * Update pyarrow * Update py version for storage package * Remove 3.10 * add fastuuid * Update Python support to 3.11-3.14 with stricter dependency constraints - Set minimum Python version to 3.11 (removed 3.10 support) - Added support for Python 3.14 - Updated CI workflows: single-version jobs use 3.14, matrix jobs use 3.11 and 3.14 - Fixed license format to use SPDX-compatible format for Python 3.14 - Updated pyarrow to >=22.0.0 for Python 3.14 wheel support - Added explicit fastuuid~=0.14 and blis~=1.3 for Python 3.14 compatibility - Replaced all loose version constraints (>=) with compatible release (~=) for better lock file control - Applied stricter versioning to all packages: graphrag, graphrag-common, graphrag-storage, unified-search-app * update uv lock * Pin blis to ~=1.3.3 to ensure Python 3.14 wheel availability * Update uv lock * Update numpy to >=2.0.0 for Python 3.14 Windows compatibility Numpy 1.25.x has access violation issues on Python 3.14 Windows. Numpy 2.x has proper Python 3.14 support including Windows wheels. * update uv lock * Update pandas to >=2.3.0 for numpy 2.x compatibility Pandas 2.2.x was compiled against numpy 1.x and causes ABI incompatibility errors with numpy 2.x. Pandas 2.3.0+ supports numpy 2.x properly. * update uv.lock * Add scipy>=1.15.0 for numpy 2.x compatibility Scipy versions < 1.15.0 have C extensions built against numpy 1.x and are incompatible with numpy 2.x, causing dtype size errors. * update uv lock * Update Python support to 3.11-3.13 with compatible dependencies - Set Python version range to 3.11-3.13 (removed 3.14 support) - Updated CI workflows: single-version jobs use 3.13, matrix jobs use 3.11 and 3.13 - Dependencies optimized for Python 3.13 compatibility: - pyarrow~=22.0 (has Python 3.13 wheels) - numpy~=1.26 - pandas~=2.2 - blis~=1.0 - fastuuid~=0.13 - Applied stricter version constraints using ~= operator throughout - Updated uv.lock with resolved dependencies * Update numpy to 2.1+ and pandas to 2.3+ for Python 3.13 Windows compatibility Numpy 1.26.x causes access violations on Python 3.13 Windows. Numpy 2.1+ has proper Python 3.13 support with Windows wheels. Pandas 2.3+ is required for numpy 2.x compatibility. * update vsts.yml python version * Add GraphRAG Cache package. (#2153) * Add GraphRAG Cache package. * Fix a bunch of module comments and function visibility (#2154) * Issue #2004 fix (#2159) * fix issue #2004 using KeenhoChu idea in his PR * add unit test for dynamic community selection * add unit test for dynamic community selection implementing #2158 logic --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Mismatch between header in community report generation prompt examples and input data (id vs human_readable_id) (#2161) * fix issue #860 for mismatch in prompts and input * fix format --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Chunker factory (#2156) * Delete NoopTextSplitter * Delete unused check_token_limit * Add base chunking factory and migrate workflow to use it * Split apart chunker module * Co-locate chunking/splitting * Collapse token splitting functionality into one class/function * Restore create_base_text_units parameterization * Move Tokenizer base class to common package * Move pre-pending into chunkers * Streamline config * Fix defaults construction * Add prepending tests * Remove chunk_size_includes_metadata config * Revert ChunkingDocument interface * Move metadata prepending to a util * Move Tokenizer back to GR core * Fix tokenizer removal from chunker * Set defaults for chunking config * Move chunking to monorepo package * Format * Typo * Add ChunkResult model * Streamline chunking config * Add missing version updates for graphrag_chunking * Input factory (#2168) * Update input factory to match other factories * Move input config alongside input readers * Move file pattern logic into InputReader * Set encoding default * Clean up optional column configs * Combine structured data extraction * Remove pandas from input loading * Throw if empty documents * Add json lines (jsonl) input support * Store raw data * Fix merge imports * Move metadata handling entirely to chunking * Nicer automatic title * Typo * Add get_property utility for nested dictionary access with dot notation * Update structured_file_reader to use get_property utility * Extract input module into new graphrag-input monorepo package - Create new graphrag-input package with input loading utilities - Move InputConfig, InputFileType, InputReader, TextDocument, and file readers (CSV, JSON, JSONL, Text) - Add get_property utility for nested dictionary access with dot notation - Include hashing utility for document ID generation - Update all imports throughout codebase to use graphrag_input - Add package to workspace configuration and release tasks - Remove old graphrag.index.input module * Rename ChunkResult to TextChunk and add transformer support - Rename chunk_result.py to text_chunk.py with ChunkResult -> TextChunk - Add 'original' field to TextChunk to track pre-transform text - Add optional transform callback to chunker.chunk() method - Add add_metadata transformer for prepending metadata to chunks - Update create_chunk_results to apply transforms and populate original - Update sentence_chunker and token_chunker with transform support - Refactor create_base_text_units to use new transformer pattern - Rename pluck_metadata to get/collect methods on TextDocument * Back-compat comment * Align input config type name with other factory configs * Add MarkItDown support * Remove pattern default from MarkItDown reader * Remove plugins flag (implicit disabled) * Format * Update verb tests * Separate storage from input config * Add empty objects for NaN raw_data * Fix smoke tests * Fix BOM in csv smoke * Format * DRIFT fixes (#2171) * Use stable ids for community reports * Remove deprecated title from embedding flow * Remove embedding column from df loaders * Fix lancedb insertion * Add drift back to smoke tests * Fix mock embedder to match default embedding length * Fix DRIFT notebook * Push drift_k_followups through to prompt * Format * Vector package (#2172) * Extract graphrag-vectors package * Simplify vector factory usage and config defaults * Update factory integ initializers * Fix mock patch * Format * Register vector stores in tests * Set a default vector store name * Update vector readme * Remove impls from init * Move some validation into impls * Remove index_prefix * Move duplicate method to base class * Fix smoke vector config * Update index bug (#2173) * fix update index bug * blob storage bug fix --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Add GraphRAG LLM package. (#2174) * Update documentation for v3 release (#2176) update documentation for v3 release Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Graphrag llm cleanup (#2181) * Migration update (#2180) * fix formatting. --------- Co-authored-by: Nathan Evans <github@talkswithnumbers.com> Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com> Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com>
Add GraphRAG LLM package.