This project implements a Java code analyzer that uses Retrieval-Augmented Generation (RAG) techniques combined with the Ollama language model to provide intelligent responses to queries about Java codebases.
- Parses and indexes Java source code from a specified repository
- Uses Tree-sitter for efficient code parsing
- Implements a RAG (Retrieval-Augmented Generation) system for context-aware code understanding
- Integrates with Ollama for generating human-like responses to code-related queries
- Includes caching mechanisms for improved performance on repeated analyses
- Optimized for handling large codebases with multi-processing capabilities
Before you begin, ensure you have met the following requirements:
- Python 3.7+
- Ollama installed and running locally with the `llama3.1` model
- Git (for cloning the repository)
- Clone the repository:

      git clone https://github.com/sydowma/ragCodeKnowledge.git
      cd ragCodeKnowledge

- Install the required Python packages:

      pip install -r requirements.txt

- Ensure you have the Tree-sitter Java grammar:

      git clone https://github.com/tree-sitter/tree-sitter-java.git
- Update `repo_path` in the script to point to your Java repository:

      repo_path = "/path/to/your/java/repository"

- Run the script:

      python optimized_rag_java_analyzer.py

- When prompted, enter your query about the Java codebase. For example:

      Enter your query (or 'quit' to exit): what is xxxx?

- The system will retrieve relevant code snippets and generate a response using Ollama.
- To exit, type 'quit' when prompted for a query.
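The prompt-and-quit behaviour described in the usage steps can be sketched as a small loop. This is an illustration only, not the script's actual code: `run_query_loop` and its `answer`, `read`, and `write` parameters are hypothetical names, with `answer` standing in for whatever function retrieves snippets and calls Ollama.

```python
def run_query_loop(answer, read=input, write=print):
    """Prompt until the user types 'quit'; pass other non-empty queries to answer().

    `answer` is a hypothetical callable (query -> response text).
    Returns the number of queries that were answered.
    """
    handled = 0
    while True:
        query = read("Enter your query (or 'quit' to exit): ").strip()
        if query.lower() == "quit":
            break
        if not query:
            # Ignore empty input and prompt again.
            continue
        write(answer(query))
        handled += 1
    return handled
```

Passing `read` and `write` as parameters keeps the loop easy to exercise without a terminal; in interactive use the defaults (`input` and `print`) are enough.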
- Code Indexing: The system parses Java files in the specified repository using Tree-sitter and creates an index of code snippets.
- Caching: Indexed data is cached for faster subsequent runs. The cache is invalidated if the repository content changes.
- Query Processing: When a query is received, the system retrieves relevant code snippets using semantic similarity search.
- Response Generation: Relevant snippets are sent to Ollama along with the query to generate a contextualized response.
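The caching and query-processing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `repo_fingerprint`, `embed`, `cosine`, and `retrieve` are hypothetical names, the cache key is assumed to be a hash of the repository's `.java` files, and a toy bag-of-words vector stands in for whatever embedding model the script actually uses.

```python
import hashlib
import math
import re
from collections import Counter
from pathlib import Path


def repo_fingerprint(repo_path):
    """Hash the relative path and contents of every .java file under repo_path.

    Assumption: a cached index is stored alongside this digest, and a
    different digest on the next run means the cache is stale.
    """
    digest = hashlib.sha256()
    root = Path(repo_path)
    for path in sorted(root.rglob("*.java")):
        if not path.is_file():
            continue
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def embed(text):
    """Toy embedding: lowercase token counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, snippets, k=5):
    """Return the k snippets most similar to the query, best match first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(s)), s) for s in snippets]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [snippet for _, snippet in scored[:k]]
```

Hashing file contents rather than modification times means that re-cloning an unchanged repository would still hit the cache, at the cost of reading every file on startup.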
You can modify the following parameters in the script:
- `JAVA_LANGUAGE_PATH`: Path to the Tree-sitter Java language file
- `k` in the `query_code` function: Number of relevant snippets to retrieve (default is 5)
- Ollama model in the `query_ollama` function (default is "deepseek-coder-v2")
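To show where the model parameter might plug in, here is a hedged sketch of a request to a local Ollama server. It assumes Ollama's default local endpoint (`http://localhost:11434/api/generate`) and uses only the standard library; the function bodies, the `build_prompt` and `build_request` helpers, the prompt wording, and the `timeout` parameter are illustrative assumptions rather than the script's actual code.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_prompt(query, snippets):
    """Combine retrieved snippets and the user query into one prompt (wording is illustrative)."""
    context = "\n\n".join(
        f"Snippet {i}:\n{s}" for i, s in enumerate(snippets, start=1)
    )
    return (
        "Use the following Java code to answer the question.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )


def build_request(prompt, model="deepseek-coder-v2"):
    """Build a non-streaming generate request for the given model."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query_ollama(prompt, model="deepseek-coder-v2", timeout=120):
    """Send the prompt to Ollama and return the generated text.

    Requires a running Ollama server with the named model already pulled.
    """
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

Switching models then amounts to passing a different name, for example `query_ollama(prompt, model="llama3.1")`, provided that model is available locally.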
Possible future improvements:
- Implement incremental updates for the code index
- Add support for other programming languages
- Improve query understanding with more advanced NLP techniques
- Integrate with IDEs or code editors for seamless usage
Contributions to improve the Java Code Analyzer are welcome. Please feel free to submit a Pull Request.
If you have any questions or feedback, please open an issue in the GitHub repository.