PR Comment Analyzer

A data-driven tool that uses an LLM to analyze pull request comments from popular open-source repositories, extract structured insights, and visualize patterns in code review feedback.

View the notebook here: PR Comment Analysis Notebook

Features

Fetches PR comments from GitHub repositories using the GitHub API
Uses an LLM to extract structured insights from review comments. The insights can be used for model training, fine-tuning, and similar purposes. They are stored in CSV format with the following columns:
- Issue
- Code Before
- Code After
- Lesson
- Source
- Issue Tags
- Lesson Tags
Generates comprehensive visualizations:
- Word clouds of common issues and lessons
- Treemaps showing issue and lesson hierarchies
- Bar charts of frequency distributions
- Summary statistics dashboard
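
The fetching step listed above could look roughly like the following. This is a minimal sketch, not the notebook's actual code: the function names `comment_endpoints` and `fetch_comments`, the `http_get` parameter, and the returned dictionary keys are all hypothetical. It assumes the two repository-wide GitHub REST endpoints for issue comments and pull request review comments, reads only the first page of results, and defaults to `requests.get` (one of the listed dependencies).

```python
from typing import Callable, Dict, List, Optional

API_ROOT = "https://api.github.com"


def comment_endpoints(owner: str, repo: str) -> Dict[str, str]:
    """Map each Source value to the repository-wide endpoint that lists it."""
    base = f"{API_ROOT}/repos/{owner}/{repo}"
    return {
        # Conversation comments; GitHub treats PRs as issues, so PR threads are included.
        "issue_comment": f"{base}/issues/comments",
        # Inline comments left on diffs during code review.
        "review_comment": f"{base}/pulls/comments",
    }


def fetch_comments(
    owner: str,
    repo: str,
    token: Optional[str] = None,
    http_get: Optional[Callable] = None,  # hypothetical hook, handy for offline testing
    per_page: int = 100,
) -> List[dict]:
    """Fetch the first page of issue and review comments for one repository."""
    if http_get is None:
        import requests  # imported lazily so the sketch loads without it

        http_get = requests.get

    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"

    rows = []
    for source, url in comment_endpoints(owner, repo).items():
        response = http_get(url, headers=headers, params={"per_page": per_page}, timeout=30)
        response.raise_for_status()
        for comment in response.json():
            rows.append(
                {
                    "source": source,
                    "body": comment.get("body", ""),
                    "url": comment.get("html_url", ""),
                }
            )
    # Pagination (following the Link header) is deliberately left out of this sketch.
    return rows
```

Keeping the `source` label next to each comment body mirrors the Source column of the CSV, so the later extraction step can carry it through unchanged.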

Data Structure

The tool extracts structured insights from PR comments and stores them in CSV format. Here are example entries from the dataset:

| Issue | Code Before | Code After | Lesson | Source | Issue Tags | Lesson Tags |
| --- | --- | --- | --- | --- | --- | --- |
| Missing header include for std::iota | `// device_util_id.h` `// Uses std::iota but missing <numeric>` | `#include <numeric>` | Always include necessary standard headers when using library functions to ensure portability across platforms. | issue_comment | `['missing_include', 'portability', 'build_error']` | `['defensive_programming', 'portability']` |
| Register method not publicly accessible | `class CopyTensor {` `protected:` `static Status Register(...);` `private:` `// ...` `};` | `class CopyTensor {` `public:` `// Dynamic registered devices use this function` `static void DynamicRegister(...);` `static Status Register(...);` `};` | Ensure methods intended for external or dynamic use have appropriate visibility (public) to prevent access errors. | review_comment | `['access_modifier', 'api_design']` | `['api_design', 'code_accessibility']` |
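
The tag columns in the example rows look like Python list literals, so they need to be parsed before they can be counted. The sketch below shows one way to do that using only the standard library. It assumes the CSV stores the tags as text such as `['a', 'b']`; the helper names `parse_tags` and `count_tags` are hypothetical, and the inline sample stands in for `pr_insights.csv`, reduced to three columns.

```python
import ast
import csv
import io
from collections import Counter

# Two rows taken from the example table above, reduced to the columns used here.
SAMPLE_CSV = '''Issue,Issue Tags,Lesson Tags
Missing header include for std::iota,"['missing_include', 'portability', 'build_error']","['defensive_programming', 'portability']"
Register method not publicly accessible,"['access_modifier', 'api_design']","['api_design', 'code_accessibility']"
'''


def parse_tags(cell):
    """Turn a cell like "['a', 'b']" into a list of strings; empty cells give []."""
    cell = (cell or "").strip().strip("`")
    if not cell:
        return []
    # literal_eval only accepts Python literals, so arbitrary code is never executed.
    return [str(tag) for tag in ast.literal_eval(cell)]


def count_tags(rows, column):
    """Count how often each tag appears in the given column."""
    counts = Counter()
    for row in rows:
        counts.update(parse_tags(row.get(column)))
    return counts


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
issue_counts = count_tags(rows, "Issue Tags")
lesson_counts = count_tags(rows, "Lesson Tags")
print(issue_counts.most_common(3))
print(lesson_counts.most_common(3))
```

Counts like these are the kind of input a frequency bar chart or word cloud would need.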

Each row captures:

- Issue: The problem identified in the code review
- Code Before / Code After: The code before and after the change made in response
- Lesson: The key learning extracted from the review
- Source: Whether the insight came from an issue comment (`issue_comment`) or a review comment (`review_comment`)
- Issue Tags / Lesson Tags: Categorization of the issue and of the lesson learned
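
The row layout described above can be written down as a small typed record. The following is an illustrative sketch using a standard-library dataclass. Since `pydantic` is among the dependencies, the notebook may define a similar model with it, but that is an assumption. The class name `Insight`, its field names, and the `to_csv_row` method are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The two Source values that appear in the example rows.
ALLOWED_SOURCES = ("issue_comment", "review_comment")


@dataclass
class Insight:
    """One row of the insights CSV (hypothetical field names)."""

    issue: str
    code_before: str
    code_after: str
    lesson: str
    source: str
    issue_tags: List[str] = field(default_factory=list)
    lesson_tags: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Reject rows whose source is neither of the two known comment types.
        if self.source not in ALLOWED_SOURCES:
            raise ValueError(f"unexpected source: {self.source!r}")

    def to_csv_row(self) -> Dict[str, str]:
        """Return a dict keyed by the CSV column names, with tags as list literals."""
        return {
            "Issue": self.issue,
            "Code Before": self.code_before,
            "Code After": self.code_after,
            "Lesson": self.lesson,
            "Source": self.source,
            "Issue Tags": repr(self.issue_tags),
            "Lesson Tags": repr(self.lesson_tags),
        }


example = Insight(
    issue="Missing header include for std::iota",
    code_before="// Uses std::iota but missing <numeric>",
    code_after="#include <numeric>",
    lesson="Always include necessary standard headers when using library functions.",
    source="issue_comment",
    issue_tags=["missing_include", "portability", "build_error"],
    lesson_tags=["defensive_programming", "portability"],
)
print(example.to_csv_row()["Issue Tags"])
```

Validating the source at construction time catches malformed LLM output before it reaches the CSV.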

Visualizations

Summary Dashboard

Issue Analysis

Lessons Learned

Performance Metrics

Setup

Install dependencies:

pip install requests pandas plotly openai pydantic wordcloud

Set up GitHub authentication:

Create a GitHub Personal Access Token
Set it as an environment variable:

export GITHUB_TOKEN='your_token_here'

Run the notebook:

Open pr_comment_section.ipynb
Run cells sequentially to fetch data and generate visualizations
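
For the authentication step above, a cell near the top of the notebook could read the token from the environment and fail early when it is missing. This is a small sketch under that assumption. The function name `github_headers` is hypothetical, and the `env` parameter exists only so the function can be exercised without touching the real environment.

```python
import os


def github_headers(env=None):
    """Build GitHub API request headers from the GITHUB_TOKEN environment variable."""
    env = os.environ if env is None else env
    token = env.get("GITHUB_TOKEN", "").strip()
    if not token:
        # Failing here gives a clearer message than an HTTP 401 later on.
        raise RuntimeError("GITHUB_TOKEN is not set; export it before running the notebook")
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
```

Calling `github_headers()` with no arguments uses the process environment, so the token never needs to be written into the notebook itself.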

