Merged
220 changes: 166 additions & 54 deletions .github/workflows/bot-detection.yml
@@ -1,10 +1,9 @@
name: Bot Detection
Copilot AI Feb 12, 2026

The workflow description line with DOI reference has been removed. This removes valuable academic attribution and context about the bot detection methodology. Consider keeping this reference as a comment within the workflow or in accompanying documentation, as it provides scientific credibility and helps maintainers understand the underlying detection approach.

Suggested change:
  name: Bot Detection
+ # Bot detection heuristics used in this workflow are based on published research.
+ # For the full methodology and DOI reference, see docs/bot-detection-methodology.md.

description: "Detect potential bots by analyzing comment similarity. DOI: https://doi.org/10.1145/3387940.3391503"

on:
workflow_dispatch:
schedule:
- cron: "17 3 * * *" # daily
- cron: "0 * * * *"

permissions:
contents: read
@@ -19,47 +18,94 @@ jobs:
uses: actions/github-script@v7
with:
script: |
const DAYS_BACK = 3;
const MAX_PR = 200;
const MIN_ACCOUNT_AGE_DAYS = 7;
const HOURS_BACK = 6;
const MAX_PR = 50;
Copilot AI Feb 12, 2026

The MAX_PR limit has been reduced from 200 to 50, which is a 75% reduction in the number of PRs scanned. Combined with the 6-hour time window, this significantly limits the scope of bot detection:

Impact: In active repositories, 50 PRs might only represent a small fraction of activity in a 6-hour window. Bots operating on PRs outside this limit won't be detected.

Consider whether 50 PRs is sufficient for your repository's activity level. If the repository receives more than 50 PR updates in 6 hours, you should increase this limit or implement a more sophisticated filtering strategy (e.g., prioritize newly created PRs over updated ones).

This issue also appears in the following locations of the same file:

  • line 6
  • line 21

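The prioritization strategy the comment suggests could be sketched as a small helper — a hypothetical `prioritisePrs` (not code from this PR) that scans newly created PRs before recently updated ones, deduplicating by number and capping at the scan budget:

```javascript
// Hypothetical sketch: prefer newly created PRs over recently updated
// ones when the scan budget (MAX_PR) is smaller than repo activity.
// `createdDesc` and `updatedDesc` are PR arrays as returned by
// pulls.list with sort=created and sort=updated respectively.
function prioritisePrs(createdDesc, updatedDesc, max) {
  const seen = new Set();
  const out = [];
  for (const pr of [...createdDesc, ...updatedDesc]) {
    if (seen.has(pr.number)) continue; // a PR can appear in both lists
    seen.add(pr.number);
    out.push(pr);
    if (out.length >= max) break;      // respect the scan budget
  }
  return out;
}
```

Newly created PRs then always make it into the scan window, and updated PRs only fill whatever budget remains.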
const MIN_ACCOUNT_AGE_DAYS = 14;

const cutoff = new Date(Date.now() - DAYS_BACK * 24 * 60 * 60 * 1000);
const cutoff = new Date(Date.now() - HOURS_BACK * 60 * 60 * 1000);

console.log(`🔍 Scanning for new accounts created in last ${MIN_ACCOUNT_AGE_DAYS} days...`);
console.log(`📊 Checking ${MAX_PR} most recent PRs...`);
const fs = require('fs');
function appendSummary(markdown) {
const summaryPath = process.env.GITHUB_STEP_SUMMARY;
if (!summaryPath) return;
fs.appendFileSync(summaryPath, `${markdown}\n`);
}

// Fetch recent PRs
const { data: prs } = await github.rest.pulls.list({
owner: context.repo.owner,
repo: context.repo.repo,
state: 'all',
sort: 'updated',
direction: 'desc',
per_page: 100,
});
// Fetch recent PRs (up to MAX_PR)
const prs = [];
if (github.paginate?.iterator) {
for await (const response of github.paginate.iterator(github.rest.pulls.list, {
owner: context.repo.owner,
repo: context.repo.repo,
state: 'all',
sort: 'updated',
direction: 'desc',
per_page: 100,
})) {
prs.push(...response.data);
if (prs.length >= MAX_PR) break;
}
} else {
const { data } = await github.rest.pulls.list({
owner: context.repo.owner,
repo: context.repo.repo,
state: 'all',
sort: 'updated',
direction: 'desc',
per_page: Math.min(100, MAX_PR),
});
prs.push(...data);
}

const highRiskAccounts = new Map();
const commentsByUser = new Map();
const userCreatedDates = new Map();

console.log(`\n📝 Fetching comments from ${Math.min(prs.length, MAX_PR)} PRs...`);

for (const pr of prs.slice(0, MAX_PR)) {
if (new Date(pr.updated_at) < cutoff) continue;

const { data: issueComments } = await github.rest.issues.listComments({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: pr.number,
});
const issueComments = [];
if (github.paginate?.iterator) {
for await (const response of github.paginate.iterator(github.rest.issues.listComments, {
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: pr.number,
per_page: 100,
})) {
issueComments.push(...response.data);
}
} else {
const { data } = await github.rest.issues.listComments({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: pr.number,
per_page: 100,
});
issueComments.push(...data);
}
Comment on lines +67 to +85
Copilot AI Feb 12, 2026

The pagination logic for issue comments and review comments fetches all pages without any limit. For PRs with thousands of comments, this could:

  1. Cause performance issues and long execution times
  2. Significantly increase API usage
  3. Potentially hit API rate limits when combined with hourly execution

Consider adding a reasonable limit on the number of comments fetched per PR (e.g., first 500 comments) to balance thoroughness with performance. This is especially important given the hourly execution schedule.

This issue also appears on line 87 of the same file.

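A capped pagination loop along the lines the comment suggests could look like this hypothetical helper; `collectCapped`, `MAX_COMMENTS_PER_PR`, and the fake page generator are our own names for illustration, with the async generator standing in for `github.paginate.iterator(...)`:

```javascript
// Hypothetical sketch: drain an async page iterator, but stop once a
// fixed budget of items has been collected instead of fetching every page.
async function collectCapped(pageIterator, max) {
  const items = [];
  for await (const page of pageIterator) {
    items.push(...page);            // a "page" here is an array of comments
    if (items.length >= max) break; // stop paginating once the cap is hit
  }
  return items.slice(0, max);
}

// Stand-in for github.paginate.iterator(...): yields pages of fake comments.
async function* fakePages(totalPages, pageSize) {
  for (let p = 0; p < totalPages; p++) {
    yield Array.from({ length: pageSize }, (_, i) => ({ id: p * pageSize + i }));
  }
}

const MAX_COMMENTS_PER_PR = 500; // assumed budget, per the review comment
collectCapped(fakePages(50, 100), MAX_COMMENTS_PER_PR)
  .then(comments => console.log(comments.length)); // 500
```

Breaking out of the `for await` loop stops further page requests, so a 5,000-comment PR costs at most five API calls instead of fifty.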

const { data: reviewComments } = await github.rest.pulls.listReviewComments({
owner: context.repo.owner,
repo: context.repo.repo,
pull_number: pr.number,
});
const reviewComments = [];
if (github.paginate?.iterator) {
for await (const response of github.paginate.iterator(github.rest.pulls.listReviewComments, {
owner: context.repo.owner,
repo: context.repo.repo,
pull_number: pr.number,
per_page: 100,
})) {
reviewComments.push(...response.data);
}
} else {
const { data } = await github.rest.pulls.listReviewComments({
owner: context.repo.owner,
repo: context.repo.repo,
pull_number: pr.number,
per_page: 100,
});
reviewComments.push(...data);
}

for (const comment of [...issueComments, ...reviewComments]) {
if (new Date(comment.created_at) < cutoff) continue;
const login = comment.user?.login;
if (!login) continue;

@@ -100,31 +146,32 @@ jobs:
}

if (highRiskAccounts.size === 0) {
console.log('\n✅ No high-risk accounts detected. Skipping report.');
appendSummary(`✅ Bot Detection: no new accounts (<${MIN_ACCOUNT_AGE_DAYS}d) found in last ${HOURS_BACK}h.`);
return;
}

console.log(`\n🚨 Found ${highRiskAccounts.size} high-risk account(s)`);

// Fetch additional activity for high-risk accounts
for (const [login, data] of highRiskAccounts) {
console.log(` 📊 Fetching activity for @${login}...`);

try {
const { data: issues } = await github.rest.issues.listByRepo({
const { data: issues } = await github.rest.issues.listForRepo({
owner: context.repo.owner,
repo: context.repo.repo,
creator: login,
state: 'all',
});
data.issues = issues.map(i => ({
number: i.number,
title: i.title,
created_at: i.created_at,
html_url: i.html_url,
}));
data.issues = issues
.filter(i => !i.pull_request)
.filter(i => new Date(i.created_at) >= cutoff)
.map(i => ({
number: i.number,
title: i.title,
state: i.state,
created_at: i.created_at,
html_url: i.html_url,
}));
} catch (e) {
console.log(` ⚠️ Could not fetch issues for ${login}`);
console.log(`Could not fetch issues for ${login}`);
}

try {
@@ -135,21 +182,41 @@ jobs:
per_page: 100,
});
data.prs = prList
.filter(p => p.user?.login === login)
.filter(p => p.user?.login === login && new Date(p.created_at) >= cutoff)
.map(p => ({
number: p.number,
title: p.title,
state: p.state,
created_at: p.created_at,
html_url: p.html_url,
}));
Comment on lines 177 to 192
Copilot AI Feb 12, 2026

The PR fetching logic fetches all PRs in the repository with only per_page: 100, then client-side filters for the specific user. This approach has two problems:

  1. It only fetches the first 100 PRs total (not paginated), so it will miss PRs from the user if they're not in the most recent 100 PRs
  2. It's inefficient - fetching all PRs when you only need one user's PRs

The API doesn't support filtering by creator for pulls.list, but you could use the Search API instead: github.rest.search.issuesAndPullRequests with query repo:owner/repo type:pr author:login which supports pagination and filtering server-side. Alternatively, if the user typically has few PRs, consider using github.rest.pulls.list with pagination and filtering client-side, but fetch more than 100 results.

This issue also appears on line 263 of the same file.

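The Search API route the comment describes might be sketched as follows; `buildAuthorPrQuery` is a hypothetical helper of ours, and the `search.issuesAndPullRequests` call is shown commented out because it needs a live Octokit client:

```javascript
// Hypothetical helper: build a Search API query that filters PRs by
// author server-side, instead of listing all repo PRs and filtering locally.
function buildAuthorPrQuery(owner, repo, login) {
  return `repo:${owner}/${repo} type:pr author:${login}`;
}

console.log(buildAuthorPrQuery('octo-org', 'octo-repo', 'some-login'));
// repo:octo-org/octo-repo type:pr author:some-login

// Inside the workflow step it could then be used roughly as:
// const { data } = await github.rest.search.issuesAndPullRequests({
//   q: buildAuthorPrQuery(context.repo.owner, context.repo.repo, login),
//   per_page: 100,
// });
// data.items would then hold only that user's PRs, paginated server-side.
```
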
} catch (e) {
console.log(` ⚠️ Could not fetch PRs for ${login}`);
console.log(`Could not fetch PRs for ${login}`);
}
}

// Skip alerting if everything found is already closed.
let hasAnyOpenItem = false;
for (const [, data] of highRiskAccounts) {
if (data.issues?.some(i => i.state === 'open')) {
hasAnyOpenItem = true;
break;
}
if (data.prs?.some(p => p.state === 'open')) {
hasAnyOpenItem = true;
break;
}
}

if (!hasAnyOpenItem) {
console.log('No open issues or PRs from new accounts; skipping alert issue.');
appendSummary('Bot Detection: flagged new accounts, but all related issues/PRs are closed. No alert issue created.');
return;
}
Comment on lines +198 to +215
Copilot AI Feb 12, 2026

The logic to skip alerting when all issues/PRs are closed (lines 198-215) is a good improvement that reduces alert fatigue. However, consider that accounts with closed items might still warrant investigation - a bot that quickly creates and closes spam issues could bypass detection with this logic.

Consider adding a threshold or alternative check, such as:

  1. Only skip if items were closed by the account owner themselves (not by moderators)
  2. Only skip if items have been closed for more than a certain time period
  3. Track the closure patterns in the report even if not creating an alert

This would help identify sophisticated bots that clean up after themselves.

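The stricter skip conditions listed above could be combined into a single predicate. `safeToSkip` and its flat field names (`closed_by`, `author`, `closed_at`) are simplifying assumptions about the data shape — the real GitHub payloads nest these as user objects — but the logic of the check is what matters:

```javascript
// Hypothetical predicate: only skip alerting when every flagged item is
// closed, was closed by its own author (not a moderator), and has been
// closed for at least `minClosedMs`. Field names are simplified assumptions.
function safeToSkip(items, nowMs, minClosedMs) {
  return items.every(i =>
    i.state === 'closed' &&
    i.closed_by === i.author &&                     // self-closed, not moderated
    nowMs - Date.parse(i.closed_at) >= minClosedMs  // closed long enough ago
  );
}
```

Anything moderator-closed or recently closed would then still trigger an alert, catching bots that clean up after themselves.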

// Build report
const today = new Date().toISOString().split('T')[0];
body = `Recently-created accounts often indicate bots, spam accounts, or coordinated attacks.\n\n`;
let body = `Recently-created accounts often indicate bots, spam accounts, or coordinated attacks.\n\n`;

const sorted = Array.from(highRiskAccounts.entries()).sort((a, b) => a[1].daysOld - b[1].daysOld);

@@ -185,17 +252,62 @@ jobs:
}

if (!data.issues?.length && !data.prs?.length && !data.comments?.length) {
body += `*(No issues, PRs, or comments in the last ${DAYS_BACK} days)*\n\n`;
body += `*(No issues, PRs, or comments in the last ${HOURS_BACK} hours)*\n\n`;
}
}

console.log('\n📤 Creating security alert issue...');
await github.rest.issues.create({
owner: context.repo.owner,
repo: context.repo.repo,
title: `🚨 HIGH RISK: Brand New Accounts — ${today}`,
body,
labels: ['security', 'bot-detection'],
});
console.log('\nCreating security alert issue...');
const title = `🚨 HIGH RISK: Brand New Accounts — ${today}`;
let existingIssueNumber;
Comment on lines +260 to +261
Copilot AI Feb 12, 2026

Inconsistent indentation: lines 260-261 are indented with additional spaces compared to line 259. These lines should align with the same indentation level as line 259 since they are all part of the same code block. This creates inconsistent formatting and could be confusing for maintainers.

Suggested change (dedent to match the indentation of line 259):
- const title = `🚨 HIGH RISK: Brand New Accounts — ${today}`;
- let existingIssueNumber;
+ const title = `🚨 HIGH RISK: Brand New Accounts — ${today}`;
+ let existingIssueNumber;


console.log('✅ Report created successfully');
try {
const { data: existingIssues } = await github.rest.issues.listForRepo({
owner: context.repo.owner,
repo: context.repo.repo,
state: 'open',
per_page: 100,
});

const existing = existingIssues.find(i => i.title === title);
if (existing?.number) {
existingIssueNumber = existing.number;
}
} catch (e) {
// If listing issues fails, fall back to creating a new issue.
}

try {
if (existingIssueNumber) {
await github.rest.issues.update({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: existingIssueNumber,
body,
});
} else {
await github.rest.issues.create({
owner: context.repo.owner,
repo: context.repo.repo,
title,
body,
labels: ['security', 'bot-detection'],
});
}
} catch (e) {
console.log('Issue create/update with labels failed; retrying without labels...');
if (existingIssueNumber) {
await github.rest.issues.update({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: existingIssueNumber,
body,
});
} else {
await github.rest.issues.create({
owner: context.repo.owner,
repo: context.repo.repo,
title,
body,
});
}
}
Comment on lines +296 to +313
Copilot AI Feb 12, 2026

The error handling in the retry logic (catch block at line 296) doesn't prevent errors from propagating. If the retry attempt also fails, the error will be unhandled and could crash the workflow. Consider:

  1. Wrapping the retry attempts in their own try-catch blocks
  2. Logging the final error if all retry attempts fail
  3. Using appendSummary to report the failure in the workflow summary

For example, the inner attempts (lines 299-311) should have their own try-catch to ensure failures are logged properly.

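The layered error handling suggested above could be sketched as a small wrapper; `tryWithFallback` is a hypothetical name of ours, and in the workflow `onFailure` would typically call `appendSummary` so the failure surfaces in the step summary instead of crashing the step:

```javascript
// Hypothetical wrapper: run the primary attempt, retry once with a
// fallback, and report (rather than throw) if both fail, so the
// workflow step survives.
async function tryWithFallback(primary, fallback, onFailure) {
  try {
    return await primary();
  } catch (primaryErr) {
    try {
      return await fallback();
    } catch (fallbackErr) {
      // e.g. appendSummary(`Bot Detection: alert failed: ${fallbackErr.message}`)
      onFailure(fallbackErr);
      return undefined;
    }
  }
}
```

The labeled `issues.create`/`issues.update` call would go in `primary` and the label-free retry in `fallback`, so a second failure is logged once instead of propagating unhandled.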