Add popular-crate BC dataset collector with per-BC llvm.reserved linkage #6

Merged
Coursant merged 3 commits into main from
copilot/download-crate-sources-generate-bc-json
Mar 23, 2026

Copilot AI commented Mar 23, 2026

This PR adds a Python-based data collection pipeline for building a bounds-check (BC) dataset from popular open-source crates. It automates crate download + rapx -O execution, then emits dataset rows where each BC is linked to its corresponding llvm.reserved marker (or explicitly marked unmatched).

  • New dataset collection script

    • Added collect_popular_crates_bc_dataset.py at repo root.
    • Pulls top crates from crates.io by downloads (--top-n).
    • Downloads/extracts crate source, runs cargo +<toolchain> rapx -O -- --locked (with fallback cargo rapx -O).
    • Locates generated bounds-check JSON (bounds_checks*.json) and copies raw JSON per crate.
  • BC ↔ LLVM reserved mapping in dataset rows

    • Builds bc_dataset.jsonl with one row per BC entry.
    • Matching strategy:
      • First: ID-based keys (llvm_reserved_id / reserved_id / llvm_id / marker_id / id)
      • Then: file+line matching (file/filename + line/line_no)
    • If no match is found, row is retained with:
      • llvm_reserved: null
      • llvm_reserved_matched: false
    • This avoids synthetic/incorrect pairings while preserving full BC coverage.
  • Output artifacts + run manifest

    • raw_json/*.json: raw per-crate BC JSON
    • bc_dataset.jsonl: final dataset
    • manifest.json: per-crate status/counts (bc_count, reserved_count, dataset_rows, unmatched_rows, failure states)
  • Docs update

    • Updated README.md with usage, output layout, and the llvm_reserved_matched semantics.
  • Extraction hardening

    • Added tar extraction path validation to reject unsafe archive entries before unpacking.
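The two-stage matching strategy listed above (ID keys first, then file and line) can be sketched roughly as follows. This is a minimal illustration and not the code from collect_popular_crates_bc_dataset.py: the function names `first_present` and `link_bounds_checks`, the `bc` field in the output row, and the assumption that BC entries and llvm.reserved markers are flat dictionaries with scalar values are all hypothetical. Only the candidate key names and the `llvm_reserved` / `llvm_reserved_matched` fields come from the PR description.

```python
from typing import Any, Optional

# Candidate key names as listed in the PR description; the record layout is assumed.
ID_KEYS = ("llvm_reserved_id", "reserved_id", "llvm_id", "marker_id", "id")
FILE_KEYS = ("file", "filename")
LINE_KEYS = ("line", "line_no")


def first_present(record: dict, keys: tuple) -> Optional[Any]:
    """Return the first non-None value stored under any of the given keys."""
    for key in keys:
        value = record.get(key)
        if value is not None:
            return value
    return None


def link_bounds_checks(bcs: list, reserved: list) -> list:
    """Pair each BC entry with a reserved marker, or keep it as unmatched."""
    by_id = {}
    by_location = {}
    for marker in reserved:
        marker_id = first_present(marker, ID_KEYS)
        if marker_id is not None:
            by_id.setdefault(marker_id, marker)
        file_name = first_present(marker, FILE_KEYS)
        line_no = first_present(marker, LINE_KEYS)
        if file_name is not None and line_no is not None:
            by_location.setdefault((file_name, line_no), marker)

    rows = []
    for bc in bcs:
        match = None
        # Stage 1: ID-based lookup.
        bc_id = first_present(bc, ID_KEYS)
        if bc_id is not None:
            match = by_id.get(bc_id)
        # Stage 2: fall back to file + line when no ID match was found.
        if match is None:
            file_name = first_present(bc, FILE_KEYS)
            line_no = first_present(bc, LINE_KEYS)
            if file_name is not None and line_no is not None:
                match = by_location.get((file_name, line_no))
        # Unmatched entries are kept rather than dropped or paired by guesswork.
        rows.append({
            "bc": bc,  # hypothetical field name for the BC payload
            "llvm_reserved": match,
            "llvm_reserved_matched": match is not None,
        })
    return rows
```

In this sketch an ID recorded as the string "7" on one side and the integer 7 on the other would not match; whether the real script normalizes such values is not stated in the PR.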

Example usage:

python3 collect_popular_crates_bc_dataset.py \
  --top-n 10 \
  --output-dir dataset_bc \
  --toolchain nightly-2025-12-06
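
After a run like the one above, the share of unmatched rows can be checked with a few lines of Python. This is only a sketch: `summarize_dataset` is a hypothetical helper, and it assumes that bc_dataset.jsonl holds one JSON object per line with the `llvm_reserved_matched` field described in the PR.

```python
import json
from pathlib import Path


def summarize_dataset(jsonl_path):
    """Count total rows and rows without a matched llvm.reserved marker."""
    total = 0
    unmatched = 0
    with Path(jsonl_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            total += 1
            # A missing field is treated as unmatched (an assumption).
            if not row.get("llvm_reserved_matched", False):
                unmatched += 1
    return {"rows": total, "unmatched": unmatched}
```

For the example invocation this might be called as `summarize_dataset("dataset_bc/bc_dataset.jsonl")`, assuming the dataset file is written directly under the output directory.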

Copilot AI and others added 2 commits March 23, 2026 08:17
Copilot AI changed the title from "[WIP] Add script to download popular crate sources and generate bc json" to "Add popular-crate BC dataset collector with per-BC llvm.reserved linkage" Mar 23, 2026
Copilot AI requested a review from Coursant March 23, 2026 08:24
Coursant (Owner) left a comment

bc dataset

Coursant marked this pull request as ready for review March 23, 2026 08:25
Coursant merged commit 18b4be5 into main Mar 23, 2026
Copilot stopped work on behalf of Coursant due to an error March 23, 2026 08:29
Coursant deleted the copilot/download-crate-sources-generate-bc-json branch March 23, 2026 08:29