Add data scrapers for simulation-theory research topics #32

Draft
Copilot wants to merge 2 commits into main from copilot/add-scrapers-functionality

Conversation

Contributor

Copilot AI commented Feb 25, 2026

Adds a scrapers/ directory with three focused Python scrapers to collect external data relevant to the repository's research topics.

Scrapers

  • arxiv_scraper.py — queries the arXiv API for papers on the simulation hypothesis, Gödel incompleteness, the Riemann zeta function, SHA-256 hash chains, qutrit/ternary quantum systems, IIT consciousness, and the halting problem
  • wikipedia_scraper.py — fetches plain-text article summaries via the Wikipedia API for key topics (SHA-2, Riemann hypothesis, Euler's identity, fine-structure constant, Turing machine, DNA, Blockchain, etc.)
  • oeis_scraper.py — pulls sequence metadata and the first 20 terms from the OEIS for sequences relevant to the repo (primes, Fibonacci, π digits, Euler–Mascheroni constant, Catalan numbers, partition numbers)

Usage

```bash
pip install -r scrapers/requirements.txt

# arXiv — custom query, 3 results max
python scrapers/arxiv_scraper.py --query "Riemann hypothesis zeros" --max 3

# Wikipedia — custom topics
python scrapers/wikipedia_scraper.py --topics "SHA-2" "Turing machine"

# OEIS — custom sequence IDs
python scrapers/oeis_scraper.py --ids A000040 A000045 A000796

# All scrapers support --output <file.json> to persist results
```

All scrapers default to a built-in topic/sequence list aligned with the repository's research areas and write JSON to stdout. Error handling is graceful: failed fetches return empty results without crashing the run.
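That "graceful" contract can be sketched as a small wrapper (hypothetical names, not the PR's actual code): any exception from the underlying fetch maps to an empty placeholder instead of aborting the run.

```python
import sys

def scrape_all(items, fetch, empty=None):
    """Run `fetch` for each item; on failure, record `empty` and continue.

    `fetch` is any callable item -> result; `empty` is the placeholder
    recorded when a fetch fails (e.g. [] for arXiv queries).
    """
    results = {}
    for item in items:
        try:
            results[item] = fetch(item)
        except Exception as exc:  # network error, parse error, ...
            print(f"  Error fetching {item!r}: {exc}", file=sys.stderr)
            results[item] = empty
    return results

# One query succeeds, one raises; the run still completes.
def flaky_fetch(query):
    if query == "bad":
        raise RuntimeError("simulated network failure")
    return [{"title": f"paper about {query}"}]

out = scrape_all(["riemann", "bad"], flaky_fetch, empty=[])
```

The failed query ends up as an empty list in the results rather than a crash, which is the behavior the description promises.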

Also adds .gitignore covering Python build artifacts.

Warning

Firewall rules blocked me from connecting to one or more addresses.

I tried to connect to the following addresses, but was blocked by firewall rules:

  • export.arxiv.org
    • Triggering command: /usr/bin/python python arxiv_scraper.py --query simulation hypothesis computational --max 2 (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Original prompt

I want scrapers



Co-authored-by: blackboxprogramming <118287761+blackboxprogramming@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add functionality for data scrapers" to "Add data scrapers for simulation-theory research topics" on Feb 25, 2026

Copilot AI left a comment


Pull request overview

Adds a new scrapers/ module to collect external research data (arXiv papers, Wikipedia topic summaries, and OEIS sequences) aligned with the repository’s simulation-theory research themes.

Changes:

  • Introduces three standalone Python scrapers (arxiv_scraper.py, wikipedia_scraper.py, oeis_scraper.py) with CLI arguments and JSON output.
  • Adds scrapers/requirements.txt and scrapers/README.md documenting setup/usage and output formats.
  • Adds a top-level .gitignore for common Python artifacts and local environment files.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 12 comments.

Summary per file:

  • scrapers/arxiv_scraper.py — New arXiv Atom API scraper producing per-query JSON results.
  • scrapers/wikipedia_scraper.py — New Wikipedia API scraper producing topic summary JSON results.
  • scrapers/oeis_scraper.py — New OEIS JSON endpoint scraper producing sequence metadata/terms JSON results.
  • scrapers/requirements.txt — Declares Python dependencies for running the scrapers.
  • scrapers/README.md — Documents installation, usage, and output formats for the scrapers.
  • .gitignore — Adds Python build/cache ignores and a few local file patterns.


Comment on lines +59 to +61

```python
page = next(iter(pages.values()))

if "missing" in page:
```

Copilot AI Feb 25, 2026


page = next(iter(pages.values())) will raise StopIteration if the API response has no pages (e.g., unexpected response shape). Consider guarding for empty pages and returning an empty result (or raising a clearer error) to keep the scraper robust.

Suggested change

```diff
-page = next(iter(pages.values()))
-if "missing" in page:
+page = next(iter(pages.values()), None)
+if not page or "missing" in page:
```

Copilot uses AI. Check for mistakes.
```python
resp = requests.get(ARXIV_API, params=params, timeout=30)
resp.raise_for_status()

root = ET.fromstring(resp.text)
```

Copilot AI Feb 25, 2026


ET.fromstring(resp.text) can raise xml.etree.ElementTree.ParseError (e.g., partial responses, HTML error pages) but only requests.RequestException is caught. This can crash the run despite the doc/PR description claiming graceful failures; consider catching ET.ParseError (and possibly returning an empty list) in scrape or fetch_papers.

Suggested change

```diff
-root = ET.fromstring(resp.text)
+try:
+    root = ET.fromstring(resp.text)
+except ET.ParseError:
+    # Malformed or non-XML response; fail gracefully with no papers.
+    return []
```

Comment on lines +75 to +82

```python
for query in queries:
    print(f"Fetching: {query!r} …")
    try:
        results[query] = fetch_papers(query, max_results=max_per_query)
    except requests.RequestException as exc:
        print(f"  Error: {exc}")
        results[query] = []
    time.sleep(1)  # be polite to the API
```

Copilot AI Feb 25, 2026


Progress/error messages are printed to stdout during scraping, which contaminates the JSON output when --output is not provided (stdout will not be valid JSON). Route logs to stderr (or gate them behind a verbosity flag) so stdout remains pure JSON as documented.

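A sketch of that fix (assumed function names, not the PR's actual code): reserve stdout for the final JSON dump and send every progress or error line to stderr.

```python
import json
import sys

def log(msg):
    """Progress/error messages go to stderr, keeping stdout pure JSON."""
    print(msg, file=sys.stderr)

def emit(results, output_path=None):
    """Serialize results; write to a file if given, else to stdout."""
    text = json.dumps(results, indent=2)
    if output_path:
        with open(output_path, "w", encoding="utf-8") as fh:
            fh.write(text + "\n")
        log(f"Wrote {output_path}")
    else:
        print(text)  # the only write to stdout in the whole run
    return text

log("Fetching: 'Riemann hypothesis zeros' …")
emitted = emit({"queries": {"Riemann hypothesis zeros": []}})
```

With this split, `python scraper.py | jq .` works even while progress messages are printed.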

## Output format

All scrapers output JSON to stdout by default, or to a file with `--output`.

Copilot AI Feb 25, 2026


This README states that scrapers "output JSON to stdout by default", but the scripts also print progress/error lines to stdout, which makes stdout not valid JSON. Either update the scripts to log to stderr or adjust the documentation to reflect the mixed output.

Suggested change

```diff
-All scrapers output JSON to stdout by default, or to a file with `--output`.
+All scrapers emit their results as JSON. When run without `--output`, JSON is printed to stdout along with occasional human-readable progress or error messages; use `--output` to write clean JSON to a file.
```

Comment on lines +54 to +56

```python
resp = requests.get(WIKIPEDIA_API, params=params, timeout=30)
resp.raise_for_status()
data = resp.json()
```

Copilot AI Feb 25, 2026


resp.json() can raise a ValueError/JSONDecodeError if Wikipedia returns a non-JSON response (transient upstream issue, proxy, HTML error page). Right now only requests.RequestException is handled, so the scraper can still crash despite the "graceful" goal; consider catching JSON decode errors as well.

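The parsing side of that hardening can be sketched as a pure function (hypothetical shape; the PR's fetch_summary combines this with the HTTP call): a non-JSON body, an empty page set, or a missing page all collapse to the same empty stub.

```python
import json

def parse_summary(topic, body):
    """Parse a Wikipedia-style API response body defensively."""
    empty = {"topic": topic, "title": topic, "url": "", "summary": ""}
    try:
        data = json.loads(body)
    except json.JSONDecodeError:  # HTML error page, proxy banner, ...
        return empty
    pages = data.get("query", {}).get("pages", {})
    page = next(iter(pages.values()), None)  # guard against no pages
    if not page or "missing" in page:
        return empty
    return {
        "topic": topic,
        "title": page.get("title", topic),
        "url": "",
        "summary": page.get("extract", ""),
    }

ok = parse_summary("SHA-2", json.dumps(
    {"query": {"pages": {"1": {"title": "SHA-2", "extract": "A hash family."}}}}))
bad = parse_summary("SHA-2", "<html>503 Service Unavailable</html>")
```

Note that `requests` surfaces this failure via `resp.json()` raising a subclass of `ValueError`, so catching `ValueError` alongside `requests.RequestException` achieves the same effect at the call site.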
Comment on lines +76 to +82

```python
print(f"Fetching: {topic!r} …")
try:
    results.append(fetch_summary(topic))
except requests.RequestException as exc:
    print(f"  Error: {exc}")
    results.append({"topic": topic, "title": topic, "url": "", "summary": ""})
time.sleep(0.5)  # be polite
```

Copilot AI Feb 25, 2026


Progress/error messages are printed to stdout during scraping, which contaminates the JSON output when --output is not provided (the stdout stream will no longer be valid JSON). Send logs to stderr (or add a --quiet/--verbose flag) so stdout can remain machine-readable JSON as documented.

Comment on lines +49 to +53

```python
return {
    "id": oeis_id,
    "name": seq.get("name", ""),
    "description": seq.get("comment", [""])[0] if seq.get("comment") else "",
    "values": seq.get("data", "").split(",")[:20],  # first 20 terms
```

Copilot AI Feb 25, 2026


If OEIS returns no data field (or it is empty), "".split(",") yields [""], so values becomes a list containing an empty string instead of an empty list. Consider normalizing this (e.g., return [] when data is missing/blank, and strip whitespace from terms).

Suggested change

```diff
-return {
-    "id": oeis_id,
-    "name": seq.get("name", ""),
-    "description": seq.get("comment", [""])[0] if seq.get("comment") else "",
-    "values": seq.get("data", "").split(",")[:20],  # first 20 terms
+# Normalize the data field: handle missing/blank data and strip whitespace.
+data_str = seq.get("data", "")
+if not data_str or not str(data_str).strip():
+    values = []
+else:
+    # Split on commas, strip whitespace, and discard empty terms.
+    raw_terms = str(data_str).split(",")
+    values = [term.strip() for term in raw_terms if term.strip()]
+values = values[:20]  # first 20 terms
+return {
+    "id": oeis_id,
+    "name": seq.get("name", ""),
+    "description": seq.get("comment", [""])[0] if seq.get("comment") else "",
+    "values": values,
```
Comment on lines +61 to +68

```python
for oeis_id in ids:
    print(f"Fetching: {oeis_id} …")
    try:
        results.append(fetch_sequence(oeis_id))
    except requests.RequestException as exc:
        print(f"  Error: {exc}")
        results.append({"id": oeis_id, "name": "", "description": "", "values": [], "url": ""})
    time.sleep(0.5)  # be polite
```

Copilot AI Feb 25, 2026


Progress/error messages are printed to stdout during scraping, which contaminates the JSON output when --output is not provided (stdout will not be valid JSON). Route logs to stderr (or gate them behind a verbosity flag) so stdout remains pure JSON as documented.

Comment on lines +2 to +3

```
beautifulsoup4>=4.12.0
lxml>=4.9.0
```

Copilot AI Feb 25, 2026


beautifulsoup4 and lxml are listed as dependencies, but the scrapers currently only use requests and do not parse HTML. Dropping unused dependencies will keep installs smaller and avoid native build issues for lxml in some environments.

Suggested change

```diff
-beautifulsoup4>=4.12.0
-lxml>=4.9.0
```

## Setup

```bash
pip install -r requirements.txt
```

Copilot AI Feb 25, 2026


The setup command uses pip install -r requirements.txt, which only works if the current working directory is scrapers/. Consider clarifying that assumption (or use pip install -r scrapers/requirements.txt) to reduce setup confusion when run from the repo root.

Suggested change

```diff
-pip install -r requirements.txt
+# From the repo root:
+pip install -r scrapers/requirements.txt
+# Or, from within the scrapers/ directory:
+# pip install -r requirements.txt
```
