Add data scrapers for simulation-theory research topics #32
Conversation
Co-authored-by: blackboxprogramming <118287761+blackboxprogramming@users.noreply.github.com>
Pull request overview
Adds a new scrapers/ module to collect external research data (arXiv papers, Wikipedia topic summaries, and OEIS sequences) aligned with the repository’s simulation-theory research themes.
Changes:
- Introduces three standalone Python scrapers (`arxiv_scraper.py`, `wikipedia_scraper.py`, `oeis_scraper.py`) with CLI arguments and JSON output.
- Adds `scrapers/requirements.txt` and `scrapers/README.md` documenting setup/usage and output formats.
- Adds a top-level `.gitignore` for common Python artifacts and local environment files.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| `scrapers/arxiv_scraper.py` | New arXiv Atom API scraper producing per-query JSON results. |
| `scrapers/wikipedia_scraper.py` | New Wikipedia API scraper producing topic summary JSON results. |
| `scrapers/oeis_scraper.py` | New OEIS JSON endpoint scraper producing sequence metadata/terms JSON results. |
| `scrapers/requirements.txt` | Declares Python dependencies for running the scrapers. |
| `scrapers/README.md` | Documents installation, usage, and output formats for the scrapers. |
| `.gitignore` | Adds Python build/cache ignores and a few local file patterns. |
```python
page = next(iter(pages.values()))

if "missing" in page:
```
`page = next(iter(pages.values()))` will raise `StopIteration` if the API response has no pages (e.g., unexpected response shape). Consider guarding for empty pages and returning an empty result (or raising a clearer error) to keep the scraper robust.
```diff
- page = next(iter(pages.values()))
- if "missing" in page:
+ page = next(iter(pages.values()), None)
+ if not page or "missing" in page:
```
```python
resp = requests.get(ARXIV_API, params=params, timeout=30)
resp.raise_for_status()

root = ET.fromstring(resp.text)
```
`ET.fromstring(resp.text)` can raise `xml.etree.ElementTree.ParseError` (e.g., partial responses, HTML error pages), but only `requests.RequestException` is caught. This can crash the run despite the docs/PR description claiming graceful failures; consider catching `ET.ParseError` (and possibly returning an empty list) in `scrape` or `fetch_papers`.
```diff
- root = ET.fromstring(resp.text)
+ try:
+     root = ET.fromstring(resp.text)
+ except ET.ParseError:
+     # Malformed or non-XML response; fail gracefully with no papers.
+     return []
```
```python
for query in queries:
    print(f"Fetching: {query!r} …")
    try:
        results[query] = fetch_papers(query, max_results=max_per_query)
    except requests.RequestException as exc:
        print(f" Error: {exc}")
        results[query] = []
    time.sleep(1)  # be polite to the API
```
Progress/error messages are printed to stdout during scraping, which contaminates the JSON output when --output is not provided (stdout will not be valid JSON). Route logs to stderr (or gate them behind a verbosity flag) so stdout remains pure JSON as documented.
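One minimal way to enforce that separation (a sketch, not code from the PR; `log` and `emit_results` are hypothetical helper names):

```python
import json
import sys


def log(message: str) -> None:
    # Human-readable progress/errors go to stderr, never stdout.
    print(message, file=sys.stderr)


def emit_results(results: dict) -> None:
    # Only the JSON document touches stdout, so redirects and pipes
    # (e.g. `scraper.py > out.json`, `scraper.py | jq .`) stay valid.
    json.dump(results, sys.stdout, indent=2)
    sys.stdout.write("\n")


log("Fetching: 'simulation hypothesis' …")
emit_results({"simulation hypothesis": []})
```

With this split, `python scraper.py > out.json` still shows progress on the terminal while `out.json` contains nothing but JSON.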
```markdown
## Output format

All scrapers output JSON to stdout by default, or to a file with `--output`.
```
This README states that scrapers "output JSON to stdout by default", but the scripts also print progress/error lines to stdout, which makes stdout not valid JSON. Either update the scripts to log to stderr or adjust the documentation to reflect the mixed output.
```diff
- All scrapers output JSON to stdout by default, or to a file with `--output`.
+ All scrapers emit their results as JSON. When run without `--output`, JSON is printed to stdout along with occasional human-readable progress or error messages; use `--output` to write clean JSON to a file.
```
```python
resp = requests.get(WIKIPEDIA_API, params=params, timeout=30)
resp.raise_for_status()
data = resp.json()
```
`resp.json()` can raise a `ValueError`/`JSONDecodeError` if Wikipedia returns a non-JSON response (transient upstream issue, proxy, HTML error page). Right now only `requests.RequestException` is handled, so the scraper can still crash despite the "graceful" goal; consider catching JSON decode errors as well.
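A small guard around the decode step would cover this (a sketch; `parse_json_body` is a hypothetical helper — in the scraper it would sit where `resp.json()` is called, e.g. as `parse_json_body(resp.text)`):

```python
import json


def parse_json_body(text: str) -> dict:
    """Decode an API response body, returning {} for non-JSON payloads.

    json.JSONDecodeError is a subclass of ValueError, so catching
    ValueError also covers the error raised by requests' resp.json().
    """
    try:
        return json.loads(text)
    except ValueError:
        # HTML error page, proxy interstitial, truncated body, etc.
        return {}
```

Callers can then treat an empty dict like a "missing" page and fall through to the existing empty-result path.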
```python
print(f"Fetching: {topic!r} …")
try:
    results.append(fetch_summary(topic))
except requests.RequestException as exc:
    print(f" Error: {exc}")
    results.append({"topic": topic, "title": topic, "url": "", "summary": ""})
time.sleep(0.5)  # be polite
```
Progress/error messages are printed to stdout during scraping, which contaminates the JSON output when --output is not provided (the stdout stream will no longer be valid JSON). Send logs to stderr (or add a --quiet/--verbose flag) so stdout can remain machine-readable JSON as documented.
```python
return {
    "id": oeis_id,
    "name": seq.get("name", ""),
    "description": seq.get("comment", [""])[0] if seq.get("comment") else "",
    "values": seq.get("data", "").split(",")[:20],  # first 20 terms
```
If OEIS returns no `data` field (or it is empty), `"".split(",")` yields `[""]`, so `values` becomes a list containing an empty string instead of an empty list. Consider normalizing this (e.g., return `[]` when `data` is missing/blank, and strip whitespace from terms).
```diff
- return {
-     "id": oeis_id,
-     "name": seq.get("name", ""),
-     "description": seq.get("comment", [""])[0] if seq.get("comment") else "",
-     "values": seq.get("data", "").split(",")[:20],  # first 20 terms
+ # Normalize the data field: handle missing/blank data and strip whitespace.
+ data_str = seq.get("data", "")
+ if not data_str or not str(data_str).strip():
+     values = []
+ else:
+     # Split on commas, strip whitespace, and discard empty terms.
+     raw_terms = str(data_str).split(",")
+     values = [term.strip() for term in raw_terms if term.strip()]
+ values = values[:20]  # first 20 terms
+ return {
+     "id": oeis_id,
+     "name": seq.get("name", ""),
+     "description": seq.get("comment", [""])[0] if seq.get("comment") else "",
+     "values": values,
```
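The pitfall and the fix are easy to confirm in a REPL (a sketch; `parse_terms` is a hypothetical one-liner equivalent of the normalization above):

```python
def parse_terms(data, limit: int = 20) -> list:
    # Split on commas, strip whitespace, drop empty terms; [] for missing/blank data.
    return [t.strip() for t in str(data or "").split(",") if t.strip()][:limit]


# str.split always returns at least one element, so blank data yields [""].
assert "".split(",") == [""]
# Normalized: blank/missing data becomes an empty list, terms are stripped.
assert parse_terms("") == []
assert parse_terms("1, 1, 2, 3, 5") == ["1", "1", "2", "3", "5"]
```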
```python
for oeis_id in ids:
    print(f"Fetching: {oeis_id} …")
    try:
        results.append(fetch_sequence(oeis_id))
    except requests.RequestException as exc:
        print(f" Error: {exc}")
        results.append({"id": oeis_id, "name": "", "description": "", "values": [], "url": ""})
    time.sleep(0.5)  # be polite
```
Progress/error messages are printed to stdout during scraping, which contaminates the JSON output when --output is not provided (stdout will not be valid JSON). Route logs to stderr (or gate them behind a verbosity flag) so stdout remains pure JSON as documented.
```text
beautifulsoup4>=4.12.0
lxml>=4.9.0
```
`beautifulsoup4` and `lxml` are listed as dependencies, but the scrapers currently only use `requests` and do not parse HTML. Dropping unused dependencies will keep installs smaller and avoid native build issues for `lxml` in some environments.
```diff
- beautifulsoup4>=4.12.0
- lxml>=4.9.0
```
## Setup

```bash
pip install -r requirements.txt
```
The setup command uses `pip install -r requirements.txt`, which only works if the current working directory is `scrapers/`. Consider clarifying that assumption (or use `pip install -r scrapers/requirements.txt`) to reduce setup confusion when run from the repo root.
```diff
- pip install -r requirements.txt
+ # From the repo root:
+ pip install -r scrapers/requirements.txt
+ # Or, from within the scrapers/ directory:
+ # pip install -r requirements.txt
```
Adds a `scrapers/` directory with three focused Python scrapers to collect external data relevant to the repository's research topics.

Scrapers

- `arxiv_scraper.py` — queries the arXiv API for papers on the simulation hypothesis, Gödel incompleteness, the Riemann zeta function, SHA-256 hash chains, qutrit/ternary quantum systems, IIT consciousness, and the halting problem
- `wikipedia_scraper.py` — fetches plain-text article summaries via the Wikipedia API for key topics (SHA-2, Riemann hypothesis, Euler's identity, fine-structure constant, Turing machine, DNA, Blockchain, etc.)
- `oeis_scraper.py` — pulls sequence metadata and the first 20 terms from OEIS for sequences relevant to the repo (primes, Fibonacci, π digits, Euler–Mascheroni constant, Catalan numbers, partition numbers)

Usage
All scrapers default to a built-in topic/sequence list aligned with the repository's research areas and output JSON to stdout. Error handling is graceful — failed fetches return empty results without crashing the run.
Also adds `.gitignore` covering Python build artifacts.

Warning
Firewall rules blocked me from connecting to one or more addresses.

I tried to connect to the following addresses, but was blocked by firewall rules:

- `export.arxiv.org` — triggered by `/usr/bin/python python arxiv_scraper.py --query simulation hypothesis computational --max 2` (dns block)

If you need me to access, download, or install something from one of these locations, you can either: