10 changes: 10 additions & 0 deletions .gitignore
@@ -0,0 +1,10 @@
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
*.egg-info/
dist/
build/
.env
*.json.bak
94 changes: 94 additions & 0 deletions scrapers/README.md
@@ -0,0 +1,94 @@
# Scrapers

Python web scrapers for collecting data relevant to the simulation-theory research repository.

## Available scrapers

| Script | Source | Topics |
|--------|--------|--------|
| [`arxiv_scraper.py`](./arxiv_scraper.py) | [arXiv](https://arxiv.org) | Simulation hypothesis, Gödel incompleteness, Riemann zeta, qutrit/ternary quantum, halting problem, IIT consciousness |
| [`wikipedia_scraper.py`](./wikipedia_scraper.py) | [Wikipedia](https://en.wikipedia.org) | SHA-256, Riemann hypothesis, quantum computing, Euler's identity, fine-structure constant, Turing machine, DNA, Blockchain |
| [`oeis_scraper.py`](./oeis_scraper.py) | [OEIS](https://oeis.org) | Prime numbers, Fibonacci, pi digits, Euler–Mascheroni constant, Catalan numbers, partition numbers |

## Setup

```bash
# From the repo root:
pip install -r scrapers/requirements.txt

# Or, from within the scrapers/ directory:
pip install -r requirements.txt
```

## Usage

### arXiv scraper

```bash
# Use default topic list
python arxiv_scraper.py

# Custom query, limit to 3 results per query
python arxiv_scraper.py --query "Riemann hypothesis zeros" --max 3

# Save to file
python arxiv_scraper.py --output arxiv_results.json
```

### Wikipedia scraper

```bash
# Use default topic list
python wikipedia_scraper.py

# Custom topics
python wikipedia_scraper.py --topics "Riemann hypothesis" "SHA-2" "Turing machine"

# Save to file
python wikipedia_scraper.py --output wikipedia_results.json
```

### OEIS scraper

```bash
# Use default sequence list
python oeis_scraper.py

# Custom sequence IDs
python oeis_scraper.py --ids A000040 A000045 A000796

# Save to file
python oeis_scraper.py --output oeis_results.json
```

## Output format

All scrapers emit their results as JSON. When run without `--output`, the JSON is printed to stdout along with progress and error messages, so redirected stdout is not guaranteed to be clean JSON; pass `--output` to write pure JSON to a file.

**arXiv** — dict keyed by query, each value is a list of:
```json
{
  "title": "...",
  "authors": ["..."],
  "published": "2024-01-01T00:00:00Z",
  "abstract": "...",
  "url": "https://arxiv.org/abs/..."
}
```

**Wikipedia** — list of:
```json
{
  "topic": "SHA-2",
  "title": "SHA-2",
  "url": "https://en.wikipedia.org/wiki/SHA-2",
  "summary": "..."
}
```

**OEIS** — list of:
```json
{
  "id": "A000040",
  "name": "The prime numbers.",
  "description": "...",
  "values": ["2", "3", "5", "7", "11", "..."],
  "url": "https://oeis.org/A000040"
}
```
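The shapes above can be consumed directly with the standard library. A minimal sketch that round-trips an OEIS-style record (the sample data and the `oeis_results.json` filename mirror the usage examples above; the integer conversion is just one illustrative post-processing step, since OEIS terms are serialized as strings):

```python
import json

# A sample record in the OEIS output shape documented above.
sample = [
    {
        "id": "A000040",
        "name": "The prime numbers.",
        "description": "...",
        "values": ["2", "3", "5", "7", "11"],
        "url": "https://oeis.org/A000040",
    }
]

# Write the records the same way the scrapers do with --output.
with open("oeis_results.json", "w", encoding="utf-8") as fh:
    json.dump(sample, fh, indent=2, ensure_ascii=False)

# Load them back and post-process: terms are strings, so convert to ints.
with open("oeis_results.json", encoding="utf-8") as fh:
    records = json.load(fh)

terms = [int(v) for v in records[0]["values"]]
print(terms)  # → [2, 3, 5, 7, 11]
```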
119 changes: 119 additions & 0 deletions scrapers/arxiv_scraper.py
@@ -0,0 +1,119 @@
"""
arXiv scraper — fetches abstracts for papers related to simulation theory research topics.

Topics covered: simulation hypothesis, Gödel incompleteness, Riemann hypothesis,
quantum computation, SHA-256/cryptographic hash functions, consciousness/integrated
information theory, ternary/qutrit systems.

Usage:
    python arxiv_scraper.py
    python arxiv_scraper.py --query "Riemann hypothesis" --max 5
    python arxiv_scraper.py --output results.json
"""

import argparse
import json
import time
import xml.etree.ElementTree as ET

import requests

ARXIV_API = "https://export.arxiv.org/api/query"

DEFAULT_QUERIES = [
    "simulation hypothesis computational reality",
    "Gödel incompleteness self-reference formal systems",
    "Riemann zeta function trivial zeros",
    "SHA-256 hash chain cryptographic proof",
    "qutrit ternary quantum computation",
    "integrated information theory consciousness",
    "halting problem quantum physics undecidability",
]

NS = {"atom": "http://www.w3.org/2005/Atom", "arxiv": "http://arxiv.org/schemas/atom"}


def fetch_papers(query: str, max_results: int = 5) -> list[dict]:
    """Return a list of paper dicts for the given arXiv search query."""
    params = {
        "search_query": f"all:{query}",
        "start": 0,
        "max_results": max_results,
        "sortBy": "relevance",
        "sortOrder": "descending",
    }
    resp = requests.get(ARXIV_API, params=params, timeout=30)
    resp.raise_for_status()

    try:
        root = ET.fromstring(resp.text)
    except ET.ParseError:
        # Malformed or non-XML response (e.g. an HTML error page); fail gracefully.
        return []

    papers = []
    for entry in root.findall("atom:entry", NS):
        title_el = entry.find("atom:title", NS)
        summary_el = entry.find("atom:summary", NS)
        id_el = entry.find("atom:id", NS)
        published_el = entry.find("atom:published", NS)
        authors = [
            a.find("atom:name", NS).text
            for a in entry.findall("atom:author", NS)
            if a.find("atom:name", NS) is not None
        ]
        papers.append(
            {
                "title": title_el.text.strip() if title_el is not None else "",
                "authors": authors,
                "published": published_el.text.strip() if published_el is not None else "",
                "abstract": summary_el.text.strip() if summary_el is not None else "",
                "url": id_el.text.strip() if id_el is not None else "",
            }
        )
    return papers


def scrape(queries: list[str], max_per_query: int = 5) -> dict[str, list[dict]]:
    """Scrape arXiv for each query and return results keyed by query string."""
    results = {}
    for query in queries:
        print(f"Fetching: {query!r} …")
        try:
            results[query] = fetch_papers(query, max_results=max_per_query)
        except requests.RequestException as exc:
            print(f"  Error: {exc}")
            results[query] = []
        time.sleep(1)  # be polite to the API
    return results


def main() -> None:
    parser = argparse.ArgumentParser(description="Scrape arXiv for simulation-theory topics.")
    parser.add_argument(
        "--query",
        nargs="*",
        default=DEFAULT_QUERIES,
        help="Search queries (defaults to built-in topic list).",
    )
    parser.add_argument(
        "--max",
        type=int,
        default=5,
        dest="max_results",
        help="Maximum results per query (default: 5).",
    )
    parser.add_argument(
        "--output",
        default=None,
        help="Write results to a JSON file instead of stdout.",
    )
    args = parser.parse_args()

    results = scrape(args.query, max_per_query=args.max_results)

    if args.output:
        with open(args.output, "w", encoding="utf-8") as fh:
            json.dump(results, fh, indent=2, ensure_ascii=False)
        print(f"Results written to {args.output}")
    else:
        print(json.dumps(results, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    main()
100 changes: 100 additions & 0 deletions scrapers/oeis_scraper.py
@@ -0,0 +1,100 @@
"""
OEIS (On-Line Encyclopedia of Integer Sequences) scraper — fetches sequence
metadata for integer sequences relevant to simulation-theory research.

Sequences of interest: primes, Fibonacci, pi digits, Euler–Mascheroni constant
digits, Pascal's triangle, Catalan numbers, SHA-256 round constants, and others.

Usage:
    python oeis_scraper.py
    python oeis_scraper.py --ids A000040 A000045
    python oeis_scraper.py --output results.json
"""

import argparse
import json
import time

import requests

OEIS_SEARCH_URL = "https://oeis.org/search"

# Default sequence IDs relevant to the repository topics
DEFAULT_IDS = [
    "A000040",  # prime numbers
    "A000045",  # Fibonacci numbers
    "A000796",  # decimal expansion of pi
    "A001620",  # decimal expansion of Euler–Mascheroni constant
    "A000108",  # Catalan numbers
    "A000012",  # the all-1s sequence (trivial zero analogue)
    "A000720",  # pi(n): number of primes <= n
    "A006862",  # Euclid numbers: 1 + product of first n primes
    "A000041",  # number of partitions of n
    "A001358",  # semiprimes
]


def fetch_sequence(oeis_id: str) -> dict:
    """Fetch metadata for a single OEIS sequence via the JSON search endpoint."""
    params = {"q": f"id:{oeis_id}", "fmt": "json"}
    resp = requests.get(OEIS_SEARCH_URL, params=params, timeout=30)
    resp.raise_for_status()
    try:
        data = resp.json()
    except ValueError:
        # OEIS returned a non-JSON response (e.g. an HTML error page); return an empty record.
        return {"id": oeis_id, "name": "", "description": "", "values": [], "url": ""}

    results = data.get("results") or []
    if not results:
        return {"id": oeis_id, "name": "", "description": "", "values": [], "url": ""}

    seq = results[0]
    # Normalize the data field: a missing or blank "data" yields [] rather than [""],
    # and individual terms are stripped of whitespace.
    data_str = str(seq.get("data", ""))
    values = [term.strip() for term in data_str.split(",") if term.strip()][:20]  # first 20 terms
    return {
        "id": oeis_id,
        "name": seq.get("name", ""),
        "description": seq.get("comment", [""])[0] if seq.get("comment") else "",
        "values": values,
        "url": f"https://oeis.org/{oeis_id}",
    }


def scrape(ids: list[str]) -> list[dict]:
    """Scrape OEIS for each sequence ID."""
    results = []
    for oeis_id in ids:
        print(f"Fetching: {oeis_id} …")
        try:
            results.append(fetch_sequence(oeis_id))
        except requests.RequestException as exc:
            print(f"  Error: {exc}")
            results.append({"id": oeis_id, "name": "", "description": "", "values": [], "url": ""})
        time.sleep(0.5)  # be polite
    return results


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Scrape OEIS sequences relevant to simulation-theory research."
    )
    parser.add_argument(
        "--ids",
        nargs="*",
        default=DEFAULT_IDS,
        help="OEIS sequence IDs (e.g. A000040). Defaults to built-in list.",
    )
    parser.add_argument(
        "--output",
        default=None,
        help="Write results to a JSON file instead of stdout.",
    )
    args = parser.parse_args()

    results = scrape(args.ids)

    if args.output:
        with open(args.output, "w", encoding="utf-8") as fh:
            json.dump(results, fh, indent=2, ensure_ascii=False)
        print(f"Results written to {args.output}")
    else:
        print(json.dumps(results, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    main()
3 changes: 3 additions & 0 deletions scrapers/requirements.txt
@@ -0,0 +1,3 @@
requests>=2.31.0
beautifulsoup4>=4.12.0
lxml>=4.9.0