# Add data scrapers for simulation-theory research topics (#32)
**`.gitignore`** (new file, 10 lines)

```
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
*.egg-info/
dist/
build/
.env
*.json.bak
```
**`README.md`** (new file, 94 lines)

# Scrapers

Python web scrapers for collecting data relevant to the simulation-theory research repository.

## Scrapers

| Script | Source | Topics |
|--------|--------|--------|
| [`arxiv_scraper.py`](./arxiv_scraper.py) | [arXiv](https://arxiv.org) | Simulation hypothesis, Gödel incompleteness, Riemann zeta, qutrit/ternary quantum, halting problem, IIT consciousness |
| [`wikipedia_scraper.py`](./wikipedia_scraper.py) | [Wikipedia](https://en.wikipedia.org) | SHA-256, Riemann hypothesis, quantum computing, Euler's identity, fine-structure constant, Turing machine, DNA, blockchain |
| [`oeis_scraper.py`](./oeis_scraper.py) | [OEIS](https://oeis.org) | Prime numbers, Fibonacci, pi digits, Euler–Mascheroni constant, Catalan numbers, partition numbers |

## Setup

```bash
pip install -r requirements.txt
```

## Usage

### arXiv scraper

```bash
# Use default topic list
python arxiv_scraper.py

# Custom query, limit to 3 results per query
python arxiv_scraper.py --query "Riemann hypothesis zeros" --max 3

# Save to file
python arxiv_scraper.py --output arxiv_results.json
```

### Wikipedia scraper

```bash
# Use default topic list
python wikipedia_scraper.py

# Custom topics
python wikipedia_scraper.py --topics "Riemann hypothesis" "SHA-2" "Turing machine"

# Save to file
python wikipedia_scraper.py --output wikipedia_results.json
```

### OEIS scraper

```bash
# Use default sequence list
python oeis_scraper.py

# Custom sequence IDs
python oeis_scraper.py --ids A000040 A000045 A000796

# Save to file
python oeis_scraper.py --output oeis_results.json
```

## Output format

All scrapers output JSON to stdout by default, or to a file with `--output`.
**Suggested change** (review suggestion on the "Output format" line):

```diff
- All scrapers output JSON to stdout by default, or to a file with `--output`.
+ All scrapers emit their results as JSON. When run without `--output`, JSON is printed to stdout along with occasional human-readable progress or error messages; use `--output` to write clean JSON to a file.
```
**`arxiv_scraper.py`** (new file, 119 lines; excerpt)
```python
"""
arXiv scraper — fetches abstracts for papers related to simulation theory research topics.

Topics covered: simulation hypothesis, Gödel incompleteness, Riemann hypothesis,
quantum computation, SHA-256/cryptographic hash functions, consciousness/integrated
information theory, ternary/qutrit systems.

Usage:
    python arxiv_scraper.py
    python arxiv_scraper.py --query "Riemann hypothesis" --max 5
    python arxiv_scraper.py --output results.json
"""

import argparse
import json
import time
import xml.etree.ElementTree as ET

import requests

ARXIV_API = "https://export.arxiv.org/api/query"

DEFAULT_QUERIES = [
    "simulation hypothesis computational reality",
    "Gödel incompleteness self-reference formal systems",
    "Riemann zeta function trivial zeros",
    "SHA-256 hash chain cryptographic proof",
    "qutrit ternary quantum computation",
    "integrated information theory consciousness",
    "halting problem quantum physics undecidability",
]

NS = {"atom": "http://www.w3.org/2005/Atom", "arxiv": "http://arxiv.org/schemas/atom"}


def fetch_papers(query: str, max_results: int = 5) -> list[dict]:
    """Return a list of paper dicts for the given arXiv search query."""
    params = {
        "search_query": f"all:{query}",
        "start": 0,
        "max_results": max_results,
        "sortBy": "relevance",
        "sortOrder": "descending",
    }
    resp = requests.get(ARXIV_API, params=params, timeout=30)
    resp.raise_for_status()

    root = ET.fromstring(resp.text)
```
**Suggested change:**

```diff
- root = ET.fromstring(resp.text)
+ try:
+     root = ET.fromstring(resp.text)
+ except ET.ParseError:
+     # Malformed or non-XML response; fail gracefully with no papers.
+     return []
```
> **Copilot** (AI) commented on Feb 25, 2026: Progress/error messages are printed to stdout during scraping, which contaminates the JSON output when `--output` is not provided (stdout will not be valid JSON). Route logs to stderr (or gate them behind a verbosity flag) so stdout remains pure JSON as documented.
**`oeis_scraper.py`** (new file, 100 lines; excerpt)
```python
"""
OEIS (On-Line Encyclopedia of Integer Sequences) scraper — fetches sequence
metadata for integer sequences relevant to simulation-theory research.

Sequences of interest: primes, Fibonacci, pi digits, Euler–Mascheroni constant
digits, Pascal's triangle, Catalan numbers, SHA-256 round constants, and others.

Usage:
    python oeis_scraper.py
    python oeis_scraper.py --ids A000040 A000045
    python oeis_scraper.py --output results.json
"""

import argparse
import json
import time

import requests

OEIS_SEARCH_URL = "https://oeis.org/search"

# Default sequence IDs relevant to the repository topics
DEFAULT_IDS = [
    "A000040",  # prime numbers
    "A000045",  # Fibonacci numbers
    "A000796",  # decimal expansion of pi
    "A001620",  # decimal expansion of Euler–Mascheroni constant
    "A000108",  # Catalan numbers
    "A000012",  # the all-1s sequence (trivial zero analogue)
    "A000720",  # pi(n): number of primes <= n
    "A006862",  # Euclid numbers: 1 + product of first n primes
    "A000041",  # number of partitions of n
    "A001358",  # semiprimes
]


def fetch_sequence(oeis_id: str) -> dict:
    """Fetch metadata for a single OEIS sequence via the JSON search endpoint."""
    params = {"q": f"id:{oeis_id}", "fmt": "json"}
    resp = requests.get(OEIS_SEARCH_URL, params=params, timeout=30)
    resp.raise_for_status()
    data = resp.json()
```
**Suggested change:**

```diff
- data = resp.json()
+ try:
+     data = resp.json()
+ except ValueError:
+     # OEIS returned a non-JSON response (e.g., HTML error page); return empty result.
+     return {"id": oeis_id, "name": "", "description": "", "values": [], "url": ""}
```
> **Copilot** (AI) commented on Feb 25, 2026: If OEIS returns no `data` field (or it is empty), `"".split(",")` yields `[""]`, so `values` becomes a list containing an empty string instead of an empty list. Consider normalizing this (e.g., return `[]` when `data` is missing/blank, and strip whitespace from terms).
**Suggested change:**

```diff
- return {
-     "id": oeis_id,
-     "name": seq.get("name", ""),
-     "description": seq.get("comment", [""])[0] if seq.get("comment") else "",
-     "values": seq.get("data", "").split(",")[:20],  # first 20 terms
+ # Normalize the data field: handle missing/blank data and strip whitespace.
+ data_str = seq.get("data", "")
+ if not data_str or not str(data_str).strip():
+     values = []
+ else:
+     # Split on commas, strip whitespace, and discard empty terms.
+     raw_terms = str(data_str).split(",")
+     values = [term.strip() for term in raw_terms if term.strip()]
+ values = values[:20]  # first 20 terms
+ return {
+     "id": oeis_id,
+     "name": seq.get("name", ""),
+     "description": seq.get("comment", [""])[0] if seq.get("comment") else "",
+     "values": values,
```
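The edge case the reviewer flags is easy to reproduce, and the suggested normalization can be checked standalone; a minimal sketch (the `normalize_terms` helper name is illustrative, not part of the PR):

```python
def normalize_terms(data_str: str, limit: int = 20) -> list[str]:
    """Split an OEIS data field into terms, dropping blanks and stray whitespace."""
    if not data_str or not data_str.strip():
        return []
    return [t.strip() for t in data_str.split(",") if t.strip()][:limit]


print("".split(","))                      # → [''] — the bug the reviewer describes
print(normalize_terms(""))                # → []
print(normalize_terms("2, 3, 5, 7, 11"))  # → ['2', '3', '5', '7', '11']
```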
> **Copilot** (AI) commented on Feb 25, 2026: Progress/error messages are printed to stdout during scraping, which contaminates the JSON output when `--output` is not provided (stdout will not be valid JSON). Route logs to stderr (or gate them behind a verbosity flag) so stdout remains pure JSON as documented.
**`requirements.txt`** (new file, 3 lines)

```
requests>=2.31.0
beautifulsoup4>=4.12.0
lxml>=4.9.0
```
Review comment on lines 2–3 (`beautifulsoup4>=4.12.0`, `lxml>=4.9.0`):

> The setup command uses `pip install -r requirements.txt`, which only works if the current working directory is `scrapers/`. Consider clarifying that assumption (or use `pip install -r scrapers/requirements.txt`) to reduce setup confusion when run from the repo root.