Add data scrapers for simulation-theory research topics #32
Conversation
Co-authored-by: blackboxprogramming <118287761+blackboxprogramming@users.noreply.github.com>
Pull request overview
Adds a new scrapers/ module to collect external research data (arXiv papers, Wikipedia topic summaries, and OEIS sequences) aligned with the repository’s simulation-theory research themes.
Changes:
- Introduces three standalone Python scrapers (`arxiv_scraper.py`, `wikipedia_scraper.py`, `oeis_scraper.py`) with CLI arguments and JSON output.
- Adds `scrapers/requirements.txt` and `scrapers/README.md` documenting setup/usage and output formats.
- Adds a top-level `.gitignore` for common Python artifacts and local environment files.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| `scrapers/arxiv_scraper.py` | New arXiv Atom API scraper producing per-query JSON results. |
| `scrapers/wikipedia_scraper.py` | New Wikipedia API scraper producing topic summary JSON results. |
| `scrapers/oeis_scraper.py` | New OEIS JSON endpoint scraper producing sequence metadata/terms JSON results. |
| `scrapers/requirements.txt` | Declares Python dependencies for running the scrapers. |
| `scrapers/README.md` | Documents installation, usage, and output formats for the scrapers. |
| `.gitignore` | Adds Python build/cache ignores and a few local file patterns. |
```python
page = next(iter(pages.values()))

if "missing" in page:
```
`page = next(iter(pages.values()))` will raise `StopIteration` if the API response has no pages (e.g., unexpected response shape). Consider guarding for empty pages and returning an empty result (or raising a clearer error) to keep the scraper robust.
```diff
- page = next(iter(pages.values()))
- if "missing" in page:
+ page = next(iter(pages.values()), None)
+ if not page or "missing" in page:
```
```python
resp = requests.get(ARXIV_API, params=params, timeout=30)
resp.raise_for_status()

root = ET.fromstring(resp.text)
```
`ET.fromstring(resp.text)` can raise `xml.etree.ElementTree.ParseError` (e.g., partial responses, HTML error pages), but only `requests.RequestException` is caught. This can crash the run despite the docs/PR description claiming graceful failures; consider catching `ET.ParseError` (and possibly returning an empty list) in `scrape` or `fetch_papers`.
```diff
- root = ET.fromstring(resp.text)
+ try:
+     root = ET.fromstring(resp.text)
+ except ET.ParseError:
+     # Malformed or non-XML response; fail gracefully with no papers.
+     return []
```
```python
for query in queries:
    print(f"Fetching: {query!r} …")
    try:
        results[query] = fetch_papers(query, max_results=max_per_query)
    except requests.RequestException as exc:
        print(f" Error: {exc}")
        results[query] = []
    time.sleep(1)  # be polite to the API
```
Progress/error messages are printed to stdout during scraping, which contaminates the JSON output when --output is not provided (stdout will not be valid JSON). Route logs to stderr (or gate them behind a verbosity flag) so stdout remains pure JSON as documented.
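One minimal way to enforce that separation (a sketch, not code from the PR; `log` and `emit_results` are hypothetical helper names):

```python
import json
import sys


def log(message: str) -> None:
    # Human-readable progress/errors go to stderr, never stdout.
    print(message, file=sys.stderr)


def emit_results(results: dict) -> None:
    # Only the JSON document touches stdout, so redirects and pipes
    # (e.g. `scraper.py > out.json`, `scraper.py | jq .`) stay valid.
    json.dump(results, sys.stdout, indent=2)
    sys.stdout.write("\n")


log("Fetching: 'simulation hypothesis' …")
emit_results({"simulation hypothesis": []})
```

With this split, `python scraper.py > out.json` still shows progress on the terminal while `out.json` contains nothing but JSON.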
```markdown
## Output format

All scrapers output JSON to stdout by default, or to a file with `--output`.
```
This README states that scrapers "output JSON to stdout by default", but the scripts also print progress/error lines to stdout, which makes stdout not valid JSON. Either update the scripts to log to stderr or adjust the documentation to reflect the mixed output.
```diff
- All scrapers output JSON to stdout by default, or to a file with `--output`.
+ All scrapers emit their results as JSON. When run without `--output`, JSON is printed to stdout along with occasional human-readable progress or error messages; use `--output` to write clean JSON to a file.
```
```python
resp = requests.get(WIKIPEDIA_API, params=params, timeout=30)
resp.raise_for_status()
data = resp.json()
```
`resp.json()` can raise a `ValueError`/`JSONDecodeError` if Wikipedia returns a non-JSON response (transient upstream issue, proxy, HTML error page). Right now only `requests.RequestException` is handled, so the scraper can still crash despite the "graceful" goal; consider catching JSON decode errors as well.
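A small guard around the decode step would cover this (a sketch; `parse_json_body` is a hypothetical helper — in the scraper it would sit where `resp.json()` is called, e.g. as `parse_json_body(resp.text)`):

```python
import json


def parse_json_body(text: str) -> dict:
    """Decode an API response body, returning {} for non-JSON payloads.

    json.JSONDecodeError is a subclass of ValueError, so catching
    ValueError also covers the error raised by requests' resp.json().
    """
    try:
        return json.loads(text)
    except ValueError:
        # HTML error page, proxy interstitial, truncated body, etc.
        return {}
```

Callers can then treat an empty dict like a "missing" page and fall through to the existing empty-result path.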
```python
print(f"Fetching: {topic!r} …")
try:
    results.append(fetch_summary(topic))
except requests.RequestException as exc:
    print(f" Error: {exc}")
    results.append({"topic": topic, "title": topic, "url": "", "summary": ""})
time.sleep(0.5)  # be polite
```
Progress/error messages are printed to stdout during scraping, which contaminates the JSON output when --output is not provided (the stdout stream will no longer be valid JSON). Send logs to stderr (or add a --quiet/--verbose flag) so stdout can remain machine-readable JSON as documented.
```python
return {
    "id": oeis_id,
    "name": seq.get("name", ""),
    "description": seq.get("comment", [""])[0] if seq.get("comment") else "",
    "values": seq.get("data", "").split(",")[:20],  # first 20 terms
```
If OEIS returns no `data` field (or it is empty), `"".split(",")` yields `[""]`, so `values` becomes a list containing an empty string instead of an empty list. Consider normalizing this (e.g., return `[]` when `data` is missing/blank, and strip whitespace from terms).
```diff
- return {
-     "id": oeis_id,
-     "name": seq.get("name", ""),
-     "description": seq.get("comment", [""])[0] if seq.get("comment") else "",
-     "values": seq.get("data", "").split(",")[:20],  # first 20 terms
+ # Normalize the data field: handle missing/blank data and strip whitespace.
+ data_str = seq.get("data", "")
+ if not data_str or not str(data_str).strip():
+     values = []
+ else:
+     # Split on commas, strip whitespace, and discard empty terms.
+     raw_terms = str(data_str).split(",")
+     values = [term.strip() for term in raw_terms if term.strip()]
+ values = values[:20]  # first 20 terms
+ return {
+     "id": oeis_id,
+     "name": seq.get("name", ""),
+     "description": seq.get("comment", [""])[0] if seq.get("comment") else "",
+     "values": values,
```
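The pitfall and the fix are easy to confirm in a REPL (a sketch; `parse_terms` is a hypothetical one-liner equivalent of the normalization above):

```python
def parse_terms(data, limit: int = 20) -> list:
    # Split on commas, strip whitespace, drop empty terms; [] for missing/blank data.
    return [t.strip() for t in str(data or "").split(",") if t.strip()][:limit]


# str.split always returns at least one element, so blank data yields [""].
assert "".split(",") == [""]
# Normalized: blank/missing data becomes an empty list, terms are stripped.
assert parse_terms("") == []
assert parse_terms("1, 1, 2, 3, 5") == ["1", "1", "2", "3", "5"]
```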
```python
for oeis_id in ids:
    print(f"Fetching: {oeis_id} …")
    try:
        results.append(fetch_sequence(oeis_id))
    except requests.RequestException as exc:
        print(f" Error: {exc}")
        results.append({"id": oeis_id, "name": "", "description": "", "values": [], "url": ""})
    time.sleep(0.5)  # be polite
```
Progress/error messages are printed to stdout during scraping, which contaminates the JSON output when --output is not provided (stdout will not be valid JSON). Route logs to stderr (or gate them behind a verbosity flag) so stdout remains pure JSON as documented.
```text
beautifulsoup4>=4.12.0
lxml>=4.9.0
```
`beautifulsoup4` and `lxml` are listed as dependencies, but the scrapers currently only use `requests` and do not parse HTML. Dropping unused dependencies will keep installs smaller and avoid native build issues for `lxml` in some environments.
```diff
- beautifulsoup4>=4.12.0
- lxml>=4.9.0
```
## Setup

```bash
pip install -r requirements.txt
```
The setup command uses `pip install -r requirements.txt`, which only works if the current working directory is `scrapers/`. Consider clarifying that assumption (or use `pip install -r scrapers/requirements.txt`) to reduce setup confusion when run from the repo root.
```diff
- pip install -r requirements.txt
+ # From the repo root:
+ pip install -r scrapers/requirements.txt
+ # Or, from within the scrapers/ directory:
+ # pip install -r requirements.txt
```
Adds a `scrapers/` directory with three focused Python scrapers to collect external data relevant to the repository's research topics.

Scrapers

- `arxiv_scraper.py` — queries the arXiv API for papers on the simulation hypothesis, Gödel incompleteness, the Riemann zeta function, SHA-256 hash chains, qutrit/ternary quantum systems, IIT consciousness, and the halting problem
- `wikipedia_scraper.py` — fetches plain-text article summaries via the Wikipedia API for key topics (SHA-2, Riemann hypothesis, Euler's identity, fine-structure constant, Turing machine, DNA, Blockchain, etc.)
- `oeis_scraper.py` — pulls sequence metadata and the first 20 terms from OEIS for sequences relevant to the repo (primes, Fibonacci, π digits, Euler–Mascheroni constant, Catalan numbers, partition numbers)

Usage
All scrapers default to a built-in topic/sequence list aligned with the repository's research areas and output JSON to stdout. Error handling is graceful — failed fetches return empty results without crashing the run.
Also adds `.gitignore` covering Python build artifacts.

Warning
Firewall rules blocked me from connecting to one or more addresses.

I tried to connect to the following addresses, but was blocked by firewall rules:

- `export.arxiv.org` — triggered by `/usr/bin/python python arxiv_scraper.py --query simulation hypothesis computational --max 2` (dns block)

If you need me to access, download, or install something from one of these locations, you can either: