MediaWiki API Project: Language Equity (Foreign Languages) #229

@chinaexpert1

Description

Overview

Measure language equity on sensitive topics by comparing how quickly key Wikipedia pages are created and updated across languages. Output a reproducible dataset and dashboard that quantify coverage (existence), “time-to-translation,” and update lag versus a reference language (e.g., English).

Action Items

If this is the beginning (research & design)

  • Define scope: start with ~50 English Wikipedia pages in sensitive domains (public health, migration, elections, climate disasters, human rights).
  • Choose language set: top 30 Wikipedias by article count + 10 low-resource languages for contrast (via sitematrix), or all languages returned by each page’s langlinks.
  • Finalize metrics & windows: coverage status, time-to-first-presence (page exists Y/N and when), and update lag (delta between latest edit timestamps across languages).
  • Methods plan: use MediaWiki Action API (prop=langlinks, prop=revisions, list=sitematrix) and optionally Pageviews REST API for context; decide reference language(s) for lags.
  • Tooling pairs: requests or httpx; pandas or polars; storage in duckdb or sqlite; viz in Altair or Plotly.

If researched and ready (implementation steps)

  1. Seed topics

    • Create a topics.csv of English page titles and (optional) Wikidata QIDs for disambiguation.
  2. Enumerate languages & interlanguage links

    • For each English title: prop=langlinks to get language codes and titles; merge with sitematrix to validate wiki domains.
  3. Fetch revision metadata

    • For English and each linked language title: prop=revisions&rvprop=timestamp|size|userid|comment&rvlimit=1, with rvdir=newer to get the first (creation) revision and the default order (newest first) to get the latest revision timestamp.
  4. Compute metrics

    • Coverage: page present (1/0).
    • Time-to-presence: first non-English creation time minus English creation time.
    • Update lag: latest English edit time minus latest non-English edit time (days).
    • Optional robustness: compare against median of top-N languages instead of only English.
  5. Deliver

    • Artifacts: pages_raw.parquet, lang_presence.parquet, lags.parquet, metrics.csv.
    • Dashboard: heatmap (languages × topics) of update lags; coverage bar charts; language ranking tables.
    • Methods README with exact queries, rate-limit/backoff notes, and caveats.
  6. Quality & Ops

    • Caching of responses; exponential backoff on maxlag.
    • Unit tests for merge logic and timestamp math; schema checks.
    • (Optional) Monthly GitHub Action to refresh a subset.
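The maxlag backoff in step 6 can be sketched as a small retry loop. This is a minimal illustration, not the project's implementation: the helper names (`with_defaults`, `get_with_backoff`) are assumptions, and the HTTP call is injected as a callable so the same logic works with requests or httpx.

```python
import time

def with_defaults(params):
    """Merge caller params with defaults every Action API request should carry."""
    return {"format": "json", "formatversion": 2, "maxlag": 5, **params}

def get_with_backoff(do_get, params, max_retries=5, sleep=time.sleep):
    """Call do_get(params) and retry with exponential backoff on maxlag errors.

    do_get is any callable returning the decoded JSON dict, e.g. a
    requests wrapper: lambda p: session.get(API_URL, params=p).json()
    """
    for attempt in range(max_retries):
        data = do_get(with_defaults(params))
        if data.get("error", {}).get("code") == "maxlag":
            sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
            continue
        return data
    raise RuntimeError("maxlag retries exhausted")
```

Injecting `do_get` also makes the backoff logic unit-testable with a fake response, which fits the testing goal in step 6.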

Resources/Instructions

API docs to pin in the repo

  • MediaWiki Action API overview: API:Action_API
  • Interlanguage links: API:Langlinks
  • Revisions metadata: API:Revisions
  • Site matrix (language list): API:Sitematrix
  • Pageviews (REST; optional context): Wikimedia REST API: Pageviews

Suggested libraries (pick pairs)

  • HTTP: requests | httpx
  • Frames: pandas | polars
  • Storage: duckdb | sqlite
  • Viz: Altair | Plotly
  • Parsing (optional): mwparserfromhell | wikitextparser

Sample queries to copy into notes

# Interlanguage links for an English page
action=query&prop=langlinks&titles=Migration_crisis&lllimit=max

# Latest revision timestamp for a given title on any wiki
action=query&prop=revisions&rvprop=timestamp|size&rvlimit=1&titles=<TITLE>

# First revision timestamp (creation): request oldest
action=query&prop=revisions&rvprop=timestamp&rvlimit=1&rvdir=newer&titles=<TITLE>

# Languages list (code ↔ wiki mapping)
action=sitematrix&formatversion=2
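The note-style query strings above can be converted into the params dicts that requests/httpx expect using only the standard library (the helper name `query_to_params` is an assumption for illustration):

```python
from urllib.parse import parse_qsl

def query_to_params(query):
    """Parse a 'k=v&k=v' query string from the notes into a params dict."""
    return dict(parse_qsl(query))

params = query_to_params(
    "action=query&prop=langlinks&titles=Migration_crisis&lllimit=max"
)
```

Note that pipe-joined values such as `rvprop=timestamp|size` survive intact, since `parse_qsl` only splits on `&`.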

Ethics & reporting

  • Avoid punitive “league tables.” Emphasize resource constraints and volunteer capacity.

  • Aggregate results; no profiling of individual editors.

  • Document missingness (pages that never existed, renamed pages, disambiguation).

  • Be explicit that latest edit time ≠ content parity (it’s a proxy).

  • If this issue requires access to 311 data, please answer the following questions:

    • Not applicable.

Project Outline (detailed plan for this idea)

Research question
Do sensitive-topic pages appear and stay up-to-date across languages at similar speeds, or do we observe systematic coverage gaps and update lags?

Data sources & modules

  • prop=langlinks to discover cross-language equivalents for each English seed page.
  • list=sitematrix to enumerate languages and validate wiki domains.
  • prop=revisions to get first and latest timestamps per page/language.
  • (Optional) Pageviews REST API to contextualize demand vs. freshness.

Method

  1. Build a seed list of English pages in public health, migration, elections, climate disasters, and human rights (store in topics.csv).
  2. For each page, pull langlinks to get target language titles; validate with sitematrix.
  3. For English + each language title, fetch first and latest revision timestamps.
  4. Compute metrics per (topic, language): coverage, time-to-presence, update lag; summarize by language family/region.
  5. Visualize a lag heatmap (languages × topics), coverage distributions, and top-lagging topics.
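Step 4's per-(topic, language) metrics can be sketched in plain Python before committing to pandas or polars. The function names (`parse_ts`, `page_metrics`) and the input shape (dicts with `first`/`latest` ISO timestamps, `None` when the page does not exist) are assumptions for illustration:

```python
from datetime import datetime

def parse_ts(ts):
    """MediaWiki returns ISO 8601 timestamps like 2020-01-01T00:00:00Z."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def days_between(later, earlier):
    return (parse_ts(later) - parse_ts(earlier)).total_seconds() / 86400.0

def page_metrics(en, other):
    """en/other: dicts with 'first' and 'latest' timestamps; other=None if absent."""
    if other is None:
        return {"coverage": 0, "time_to_presence_days": None, "update_lag_days": None}
    return {
        "coverage": 1,
        # Time-to-presence: this language's creation time minus English creation time.
        "time_to_presence_days": days_between(other["first"], en["first"]),
        # Update lag: latest English edit minus latest edit in this language.
        "update_lag_days": days_between(en["latest"], other["latest"]),
    }
```

The same arithmetic maps directly onto a pandas/polars column expression once the revision table exists.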

Key metrics

  • Coverage rate (% of topics that exist per language).
  • Median time-to-presence (days).
  • Median update lag (days) and % of topics with lag > thresholds (e.g., >30, >90 days).
  • (Optional) Correlate lag with pageviews to see “high-demand but stale” cases.
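A per-language summary of these metrics might look like the following sketch, which assumes the per-topic metric dicts produced upstream (`summarize_language` and the row shape are illustrative, not fixed):

```python
from statistics import median

def summarize_language(rows, thresholds=(30, 90)):
    """rows: per-topic dicts with 'coverage' and 'update_lag_days' (None if absent)."""
    lags = [r["update_lag_days"] for r in rows if r["update_lag_days"] is not None]
    out = {
        # Coverage rate: share of seed topics that exist in this language.
        "coverage_rate": sum(r["coverage"] for r in rows) / len(rows),
        "median_update_lag_days": median(lags) if lags else None,
    }
    # Share of existing pages whose lag exceeds each threshold (e.g. >30, >90 days).
    for t in thresholds:
        out[f"pct_lag_gt_{t}d"] = (sum(l > t for l in lags) / len(lags)) if lags else None
    return out
```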

Deliverables

  • Clean tables (lang_presence.parquet, first_latest_revisions.parquet, lags.parquet).
  • Reproducible notebook + reports/language_equity.md.
  • Streamlit/Altair dashboard with filters (topic set, language subset, thresholds).

Caveats & limitations

  • Timestamp proxies don’t guarantee semantic parity; translations may be partial.
  • Some languages may title pages differently or merge topics; handle redirects carefully.
  • API throttling and maxlag require polite batching and retries.

Implementation notes

  • Normalize timestamps to UTC; compute diffs in days (float).
  • Use stable keys: (wiki_db, pageid) when available; fall back to (lang_code, normalized_title).
  • Cache raw JSON responses and write a manifest of query params for reproducibility.
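The stable-key fallback and the cache manifest can be sketched as follows; the helper names (`stable_key`, `cache_name`) are assumptions, and the record fields mirror the keys named above:

```python
import hashlib
import json

def stable_key(rec):
    """Prefer (wiki_db, pageid); fall back to (lang_code, normalized_title)."""
    if rec.get("wiki_db") and rec.get("pageid") is not None:
        return (rec["wiki_db"], rec["pageid"])
    return (rec["lang_code"], rec["title"].strip().replace(" ", "_"))

def cache_name(params):
    """Deterministic cache filename for a raw JSON response: hash the sorted params."""
    blob = json.dumps(params, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()[:16] + ".json"
```

Sorting the params before hashing means the same logical query always maps to the same cache file, which is what makes the manifest reproducible.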
