Wiki Table Data Extraction & Parsing Tool

Overview

This is a Python-based tool that downloads pages from a Fandom (MediaWiki) wiki, parses inconsistent wiki markup and embedded HTML, and converts the content of stored tables into structured, searchable JSON data.

It is built to handle the irregular formatting that results from the inconsistent naming conventions and page structure common in community wikis.

Features

Downloads wiki pages via the MediaWiki (Fandom) API
Saves raw page content locally in JSON format
Parses MediaWiki markup using mwparserfromhell
Extracts and cleans HTML using BeautifulSoup
Uses regex-based processing for edge cases and irregular formatting
Normalizes data into structured JSON objects
Includes a basic keyword search script for querying parsed content
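
As an illustration of the regex-based cleanup and normalization listed above, here is a minimal sketch that reduces one table cell of mixed wiki markup and HTML to plain text. It is a simplified standard-library stand-in, not the project's actual logic (which uses mwparserfromhell and BeautifulSoup); the function name clean_cell and the patterns are assumptions made for this example.

```python
import html
import re

# Hypothetical patterns for illustration; the project's own rules are not shown in this README.
WIKI_LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")  # [[Target|label]] -> label
LINE_BREAK = re.compile(r"<br\s*/?>", re.IGNORECASE)       # <br>, <br/>, <br />
HTML_TAG = re.compile(r"<[^>]+>")                          # any remaining tag
QUOTE_MARKUP = re.compile(r"'{2,}")                        # ''italic'' and '''bold'''
WHITESPACE = re.compile(r"\s+")


def clean_cell(raw: str) -> str:
    """Reduce one table cell of mixed wiki markup and HTML to plain text."""
    text = WIKI_LINK.sub(r"\1", raw)      # keep only the visible link label
    text = LINE_BREAK.sub(" ", text)      # turn line breaks into spaces
    text = HTML_TAG.sub("", text)         # drop other tags, keep their contents
    text = QUOTE_MARKUP.sub("", text)     # drop bold/italic quote runs
    text = html.unescape(text)            # decode entities such as &amp;
    return WHITESPACE.sub(" ", text).strip()


print(clean_cell("'''[[Iron Sword|Sword]]'''<br />Type: <span class=\"t\">Weapon</span> &amp; tool"))
```

Regexes like these only cover simple, non-nested markup, which is why a real parser is the better choice for anything beyond edge cases.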

Project Structure

download_pages.py – Fetches pages from the Fandom (MediaWiki) API and stores raw content as JSON
fandom_parser.py – Parses wiki markup and embedded HTML and extracts tables as semi-structured data
json_cleaner.py – Walks the parsed JSON and strips leftover HTML and wiki markup from the results (written specifically for the Shadow Slave wiki, whose formatting is inconsistent and unconventional)
search.py – Provides basic keyword search over parsed results

Tech Stack

Python
MediaWiki / Fandom API
mwparserfromhell
BeautifulSoup4
re (regex)
JSON

How It Works

Pages are fetched from the wiki API and saved locally
Raw wiki markup is parsed and table data is extracted
Data is serialized into structured JSON objects
Embedded HTML is cleaned or removed
Inconsistent formatting is normalized through custom parsing logic
A search script allows querying across parsed pages
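
The first step above (fetching a page and saving it locally) can be sketched as follows. This is a sketch under stated assumptions, not the code in download_pages.py: API_URL is a placeholder, the function names and the saved record layout are hypothetical, and only the standard library is used. The query parameters are the standard MediaWiki ones for requesting the latest revision's wikitext.

```python
import json
import urllib.parse
import urllib.request

# Assumption: placeholder endpoint; substitute the api.php URL of the target wiki.
API_URL = "https://example.fandom.com/api.php"


def build_query(title: str) -> str:
    """Return the api.php URL that requests the latest wikitext of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_wikitext(response: dict) -> str:
    """Pull the wikitext out of a formatversion=2 query response."""
    page = response["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]


def download_page(title: str, out_path: str) -> None:
    """Fetch one page and store its title and raw wikitext as JSON."""
    request = urllib.request.Request(
        build_query(title),
        headers={"User-Agent": "wiki-table-sketch/0.1"},  # hypothetical agent string
    )
    with urllib.request.urlopen(request) as reply:
        response = json.load(reply)
    # Hypothetical record layout; the project's saved JSON may differ.
    record = {"title": title, "wikitext": extract_wikitext(response)}
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, ensure_ascii=False, indent=2)
```

Calling download_page("Some Page", "some_page.json") would write one JSON file per page; a missing page has no "revisions" key, so a real script would need to handle that case.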

Usage

Install dependencies
Run the download script
The download script automatically runs the parsing and cleaning scripts
Use the search script to query results
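
To show what a basic keyword query over the parsed results might look like, here is a minimal sketch. It assumes the parsed data is a list of flat dictionaries; that layout, the sample records, and the function name search_records are all hypothetical and may not match what search.py actually does.

```python
def search_records(records, keyword):
    """Return every record in which any field contains the keyword (case-insensitive)."""
    needle = keyword.casefold()
    hits = []
    for record in records:
        # Compare against the string form of each value so numbers also match.
        if any(needle in str(value).casefold() for value in record.values()):
            hits.append(record)
    return hits


# Made-up sample data for illustration only.
records = [
    {"name": "Iron Sword", "description": "A plain blade"},
    {"name": "Oak Shield", "description": "Blocks sword strikes"},
    {"name": "Rope", "description": "Ten meters long"},
]

for hit in search_records(records, "SWORD"):
    print(hit["name"])
```

A linear scan like this is adequate for a small set of parsed pages; larger data sets would benefit from an index.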

Limitations

Parsing logic is tailored to a specific wiki's formatting conventions
Limited to extracting and formatting information stored in tables
Not intended to be a universal MediaWiki parser
Search is intentionally limited to basic keyword matching

Future Improvements

Generalize for multiple wikis
Add indexing for faster search
Export to CSV or database formats
Improve error handling and logging
Add fuzzy matching when searching descriptions
Create a GUI to search parsed data
Implement search in a Discord bot using discord.py and SQLite for community use (interest has been shown)
