This is a Python-based tool that downloads pages from a Fandom (MediaWiki) wiki, parses inconsistent wiki markup and embedded HTML, and converts the content of stored tables into structured, searchable JSON data.
It is built to handle the irregular formatting that stems from the inconsistent naming conventions and structure commonly found in community wikis.
- Downloads wiki pages via the MediaWiki (Fandom) API
- Saves raw page content locally in JSON format
- Parses MediaWiki markup using mwparserfromhell
- Extracts and cleans HTML using BeautifulSoup
- Uses regex-based processing for edge cases and irregular formatting
- Normalizes data into structured JSON objects
- Includes a basic keyword search script for querying parsed content
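The download step boils down to a plain HTTP GET against the wiki's MediaWiki API endpoint. A minimal sketch of building such a request, assuming the Shadow Slave wiki's standard Fandom endpoint (the page title and endpoint here are illustrative):

```python
from urllib.parse import urlencode

# Standard Fandom API endpoint pattern; the wiki subdomain is an assumption.
API_URL = "https://shadowslave.fandom.com/api.php"

def build_query(title: str) -> str:
    """Build a MediaWiki API request URL for one page's raw wikitext."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"

# Fetching is then an ordinary GET, e.g. urllib.request.urlopen(build_query("Sunny")),
# and the response JSON is saved locally before any parsing happens.
print(build_query("Sunny"))
```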
- download_pages.py – Fetches pages from the Fandom (MediaWiki) API and stores raw content as JSON
- fandom_parser.py – Parses wiki markup and embedded HTML to separate tables into semi-structured data
- json_cleaner.py – Parses the JSON structure to clean results of HTML and wiki markup *(specifically designed for the Shadow Slave wiki due to its inconsistent and unconventional formatting)*
- search.py – Provides basic keyword search over parsed results
- Python
- MediaWiki / Fandom API
- mwparserfromhell
- BeautifulSoup4
- re (regex)
- JSON
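As an illustration of the HTML-cleaning role BeautifulSoup plays here, Fandom table cells often embed raw HTML (spans, line breaks, small-print notes), and `get_text()` flattens that into plain text. The cell content below is a made-up example:

```python
from bs4 import BeautifulSoup

# Hypothetical table cell with embedded HTML, typical of community wikis.
cell = '<span style="color:red">Transcendent</span><br/><small>(Chapter 1204)</small>'

# get_text() drops the tags; separator/strip keep the pieces readably spaced.
text = BeautifulSoup(cell, "html.parser").get_text(separator=" ", strip=True)
print(text)  # Transcendent (Chapter 1204)
```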
- Pages are fetched from the wiki API and saved locally
- Raw wiki markup is parsed and table data is extracted
- Data is serialized into structured JSON objects
- Embedded HTML is cleaned or removed
- Inconsistent formatting is normalized through custom parsing logic
- A search script allows querying across parsed pages
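The steps above can be sketched end to end with just the stdlib `re` and `json` modules. The table, field names, and cleaning rules below are illustrative; the real parser handles far messier markup:

```python
import json
import re

# A toy wikitext table standing in for downloaded page content.
RAW = """{| class="wikitable"
! Name !! Rank
|-
| [[Sunny]] || '''Awakened'''
|-
| [[Nephis|Changing Star]] || Ascended
|}"""

def clean(cell: str) -> str:
    """Strip common wiki markup: '''bold''' quotes and [[Link|label]] forms."""
    cell = re.sub(r"'''(.*?)'''", r"\1", cell)
    cell = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", cell)
    return cell.strip()

def parse_table(wikitext: str) -> list[dict]:
    """Turn one wikitext table into a list of {header: value} records."""
    headers, records, row = [], [], []
    for line in wikitext.splitlines():
        if line.startswith("!"):  # header row
            headers = [clean(h) for h in re.split(r"!!", line.lstrip("! "))]
        elif line.startswith("|-") or line.startswith("|}"):  # row/table end
            if row:
                records.append(dict(zip(headers, row)))
                row = []
        elif line.startswith("|"):  # data cells, possibly several per line
            row.extend(clean(c) for c in line.lstrip("| ").split("||"))
    return records

print(json.dumps(parse_table(RAW), indent=2))
```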
- Install dependencies
- Run the download script
- The download script automatically runs the parsing and cleaning scripts
- Use the search script to query results
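A keyword search along the lines of search.py's might look like the following. The record shape is a hypothetical assumption (flat JSON objects with string fields), not the tool's actual schema:

```python
# Sample records standing in for the cleaned JSON output.
PARSED = [
    {"Name": "Sunny", "Rank": "Awakened"},
    {"Name": "Changing Star", "Rank": "Ascended"},
]

def search(records: list[dict], keyword: str) -> list[dict]:
    """Case-insensitive substring match across every field of every record."""
    kw = keyword.lower()
    return [r for r in records if any(kw in str(v).lower() for v in r.values())]

print(search(PARSED, "star"))
```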
- Parsing logic is tailored to a specific wiki's formatting conventions
- Limited to extracting and formatting information stored in tables
- Not intended to be a universal MediaWiki parser
- Search functionality is intentionally limited
- Generalize for multiple wikis
- Add indexing for faster search
- Export to CSV or database formats
- Improve error handling and logging
- Add fuzzy matching to the description search
- Create a GUI to search parsed data
- Implement the search function in a Discord bot (discord.py with SQLite) to allow for community use (interest has been shown)