SQL-style query engine for HTML
MarkQL is a SQL-style query engine for HTML that lets you select precisely what you need, filter to the relevant parts of a page, and extract structured fields using the familiar SELECT ... FROM ... WHERE ... flow, rather than relying on brittle, ad-hoc scraping logic.
Prerequisites:
- CMake 3.16+
- A C++20 compiler
- Boost (multiprecision); set -DMARKQL_ENABLE_KHMER_NUMBER=OFF to skip Boost
- Optional dependencies: libxml2, curl, nlohmann_json, arrow/parquet
Ubuntu/Debian/WSL (minimal packages):
sudo apt update
sudo apt install -y \
git ca-certificates pkg-config \
build-essential cmake ninja-build \
libboost-dev
Optional feature packages:
sudo apt install -y libxml2-dev libcurl4-openssl-dev nlohmann-json3-dev
Arrow/Parquet packages (often missing on older distros):
sudo apt install -y libarrow-dev libparquet-dev
macOS (Homebrew):
xcode-select --install
brew install cmake ninja pkg-config boost
Optional feature packages:
brew install libxml2 curl nlohmann-json
Arrow/Parquet:
brew install apache-arrow
Build (project default):
./scripts/build/build.sh
Minimal build when optional dependencies are unavailable:
cmake -S . -B build \
-DMARKQL_WITH_LIBXML2=OFF \
-DMARKQL_WITH_CURL=OFF \
-DMARKQL_WITH_ARROW=OFF \
-DMARKQL_WITH_NLOHMANN_JSON=OFF \
-DMARKQL_BUILD_AGENT=ON \
-DMARKQL_AGENT_FETCH_DEPS=ON
cmake --build build
To build without Boost, add -DMARKQL_ENABLE_KHMER_NUMBER=OFF.
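The minimal build above switches every optional feature off by hand. As a rough sketch, the same flags could be derived from what pkg-config finds on the machine. The helper names `markql_feature_flag` and `markql_feature_flags` are hypothetical, and the pkg-config module names (`libxml-2.0`, `libcurl`, `arrow`, `nlohmann_json`) are assumptions that may differ by distribution; neither comes from the project's build scripts.

```shell
#!/bin/sh
# Hypothetical helper: print -DMARKQL_WITH_<NAME>=ON when pkg-config reports
# the given module, otherwise =OFF. A missing pkg-config also yields OFF.
markql_feature_flag() {
  name="$1"
  module="$2"
  if pkg-config --exists "$module" 2>/dev/null; then
    printf '%s\n' "-DMARKQL_WITH_${name}=ON"
  else
    printf '%s\n' "-DMARKQL_WITH_${name}=OFF"
  fi
}

# Print one flag per optional feature named in this README.
# The module names passed here are assumptions, not verified project settings.
markql_feature_flags() {
  markql_feature_flag LIBXML2 libxml-2.0
  markql_feature_flag CURL libcurl
  markql_feature_flag ARROW arrow
  markql_feature_flag NLOHMANN_JSON nlohmann_json
}

# Possible use (left unquoted on purpose so each flag becomes its own argument):
#   cmake -S . -B build $(markql_feature_flags)
```

If a feature is detected but still fails to configure, passing the corresponding `=OFF` flag explicitly remains the fallback.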
Run one query:
./build/markql --query "SELECT div FROM doc LIMIT 5;" --input ./data/index.html
Run interactive REPL:
./build/markql --interactive --input ./data/index.html
Current desktop releases ship three user-facing assets:
- MarkQL-Desktop-<version>-linux-x86_64.AppImage
- MarkQL-Desktop-<version>-windows-x86_64.msi
- markql-extension.zip
Python package releases continue to use v* tags. Desktop installer releases use desktop-v* tags.
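For release scripts that need to tell the two tag families apart, the prefix rule above can be sketched as a small shell function. The function name `markql_release_kind` is hypothetical and only illustrates the convention; it is not part of the repository.

```shell
#!/bin/sh
# Hypothetical helper: map a Git tag name to the release line it belongs to,
# using the two prefixes described above. The more specific pattern
# (desktop-v*) is listed first for readability.
markql_release_kind() {
  case "$1" in
    desktop-v*) echo desktop ;;
    v*)         echo python ;;
    *)          echo unknown ;;
  esac
}
```

A CI job could call this with the tag that triggered it and skip the desktop packaging step unless the result is `desktop`.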
Install flow today:
- Download and install MarkQL Desktop from the latest GitHub Release.
- Download markql-extension.zip from the same release and extract it.
- Open chrome://extensions.
- Enable Developer mode.
- Click Load unpacked.
- Select the extracted markql-extension folder.
- Launch MarkQL Desktop.
- Click Copy Token.
- Paste the token into the extension.
- Open a page and run queries.
Linux AppImage note:
- If the AppImage is not executable after download, run chmod +x MarkQL-Desktop-<version>-linux-x86_64.AppImage.
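The AppImage note can be wrapped in a small guard that only changes permissions when needed. This is a minimal sketch; the function name `ensure_executable` is hypothetical, and the file name is whatever you downloaded.

```shell
#!/bin/sh
# Hypothetical helper: make a downloaded file executable only if it is not
# already. Returns non-zero when the file does not exist.
ensure_executable() {
  file="$1"
  if [ ! -f "$file" ]; then
    echo "not found: $file" >&2
    return 1
  fi
  # -x tests whether the current user may execute the file.
  [ -x "$file" ] || chmod +x "$file"
}

# Possible use, with <version> replaced by the release you downloaded:
#   ensure_executable ./MarkQL-Desktop-<version>-linux-x86_64.AppImage
```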
Windows note:
- The MSI is unsigned in the MVP, so Windows may show an "unknown publisher" warning.
Build and run markql-agent (localhost 127.0.0.1:7337):
./scripts/build/build.sh
./scripts/agent/start-agent.sh
Notes:
- MARKQL_AGENT_TOKEN is the primary agent token variable.
- scripts/agent/start-agent.sh sets a default token if not provided.
- A legacy agent token variable still works during the migration window.
- You can set your own token:
MARKQL_AGENT_TOKEN=your-secret-token ./scripts/agent/start-agent.sh
Load the Chrome extension:
- Open chrome://extensions
- Enable Developer mode
- Click Load unpacked
- Select browser_plugin/extension
Extension host permission:
http://127.0.0.1:7337/*
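When setting your own MARKQL_AGENT_TOKEN, a random value is usually preferable to a hand-typed one. The sketch below generates one from /dev/urandom. The function name `generate_agent_token` is hypothetical, and this README does not say whether the agent restricts token length or alphabet, so the 64-character hex format is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print 32 random bytes as 64 lowercase hex characters.
# od emits the bytes as space-separated hex pairs; tr strips the spaces and
# line breaks so the result is a single token with no trailing newline.
generate_agent_token() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Possible use:
#   MARKQL_AGENT_TOKEN="$(generate_agent_token)" ./scripts/agent/start-agent.sh
```

The same value then has to be pasted into the extension, so it may help to store it in a variable and print it once before starting the agent.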
- Primary CLI binary is ./build/markql.
- Legacy compatibility binary ./build/markql is still generated.
- doc and document are both valid sources in FROM.
- If --input is omitted, the CLI reads HTML from stdin.
- URL sources (FROM 'https://...') require MARKQL_WITH_CURL=ON.
- TO PARQUET(...) requires MARKQL_WITH_ARROW=ON.
- INNER_HTML(...) returns minified HTML by default. Use RAW_INNER_HTML(...) for unmodified raw output.
- TO TABLE(...) supports explicit trimming/sparse options: TRIM_EMPTY_ROWS, TRIM_EMPTY_COLS, EMPTY_IS, STOP_AFTER_EMPTY_ROWS, FORMAT, SPARSE_SHAPE, and HEADER_NORMALIZE.
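Because the CLI reads HTML from stdin when --input is omitted, it can sit at the end of a pipeline. The wrapper below is a minimal sketch of that pattern. The names `MARKQL_BIN` and `markql_stdin` are hypothetical conveniences, not part of the project; only the `--query` flag and the stdin behavior come from the notes above.

```shell
#!/bin/sh
# Path to the CLI; override MARKQL_BIN if the binary lives elsewhere.
MARKQL_BIN="${MARKQL_BIN:-./build/markql}"

# Hypothetical wrapper: run one query against HTML arriving on stdin.
# No --input is passed, so the CLI reads the document from stdin.
markql_stdin() {
  "$MARKQL_BIN" --query "$1"
}

# Possible use (requires a built binary):
#   printf '<div>hello</div>' | markql_stdin "SELECT div FROM doc LIMIT 5;"
```

This keeps fetching and querying separate, which is one way to work with remote pages when the build has MARKQL_WITH_CURL=OFF and URL sources in FROM are unavailable.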
C++ tests:
cmake --build build --target markql_tests
ctest --test-dir build --output-on-failure
Python package/tests (optional):
./scripts/python/install.sh
./scripts/python/test.sh
Browser plugin UI tests (optional):
npm install
npx playwright install chromium
npm run test:browser-plugin
- Book (chapter path + verified examples): docs/book/SUMMARY.md
- Canonical tutorial: docs/markql-tutorial.md
- CLI guide: docs/markql-cli-guide.md
- Editor support plan: docs/editor-support-plan.md
- VS Code extension: docs/vscode-extension.md
- Vim plugin: docs/vim-plugin.md
- Docs index: docs/README.md
- Script layout: scripts/README.md
- Changelog: CHANGELOG.md
Apache License 2.0. See LICENSE.
