darhnoel/markql

MarkQL

SQL-style query engine for HTML

Status: Beta

MarkQL is a SQL-style query engine for HTML. It lets you select exactly the elements you need, filter to the relevant parts of a page, and extract structured fields with the familiar SELECT ... FROM ... WHERE ... flow instead of brittle, ad hoc scraping logic.

Quick Start

Prerequisites:

  • CMake 3.16+
  • A C++20 compiler
  • Boost (multiprecision); set -DMARKQL_ENABLE_KHMER_NUMBER=OFF to skip Boost
  • Optional dependencies: libxml2, curl, nlohmann_json, arrow/parquet

Ubuntu/Debian/WSL (minimal packages):

sudo apt update
sudo apt install -y \
  git ca-certificates pkg-config \
  build-essential cmake ninja-build \
  libboost-dev

Optional feature packages:

sudo apt install -y libxml2-dev libcurl4-openssl-dev nlohmann-json3-dev

Arrow/Parquet packages (often missing on older distros):

sudo apt install -y libarrow-dev libparquet-dev

macOS (Homebrew):

xcode-select --install
brew install cmake ninja pkg-config boost

Optional feature packages:

brew install libxml2 curl nlohmann-json

Arrow/Parquet:

brew install apache-arrow

Build (project default):

./scripts/build/build.sh

Minimal build when optional dependencies are unavailable:

cmake -S . -B build \
  -DMARKQL_WITH_LIBXML2=OFF \
  -DMARKQL_WITH_CURL=OFF \
  -DMARKQL_WITH_ARROW=OFF \
  -DMARKQL_WITH_NLOHMANN_JSON=OFF \
  -DMARKQL_BUILD_AGENT=ON \
  -DMARKQL_AGENT_FETCH_DEPS=ON
cmake --build build

To build without Boost, add -DMARKQL_ENABLE_KHMER_NUMBER=OFF.
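
If only some optional dependencies are installed, the -DMARKQL_WITH_* options above can be set one by one rather than all OFF. The following is a minimal sketch, not part of the project's scripts: the helper names have_pkg and markql_cmake_flags are hypothetical, and it assumes that pkg-config module names (libxml-2.0, libcurl, arrow, nlohmann_json) are a reasonable proxy for what the MarkQL CMake build can find on your system.

```shell
#!/bin/sh
# Hypothetical helper: print ON if pkg-config knows the module, otherwise OFF.
# A missing pkg-config binary also results in OFF.
have_pkg() {
  if pkg-config --exists "$1" 2>/dev/null; then
    echo ON
  else
    echo OFF
  fi
}

# Hypothetical helper: print one configure flag per line, using the
# option names listed in this README. The pkg-config module names are
# an assumption about how each dependency is packaged.
markql_cmake_flags() {
  printf '%s\n' \
    "-DMARKQL_WITH_LIBXML2=$(have_pkg libxml-2.0)" \
    "-DMARKQL_WITH_CURL=$(have_pkg libcurl)" \
    "-DMARKQL_WITH_ARROW=$(have_pkg arrow)" \
    "-DMARKQL_WITH_NLOHMANN_JSON=$(have_pkg nlohmann_json)"
}

# Show the flags that would be used on this machine.
markql_cmake_flags

# To configure with them from the repository root:
#   cmake -S . -B build $(markql_cmake_flags)
#   cmake --build build
```

Review the printed flags before configuring. A module that pkg-config reports may still be too old for the build, in which case set that option to OFF by hand.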

Run one query:

./build/markql --query "SELECT div FROM doc LIMIT 5;" --input ./data/index.html

Run interactive REPL:

./build/markql --interactive --input ./data/index.html

Install MarkQL Desktop

Current desktop releases ship three user-facing assets:

  • MarkQL-Desktop-<version>-linux-x86_64.AppImage
  • MarkQL-Desktop-<version>-windows-x86_64.msi
  • markql-extension.zip

Python package releases continue to use v* tags. Desktop installer releases use desktop-v* tags.
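
The two tag prefixes can be told apart with a simple pattern match, which may help when scripting against the release list. This is an illustrative sketch only: release_kind is a hypothetical helper, and the tag values passed to it are made-up examples rather than real releases.

```shell
#!/bin/sh
# Hypothetical helper: classify a release tag by the prefixes described above.
release_kind() {
  case "$1" in
    desktop-v*) echo desktop ;;  # desktop installer release
    v*)         echo python ;;   # Python package release
    *)          echo unknown ;;  # anything else
  esac
}

# Example tags are invented for illustration.
release_kind "v1.2.3"
release_kind "desktop-v1.2.3"
```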

Install flow today:

  1. Download and install MarkQL Desktop from the latest GitHub Release.
  2. Download markql-extension.zip from the same release and extract it.
  3. Open chrome://extensions.
  4. Enable Developer mode.
  5. Click Load unpacked.
  6. Select the extracted markql-extension folder.
  7. Launch MarkQL Desktop.
  8. Click Copy Token.
  9. Paste the token into the extension.
  10. Open a page and run queries.

Linux AppImage note:

  • If the AppImage is not executable after download, run chmod +x MarkQL-Desktop-<version>-linux-x86_64.AppImage.

Windows note:

  • The MSI is unsigned in the MVP, so Windows may show an "unknown publisher" warning.

Browser Plugin MVP

Build and run markql-agent (localhost 127.0.0.1:7337):

./scripts/build/build.sh
./scripts/agent/start-agent.sh

Notes:

  • MARKQL_AGENT_TOKEN is the primary agent token variable.
  • scripts/agent/start-agent.sh sets a default token if not provided.
  • A legacy agent token variable still works during the migration window.
  • To use your own token, set MARKQL_AGENT_TOKEN when starting the agent:

MARKQL_AGENT_TOKEN=your-secret-token ./scripts/agent/start-agent.sh
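
Rather than typing a token by hand, you might generate a random one. The sketch below assumes the agent accepts any opaque string as a token, which this README does not state explicitly. gen_token is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read 32 random bytes and print them as
# 64 lowercase hexadecimal characters on a single line.
gen_token() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

MARKQL_AGENT_TOKEN="$(gen_token)"
export MARKQL_AGENT_TOKEN
echo "token length: ${#MARKQL_AGENT_TOKEN}"

# Start the agent with this token when run from the repository root.
if [ -x ./scripts/agent/start-agent.sh ]; then
  ./scripts/agent/start-agent.sh
fi
```

The same token then has to be pasted into the extension, so print or copy it before starting the agent if you need it elsewhere.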

Load the Chrome extension:

  1. Open chrome://extensions
  2. Enable Developer mode
  3. Click Load unpacked
  4. Select browser_plugin/extension

Extension host permission:

  • http://127.0.0.1:7337/*

CLI Notes

  • Primary CLI binary is ./build/markql.
  • A legacy compatibility binary is also still generated in ./build.
  • doc and document are both valid sources in FROM.
  • If --input is omitted, the CLI reads HTML from stdin.
  • URL sources (FROM 'https://...') require MARKQL_WITH_CURL=ON.
  • TO PARQUET(...) requires MARKQL_WITH_ARROW=ON.
  • INNER_HTML(...) returns minified HTML by default. Use RAW_INNER_HTML(...) for unmodified raw output.
  • TO TABLE(...) supports explicit trimming/sparse options: TRIM_EMPTY_ROWS, TRIM_EMPTY_COLS, EMPTY_IS, STOP_AFTER_EMPTY_ROWS, FORMAT, SPARSE_SHAPE, and HEADER_NORMALIZE.
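
As an illustration of the stdin and FROM notes above, the following sketch pipes HTML into the CLI without --input and uses document as the source. It relies only on the --query flag shown in Quick Start. MARKQL_BIN and query_stdin are hypothetical names introduced for the example, not MarkQL settings, and the inline HTML is made up.

```shell
#!/bin/sh
# Illustrative variable (not a MarkQL setting): path to the built CLI.
MARKQL_BIN="${MARKQL_BIN:-./build/markql}"

# Hypothetical helper: run one query against HTML supplied on stdin.
# No --input flag is passed, so the CLI reads the document from stdin.
query_stdin() {
  "$MARKQL_BIN" --query "$1"
}

if [ -x "$MARKQL_BIN" ]; then
  # document is used here in place of doc; both are valid in FROM.
  printf '<div>one</div><div>two</div>' |
    query_stdin "SELECT div FROM document LIMIT 5;"
else
  echo "markql binary not found at $MARKQL_BIN; build it first" >&2
fi
```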

Testing

C++ tests:

cmake --build build --target markql_tests
ctest --test-dir build --output-on-failure

Python package/tests (optional):

./scripts/python/install.sh
./scripts/python/test.sh

Browser plugin UI tests (optional):

npm install
npx playwright install chromium
npm run test:browser-plugin

Documentation

License

Apache License 2.0. See LICENSE.
