VibeCoder01/pure-article

Pure Article

Extract the main article content from any URL using Mozilla Readability and JSDOM — available as a CLI and a minimal web UI.

Features

  • Clean article extraction (removes ads/nav/sidebar cruft)
  • Outputs title, optional byline, plain text, and the extracted HTML fragment
  • CLI with --json and --html modes
  • Simple Express web UI to paste a URL and view the result

Requirements

  • Node.js ≥ 18 (for the built-in fetch and WHATWG APIs)
  • Run npm ci to install the exact locked dependencies

Quick Start

CLI

  • Dev (TypeScript via tsx):
    • npm run dev -- <url> [--json] [--html]
  • Build + run:
    • npm run build
    • npm start -- <url> [--json] [--html]
  • Installed binary (after build):
    • pure-article <url> [--json] [--html]

Examples:

  • Text: npm run dev -- https://example.com/article
  • JSON: npm run dev -- https://example.com/article --json
  • HTML fragment: npm run dev -- https://example.com/article --html
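The flag handling implied by the examples above can be sketched as follows. This is a minimal illustration, not the project's actual parser: the names parseCliArgs, CliArgs, and OutputMode are hypothetical, and letting --json take precedence over --html when both are given is an assumption made only for this sketch.

```typescript
// Hypothetical output modes matching the documented flags.
type OutputMode = "text" | "json" | "html";

interface CliArgs {
  url: string;
  mode: OutputMode;
}

// Parse `<url> [--json] [--html]` from an argument list
// (for example process.argv.slice(2)).
function parseCliArgs(argv: string[]): CliArgs {
  let url: string | undefined;
  let json = false;
  let html = false;

  for (const arg of argv) {
    if (arg === "--json") json = true;
    else if (arg === "--html") html = true;
    else if (arg.startsWith("--")) throw new Error(`Unknown flag: ${arg}`);
    else if (url === undefined) url = arg;
    else throw new Error(`Unexpected argument: ${arg}`);
  }

  if (url === undefined) {
    throw new Error("Usage: pure-article <url> [--json] [--html]");
  }

  // The URL constructor throws a TypeError on malformed input,
  // so bad URLs fail before any network request is made.
  new URL(url);

  // Assumption: --json wins if both flags are present.
  const mode: OutputMode = json ? "json" : html ? "html" : "text";
  return { url, mode };
}
```

With no flags the sketch falls back to plain text, which matches the first example above.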

Web UI

  • Dev server: npm run web then open http://localhost:3000
  • Change port: PORT=3001 npm run web, or let the OS pick a random free port with PORT=0 npm run web
  • Built server: npm run build then npm run web:start
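One way the PORT handling described above could work is sketched below. The function name resolvePort is hypothetical and the server's real logic may differ; the fallback of 3000 is taken from the default URL shown above.

```typescript
// Resolve the listening port from an environment value such as process.env.PORT.
// Returns the fallback when the variable is unset or empty.
function resolvePort(raw: string | undefined, fallback = 3000): number {
  if (raw === undefined || raw.trim() === "") return fallback;

  const port = Number(raw);
  // 0 is deliberately allowed: it asks the operating system for a free port.
  if (!Number.isInteger(port) || port < 0 || port > 65535) {
    throw new Error(`Invalid PORT value: ${raw}`);
  }
  return port;
}
```

Passing 0 to a Node server's listen() makes the OS assign an unused port, which can then be read back from server.address() once the server is listening.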

Programmatic API

import { extractArticle } from './src/index.js';

const article = await extractArticle('https://example.com/post', {
  userAgent: 'MyBot/1.0',
  timeoutMs: 15000,
});

console.log(article.title);
console.log(article.byline);
console.log(article.contentText);
// article.contentHtml contains the Readability HTML fragment

Returned shape:

  • url: string
  • title: string
  • byline?: string
  • contentText: string (plain text, paragraphs preserved)
  • contentHtml?: string (original Readability HTML)
  • excerpt?: string | null
  • length?: number | null
  • siteName?: string | null
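The returned shape above can be written as a TypeScript interface. The interface name ExtractedArticle and the helper renderHeader are hypothetical, chosen only for this sketch; the library may export the type under a different name. The helper shows how a caller might handle the optional and nullable fields.

```typescript
// Mirrors the field list above; the type name is an assumption.
interface ExtractedArticle {
  url: string;
  title: string;
  byline?: string;
  contentText: string;
  contentHtml?: string;
  excerpt?: string | null;
  length?: number | null;
  siteName?: string | null;
}

// Build a short header, skipping fields that are missing, null, or empty.
function renderHeader(article: ExtractedArticle): string {
  const lines = [article.title];
  if (article.byline) lines.push(`By ${article.byline}`);
  if (article.siteName) lines.push(`Source: ${article.siteName}`);
  lines.push(article.url);
  return lines.join("\n");
}
```

Because byline is optional and siteName may be null, truthiness checks cover both cases without extra branching.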

Scripts

  • npm run dev — CLI in watch/dev mode
  • npm run web — start web UI in dev (set PORT as needed)
  • npm run build — type‑check and compile to dist/
  • npm start — run compiled CLI (node dist/cli.js)
  • npm run web:start — run compiled web server (node dist/server.js)
  • npm test — run Vitest
  • npm run lint / npm run format — ESLint / Prettier

Project Structure

  • src/ — TypeScript source (CLI, server, extractor)
  • tests/ — Vitest specs
  • dist/ — compiled JS output (generated)

Notes & Limits

  • Extraction quality depends on page markup; some sites may not parse perfectly.
  • Respect target sites’ Terms of Service and robots policies. Use responsibly.
  • Network timeouts and user‑agent can be adjusted via ExtractOptions.
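As a rough illustration of how the timeout and user-agent options might translate into a fetch call, consider the sketch below. Only the option names userAgent and timeoutMs come from the example above; the helper buildFetchInit and both default values are assumptions, and the extractor's real implementation may differ.

```typescript
// Option names taken from the Programmatic API example.
interface ExtractOptions {
  userAgent?: string;
  timeoutMs?: number;
}

// Hypothetical defaults for this sketch only.
const DEFAULT_USER_AGENT = "pure-article-sketch/0.0";
const DEFAULT_TIMEOUT_MS = 10_000;

// Turn ExtractOptions into the pieces a fetch() call needs.
function buildFetchInit(options: ExtractOptions = {}): {
  headers: Record<string, string>;
  signal: AbortSignal;
} {
  const timeoutMs = options.timeoutMs ?? DEFAULT_TIMEOUT_MS;
  if (!Number.isFinite(timeoutMs) || timeoutMs <= 0) {
    throw new Error(`timeoutMs must be a positive number, got ${timeoutMs}`);
  }
  return {
    headers: { "User-Agent": options.userAgent ?? DEFAULT_USER_AGENT },
    // AbortSignal.timeout aborts the request once the delay elapses.
    signal: AbortSignal.timeout(timeoutMs),
  };
}
```

The result can be passed as the second argument to fetch, for example fetch(url, buildFetchInit({ userAgent: 'MyBot/1.0', timeoutMs: 15000 })); when the timeout fires, the pending request rejects.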

Development

  • Install deps: npm ci
  • Lint/format: npm run lint / npm run format
  • Tests: npm test (add -- --coverage for coverage)

About

Takes a URL and strips out the fluff, leaving the content.
