PC Compare — Portfolio Company Validation Tool

A Flask web app that compares a quarterly portfolio baseline (Excel) against the live company list scraped from the PIF portfolio page, and reports what to add, remove, or rename.

There is one application — a browser-based web UI. There is no separate CLI tool. (An older terminal script, bkp-compare.py, has been removed; see .gitignore for the historical reference.)

Quick Start

cd webview
python3 -m venv venv
source venv/bin/activate              # Windows: venv\Scripts\activate
pip install -r requirements.txt
playwright install firefox chromium
python3 app.py

Open http://127.0.0.1:5000 and upload a baseline .xlsx file with columns: CR Name, Brand Name, VRP Sector.

How It Works

Upload — POST /upload accepts the Excel file plus scraping options (browser, headless, debug, timeout) and returns a result_id.
Background scrape — Flask spawns a thread that runs compare.scrape_website(), which traverses the PIF site's facet filters (3 Portfolio facets + 6 Ecosystem facets) using Playwright + playwright-stealth. Each facet pass scrapes the filtered company list across pagination; results are merged into a single (Company, Portfolio, Ecosystem) table. Firefox is the default (best Cloudflare bypass).
Match — enhanced_matching.enhanced_compare_companies() runs the baseline rows against the scraped list using five strategies in order: exact normalized → core name → acronym/substring → token-based → fuzzy fallback. See thresholds in webview/config.py.
Persist & summarize — results_analyzer.ResultsSummarizer writes a 3-sheet Excel report (Comparison Results, Unmatched Website Companies, Summary) and saves a row per run plus per-company history into webview/comparison_history.db (SQLite) for quarter-over-quarter trend analysis.
Poll & download — Browser polls GET /status/<result_id>, then fetches GET /summary/<result_id> and GET /download/<result_id> when complete.

Supported Baseline Templates

The app accepts two Excel formats and auto-detects which one was uploaded based on column headers.

Legacy template (existing stakeholder):

Required columns: CR Name, Brand Name, VRP Sector
Comparison: name + presence only. Sector is no longer compared because the website's card-level sector field has been removed.

New template (extended stakeholder):

Required columns: CR Name, Brand Name, Portfolio, Ecosystem (other columns are tolerated and ignored)
Comparison: name + presence + Portfolio assignment + Ecosystem assignment.
Ecosystem is sparse — rows with no Ecosystem value are reported as N/A rather than as mismatches.

Uploading a file that matches neither shape returns HTTP 400 with the expected column lists.

Project Layout

pc-compare/
├── webview/                       # The application (everything lives here)
│   ├── app.py                     # Flask routes + background processing
│   ├── compare.py                 # Playwright scraper (Cloudflare bypass)
│   ├── enhanced_matching.py       # 5-strategy matcher
│   ├── results_analyzer.py        # Summary + SQLite historical tracker
│   ├── config.py                  # WEBSITE_URL, thresholds, CSS selectors
│   ├── index.html                 # Single-page web UI
│   ├── debug_selectors.py         # Dev tool: opens visible browser to verify selectors
│   ├── requirements.txt
│   ├── uploads/                   # Uploaded baselines + generated results + debug dumps
│   └── comparison_history.db      # SQLite history (auto-created)
├── README.md                      # This file
├── SETUP.md                       # Short install walkthrough
├── CLOUDFLARE_TROUBLESHOOTING.md  # Bypass strategies + escalation path
├── requirements.txt               # Duplicate of webview/requirements.txt
└── config.py                      # Unused duplicate of webview/config.py — safe to delete

Output

Results land in webview/uploads/results_<YYYYMMDDHHMMSS>.xlsx. Status values per row:

Status	Meaning
`OK`	All applicable fields match (name + Portfolio + Ecosystem)
`Add`	In baseline, missing from website
`Remove`	On website, missing from baseline
`Requires update`	Present on both sides but at least one field differs — see per-field columns (`Match Score`, `Portfolio Match`, `Ecosystem Match`) for which

For the new template, the per-field Portfolio Match / Ecosystem Match columns take values Yes, No, or N/A (the latter when the baseline cell is empty or the company isn't on the website).

Configuration

Edit webview/config.py:

WEBSITE_URL — target portfolio page
FUZZY_MATCH_THRESHOLD (default 90) — minimum score for fuzzy name matching
SECTOR_MATCH_THRESHOLD (default 80) — minimum score for categorical (Portfolio/Ecosystem) matching
SELECTORS — CSS selectors for cookie banner, pagination, company cards, and facet panels. Update these if the PIF site changes its markup.

Cloudflare Troubleshooting

The scraper applies playwright-stealth automatically and detects Cloudflare challenge pages, waiting up to 30s (headless) or 120s (visible) for resolution. If scraping fails:

Switch to Firefox in the UI (already default).
Switch to Visible mode and solve any CAPTCHA manually.
Increase timeout to 120–180 s.
Inspect webview/uploads/error_screenshot_*.png and error_page_*.html (debug mode is on by default).
Run webview/debug_selectors.py to verify selectors haven't changed on the live page.

Full escalation path (incl. switching to camoufox or a managed bypass proxy) is in CLOUDFLARE_TROUBLESHOOTING.md.

Requirements

Python 3.8+
Playwright browsers (Firefox required, Chromium recommended)
Outbound HTTPS to pif.gov.sa

Upgrading from a previous version

The SQLite history schema changed in this release (sector columns removed, Portfolio + Ecosystem columns added). If you have an existing webview/comparison_history.db from a previous version, delete it before first run — the app will create a fresh one with the current schema. Old run history is not migrated.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PC Compare — Portfolio Company Validation Tool

Quick Start

How It Works

Supported Baseline Templates

Project Layout

Output

Configuration

Cloudflare Troubleshooting

Requirements

Upgrading from a previous version

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
docs/superpowers		docs/superpowers
webview		webview
.gitignore		.gitignore
CLOUDFLARE_TROUBLESHOOTING.md		CLOUDFLARE_TROUBLESHOOTING.md
README.md		README.md
SETUP.md		SETUP.md

Folders and files

Latest commit

History

Repository files navigation

PC Compare — Portfolio Company Validation Tool

Quick Start

How It Works

Supported Baseline Templates

Project Layout

Output

Configuration

Cloudflare Troubleshooting

Requirements

Upgrading from a previous version

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages