Inderdeep01/prometheus-sd-scanner

prometheus-sd-scanner

A network scanner that discovers Prometheus exporters by scanning subnets for configurable TCP ports and serves the results via HTTP for Prometheus/vmagent http_sd_configs.

Features

  • Multi-port scanning: Scan for any TCP ports (node_exporter, custom exporters, etc.)
  • Concurrent scanning: Worker pool with configurable parallelism
  • Rate limiting: Prevent network storms with configurable rate limits
  • Reverse DNS: Resolve IPs to hostnames via PTR records with caching
  • http_sd compatible: Native Prometheus HTTP service discovery format
  • Port filtering: Filter targets by port via query parameter
  • Prometheus metrics: Built-in /metrics endpoint for monitoring the scanner itself
  • Single binary: No runtime dependencies

Architecture

High-Level Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                         HOW IT WORKS                                         │
└─────────────────────────────────────────────────────────────────────────────┘

   Network Subnets                    Scanner                     Consumers
  ┌──────────────────┐           ┌──────────────────┐        ┌──────────────────┐
  │                  │           │                  │        │                  │
  │  10.174.41.0/24  │           │  prometheus-sd-  │        │     vmagent      │
  │  10.174.64.0/24  │──TCP───▶  │     scanner      │──HTTP──│   (Prometheus)   │
  │  10.174.65.0/24  │  scan     │                  │  GET   │                  │
  │  10.174.71.0/23  │           │  :8080           │        │  http_sd_configs │
  │                  │           │                  │        │                  │
  └──────────────────┘           └──────────────────┘        └──────────────────┘

         Ports scanned:                Endpoints:              Refresh interval:
         9100 (node_exporter)          /targets.json           30 seconds
         8080 (livesegmenter)          /health
         8081 (mediamuxer)             /metrics
         4999, 9090

Internal Components

┌─────────────────────────────────────────────────────────────────────────────┐
│                      SCANNER INTERNAL ARCHITECTURE                           │
└─────────────────────────────────────────────────────────────────────────────┘

  ┌─────────────────────────────────────────────────────────────────────────┐
  │                              main.go                                     │
  │   - CLI flags & environment variable parsing                            │
  │   - Signal handling (SIGINT, SIGTERM) for graceful shutdown             │
  │   - Orchestrates: initial scan → HTTP server → periodic scans           │
  └─────────────────────────────────────────────────────────────────────────┘
                                     │
          ┌──────────────────────────┼──────────────────────────┐
          │                          │                          │
          ▼                          ▼                          ▼
  ┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐
  │    config.go      │    │    scanner.go     │    │     http.go       │
  │                   │    │                   │    │                   │
  │  • Load config    │    │  • CIDR parsing   │    │  • /targets.json  │
  │  • CLI flags      │    │  • Worker pool    │    │  • /health        │
  │  • Env overrides  │    │  • Rate limiter   │    │  • /metrics       │
  │  • Validation     │    │  • TCP connect    │    │  • Port filtering │
  └───────────────────┘    │  • Result storage │    └───────────────────┘
                           └─────────┬─────────┘
                                     │
                      ┌──────────────┴──────────────┐
                      │                             │
                      ▼                             ▼
              ┌───────────────────┐       ┌───────────────────┐
              │     dns.go        │       │    metrics.go     │
              │                   │       │                   │
              │  • PTR lookups    │       │  • Prometheus     │
              │  • Result cache   │       │    gauges         │
              │  • TTL (10 min)   │       │  • Counters       │
              │  • Fallback: IP   │       │  • Histograms     │
              └───────────────────┘       └───────────────────┘

Scan Process Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                         SCANNING PROCESS                                     │
└─────────────────────────────────────────────────────────────────────────────┘

  Step 1: ENUMERATE HOSTS
  ───────────────────────────────────────────────────────────────────────────
  CIDR: 10.174.71.0/23 (a /23 starts on an even third octet, so this is the block 10.174.70.0/23)

  Network:   10.174.70.0   ──▶ (skipped)
  Broadcast: 10.174.71.255 ──▶ (skipped)
  Usable:    10.174.70.1 → 10.174.71.254 ──▶ 510 hosts
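
Host enumeration can be sketched as follows. This is a minimal illustration that assumes standard CIDR semantics from Go's net/netip package; usableHosts is a hypothetical helper, not the project's own function. Masked() normalizes a prefix whose host bits are set before the addresses are listed.

```go
package main

import (
	"fmt"
	"net/netip"
)

// usableHosts returns the addresses of an IPv4 prefix without the network
// and broadcast addresses. Hypothetical helper for illustration only.
func usableHosts(cidr string) ([]netip.Addr, error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return nil, err
	}
	p = p.Masked() // clear host bits so enumeration starts at the network address
	if !p.Addr().Is4() || p.Bits() > 30 {
		return nil, fmt.Errorf("unsupported prefix %s", cidr)
	}
	var all []netip.Addr
	for a := p.Addr(); p.Contains(a); a = a.Next() {
		all = append(all, a)
	}
	return all[1 : len(all)-1], nil // drop network (first) and broadcast (last)
}

func main() {
	hosts, err := usableHosts("10.174.64.0/24")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(hosts), hosts[0], hosts[len(hosts)-1]) // 254 10.174.64.1 10.174.64.254
}
```

The sketch materializes every address in memory, which is fine for /23 or /24 ranges but would be wasteful for very large prefixes.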


  Step 2: GENERATE SCAN JOBS
  ───────────────────────────────────────────────────────────────────────────
  For each host × each port:

  ┌─────────────────────────────────────────────────────────────────────────┐
  │ {10.174.71.1, 9100}  {10.174.71.1, 8080}  {10.174.71.1, 8081}  ...    │
  │ {10.174.71.2, 9100}  {10.174.71.2, 8080}  {10.174.71.2, 8081}  ...    │
  │ ...                                                                    │
  └─────────────────────────────────────────────────────────────────────────┘

  Example: 1272 hosts × 5 ports = 6360 jobs


  Step 3: WORKER POOL PROCESSING
  ───────────────────────────────────────────────────────────────────────────

        Jobs Channel                    Worker Pool (N=64)
       ┌───────────────┐          ┌─────────────────────────────┐
       │Job│Job│Job│...│─────────▶│ W1 │ W2 │ W3 │ ... │ W64  │
       └───────────────┘          └─────────────┬───────────────┘
                                                │
                                                ▼
                                    Rate Limiter: 200 conn/sec
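
Steps 2 and 3 can be sketched with the standard library alone. The example below is an assumption about how such a pool might look, not the project's code: the job type, runPool, and the stub probe are hypothetical, and a shared time.Ticker stands in for whatever rate limiter the scanner actually uses.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// job is one host/port pair to probe (hypothetical type).
type job struct {
	IP   string
	Port int
}

// runPool fans jobs out to `workers` goroutines. A shared ticker acts as a
// simple rate limiter: each probe first waits for a tick. perSecond must be > 0.
func runPool(jobs []job, workers, perSecond int, probe func(job) bool) []job {
	ticker := time.NewTicker(time.Second / time.Duration(perSecond))
	defer ticker.Stop()

	jobCh := make(chan job)
	resCh := make(chan job)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobCh {
				<-ticker.C // wait for a rate-limit slot
				if probe(j) {
					resCh <- j
				}
			}
		}()
	}
	go func() { // feed jobs, then close results once all workers finish
		for _, j := range jobs {
			jobCh <- j
		}
		close(jobCh)
		wg.Wait()
		close(resCh)
	}()

	var open []job
	for j := range resCh { // collector
		open = append(open, j)
	}
	return open
}

func main() {
	var jobs []job // Step 2: every host paired with every port
	for _, ip := range []string{"10.174.71.1", "10.174.71.2"} {
		for _, port := range []int{9100, 8080, 8081} {
			jobs = append(jobs, job{ip, port})
		}
	}
	// Stub probe: pretend only port 9100 is open.
	open := runPool(jobs, 4, 200, func(j job) bool { return j.Port == 9100 })
	fmt.Println(len(jobs), "jobs,", len(open), "open") // 6 jobs, 2 open
}
```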


  Step 4: TCP CONNECTION TEST
  ───────────────────────────────────────────────────────────────────────────

      Scanner                         Target
         │                              │
         │────── SYN ──────────────────▶│
         │                              │
         │◀───── SYN-ACK ──────────────│   Port OPEN ✓
         │                              │
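         │────── ACK ──────────────────▶│   (handshake completes)
         │                              │
         │────── FIN or RST ───────────▶│   (close)
         │                              │

      Timeout: 2 seconds
      RST from target → port closed
      No response before timeout → port filtered or host down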


  Step 5: DNS RESOLUTION (for open ports)
  ───────────────────────────────────────────────────────────────────────────

      IP: 10.174.71.50

      ┌─────────────────────────────────────────────────────────────────┐
      │  1. Check cache (TTL: 10 min)                                   │
      │  2. If miss → PTR lookup: dig -x 10.174.71.50                  │
      │  3. Result: stsgl01p1.psr-paytv.smf1.mobitv                    │
      │  4. If PTR fails → use IP as hostname                          │
      └─────────────────────────────────────────────────────────────────┘


  Step 6: STORE RESULTS (atomic)
  ───────────────────────────────────────────────────────────────────────────

      type Target struct {
          IP       string   // "10.174.71.50"
          Port     int      // 9100
          Hostname string   // "stsgl01p1.psr-paytv.smf1.mobitv"
      }

      Results swapped atomically (mutex-protected)
      No partial results visible during scan
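
One way to get the "no partial results" property is to build each scan's results in a private slice and publish them with a single assignment under a lock. The store type below is a hypothetical sketch of that idea, reusing the Target struct shown above.

```go
package main

import (
	"fmt"
	"sync"
)

type Target struct {
	IP       string
	Port     int
	Hostname string
}

// store holds the latest complete scan result (hypothetical type).
type store struct {
	mu      sync.RWMutex
	targets []Target
}

// Swap replaces the published results in one step once a scan finishes.
func (s *store) Swap(next []Target) {
	s.mu.Lock()
	s.targets = next
	s.mu.Unlock()
}

// Snapshot returns a copy so callers never observe later changes.
func (s *store) Snapshot() []Target {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return append([]Target(nil), s.targets...)
}

func main() {
	var s store
	scan := []Target{{"10.174.71.50", 9100, "stsgl01p1.psr-paytv.smf1.mobitv"}}
	fmt.Println(len(s.Snapshot())) // 0: nothing is published while the scan runs
	s.Swap(scan)
	fmt.Println(len(s.Snapshot())) // 1: the whole result appears at once
}
```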

Concurrency Model

┌─────────────────────────────────────────────────────────────────────────────┐
│                         GOROUTINE ARCHITECTURE                               │
└─────────────────────────────────────────────────────────────────────────────┘

                              Main Goroutine
                         ┌─────────────────────┐
                         │ • Context management│
                         │ • Signal handling   │
                         │ • Scan ticker (5min)│
                         └──────────┬──────────┘
                                    │
           ┌────────────────────────┼────────────────────────┐
           │                        │                        │
           ▼                        ▼                        ▼
    ┌─────────────┐        ┌─────────────┐        ┌─────────────────┐
    │   Scanner   │        │ HTTP Server │        │ Periodic Scan   │
    │  Goroutine  │        │  Goroutine  │        │   Goroutine     │
    └──────┬──────┘        └─────────────┘        └─────────────────┘
           │
           │ Spawns worker pool for each scan
           ▼
    ┌─────────────────────────────────────────────────────────────────┐
    │                    Worker Pool (64 goroutines)                   │
    │                                                                  │
    │   ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐         ┌────┐            │
    │   │ W1 │ │ W2 │ │ W3 │ │ W4 │ │ W5 │  . . .  │W64 │            │
    │   └──┬─┘ └──┬─┘ └──┬─┘ └──┬─┘ └──┬─┘         └──┬─┘            │
    │      │      │      │      │      │              │               │
    │      └──────┴──────┴──────┴──────┴──────────────┘               │
    │                          │                                       │
    │                          ▼                                       │
    │                 ┌─────────────────┐                              │
    │                 │ Results Channel │                              │
    │                 └────────┬────────┘                              │
    │                          │                                       │
    │                          ▼                                       │
    │                 ┌─────────────────┐                              │
    │                 │    Collector    │                              │
    │                 │    Goroutine    │                              │
    │                 └─────────────────┘                              │
    └─────────────────────────────────────────────────────────────────┘

    Synchronization:
    • WaitGroup  - ensures all workers complete
    • Mutex      - protects result storage
    • Channels   - job distribution & result collection
    • Context    - graceful cancellation

Integration with Monitoring Stack

┌─────────────────────────────────────────────────────────────────────────────┐
│                    MONITORING STACK INTEGRATION                              │
└─────────────────────────────────────────────────────────────────────────────┘

  ┌─────────────────┐
  │ Target Hosts    │
  │                 │
  │ node_exporter   │◀────────────────────────────────────────┐
  │ :9100           │                                         │
  │                 │                                         │
  │ livesegmenter   │◀────────────────────────────────────┐   │
  │ :8080           │                                     │   │
  │                 │                                     │   │  Scrapes
  │ mediamuxer      │◀────────────────────────────────┐   │   │  metrics
  │ :8081           │                                 │   │   │
  └─────────────────┘                                 │   │   │
          ▲                                           │   │   │
          │ TCP scan                                  │   │   │
          │                                           │   │   │
  ┌───────┴─────────┐      HTTP GET             ┌────┴───┴───┴────┐
  │ prometheus-sd-  │      /targets.json        │                 │
  │    scanner      │◀──────────────────────────│     vmagent     │
  │                 │      (every 30s)          │                 │
  │  :8080          │                           └────────┬────────┘
  └─────────────────┘                                    │
                                                         │ Remote write
                                                         ▼
                                              ┌─────────────────────┐
                                              │   VictoriaMetrics   │
                                              │                     │
                                              │  vminsert:8480      │
                                              │  vmstorage          │
                                              │  vmselect:8481      │
                                              └──────────┬──────────┘
                                                         │
                           ┌─────────────────────────────┼─────────────────┐
                           │                             │                 │
                           ▼                             ▼                 ▼
                    ┌─────────────┐            ┌─────────────┐     ┌─────────────┐
                    │   VMAlert   │            │   Grafana   │     │   Icinga    │
                    │   :8880     │            │ (dashboards)│     │ (checks)    │
                    └──────┬──────┘            └─────────────┘     └─────────────┘
                           │
                           ▼
                    ┌─────────────┐
                    │Alertmanager │──────▶ Email / PagerDuty / Slack
                    │   :9093     │
                    └─────────────┘

Quick Start

# Build
make build

# Run locally
./bin/prometheus-sd-scanner \
    -networks "10.174.71.0/23" \
    -ports "9100,8080" \
    -interval "5m"

# Test endpoint
curl http://localhost:8080/targets.json
curl "http://localhost:8080/targets.json?port=9100"
curl http://localhost:8080/health
curl http://localhost:8080/metrics

Configuration

CLI Flags

Flag            │ Default                  │ Description
────────────────┼──────────────────────────┼─────────────────────────────────────────────
-networks       │ 10.174.71.0/23           │ Comma-separated CIDR ranges to scan
-ports          │ 9100,8080,8081,4999,9090 │ Comma-separated TCP ports to scan
-interval       │ 5m                       │ Scan interval (requires time unit: 5m, 300s)
-workers        │ 16                       │ Number of concurrent scanning goroutines
-timeout        │ 1s                       │ TCP connection timeout
-http-port      │ 8080                     │ HTTP server port
-rate-limit     │ 50                       │ Maximum connections per second
-log-level      │ info                     │ Log level: debug, info, warn, error
-skip-broadcast │ true                     │ Skip network and broadcast addresses

Environment Variables

Environment variables override CLI flags:

export SCANNER_NETWORKS="10.174.71.0/23,192.168.1.0/24"
export SCANNER_PORTS="9100,8080"
export SCANNER_INTERVAL="5m"
export SCANNER_WORKERS="32"
export SCANNER_HTTP_PORT="8080"
export SCANNER_RATE_LIMIT="100"
export SCANNER_LOG_LEVEL="debug"
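
The precedence described above (flags parsed first, then SCANNER_* variables replacing them) could be implemented roughly like this. The config struct and load function are hypothetical, and only two of the settings are shown.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strconv"
)

type config struct {
	Networks string
	Workers  int
}

// load parses CLI flags first, then lets SCANNER_* variables replace them.
// getenv is passed in so the function can be exercised without real env vars.
func load(args []string, getenv func(string) string) (config, error) {
	var c config
	fs := flag.NewFlagSet("prometheus-sd-scanner", flag.ContinueOnError)
	fs.StringVar(&c.Networks, "networks", "10.174.71.0/23", "CIDR ranges to scan")
	fs.IntVar(&c.Workers, "workers", 16, "concurrent scanning goroutines")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	if v := getenv("SCANNER_NETWORKS"); v != "" {
		c.Networks = v
	}
	if v := getenv("SCANNER_WORKERS"); v != "" {
		n, err := strconv.Atoi(v)
		if err != nil {
			return c, fmt.Errorf("SCANNER_WORKERS: %w", err)
		}
		c.Workers = n
	}
	return c, nil
}

func main() {
	c, err := load(os.Args[1:], os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("%+v\n", c)
}
```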

Performance Tuning

┌────────────────────────────────────────────────────────────────────────────┐
│                      PERFORMANCE GUIDELINES                                 │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│  Scan Time Calculation:                                                    │
│  ─────────────────────────────────────────────────────────────────────     │
│  Total jobs = hosts × ports                                                │
│  Minimum time = Total jobs / rate-limit                                    │
│                                                                            │
│  Example: 1272 hosts × 5 ports = 6360 jobs                                │
│           6360 / 200 conn/sec = 31.8 seconds (theoretical)                │
│           Actual: ~3 minutes (includes DNS, timeouts)                      │
│                                                                            │
├────────────────────────────────────────────────────────────────────────────┤
│  Setting         │ Impact                                                  │
│  ────────────────┼────────────────────────────────────────────────────     │
│  workers ↑       │ More parallelism, higher memory/network load           │
│  rate-limit ↑    │ Faster scans, risk of network congestion               │
│  timeout ↓       │ Faster scans, may miss slow-responding hosts           │
│  interval ↓      │ More frequent updates, higher resource usage           │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│  Recommended Profiles:                                                     │
│  ─────────────────────────────────────────────────────────────────────     │
│  Production:    -workers 64  -rate-limit 200 -timeout 2s -interval 5m     │
│  High-freq:     -workers 128 -rate-limit 500 -timeout 1s -interval 2m     │
│  Low-impact:    -workers 16  -rate-limit 50  -timeout 3s -interval 10m    │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘

HTTP Endpoints

Endpoint                     │ Description
─────────────────────────────┼───────────────────────────────────────────────────────
/targets.json                │ All discovered targets in http_sd format
/targets.json?port=9100      │ Filter by single port
/targets.json?port=9100,8080 │ Filter by multiple ports
/health                      │ Health check (returns 503 before first scan completes)
/metrics                     │ Prometheus metrics
/                            │ API information

Health Endpoint Response

{
  "status": "healthy",
  "last_scan": "2026-02-03T14:32:00Z",
  "last_scan_ago": "2m30s",
  "last_scan_duration": "3m14s",
  "scan_interval": "5m0s",
  "targets_count": 218
}
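
The 503-before-first-scan behavior can be sketched with a readiness flag. The scanned variable, the "starting" status string, and the reduced response body are assumptions for illustration; only the "healthy" status and the 503 code come from this README.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// scanned flips to true after the first scan completes (hypothetical flag).
var scanned atomic.Bool

func healthHandler(w http.ResponseWriter, r *http.Request) {
	status, code := "healthy", http.StatusOK
	if !scanned.Load() {
		status, code = "starting", http.StatusServiceUnavailable
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	json.NewEncoder(w).Encode(map[string]string{"status": status})
}

// check calls the handler in-process and returns the HTTP status code.
func check() int {
	rec := httptest.NewRecorder()
	healthHandler(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	return rec.Code
}

func main() {
	fmt.Println(check()) // 503 before the first scan
	scanned.Store(true)
	fmt.Println(check()) // 200 afterwards
}
```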

Output Format

The /targets.json endpoint returns Prometheus http_sd compatible JSON:

[
  {
    "targets": ["stsgl01p1.psr-paytv.smf1.mobitv:9100"],
    "labels": {
      "__meta_sd_hostname": "stsgl01p1.psr-paytv.smf1.mobitv",
      "__meta_sd_ip": "10.174.71.50",
      "__meta_sd_port": "9100"
    }
  }
]

Label Reference

Label              │ Description
───────────────────┼───────────────────────────────────────────────
__meta_sd_hostname │ FQDN from reverse DNS (or IP if no PTR record)
__meta_sd_ip       │ Raw IP address
__meta_sd_port     │ Port number as string

Prometheus/vmagent Configuration

scrape_configs:
  - job_name: 'node_exporter'
    http_sd_configs:
      - url: 'http://scanner-host:8080/targets.json?port=9100'
        refresh_interval: 30s
    relabel_configs:
      # Extract node_type from hostname prefix
      - source_labels: [__meta_sd_hostname]
        regex: '([a-zA-Z]+)\d+.*'
        target_label: node_type
        replacement: '${1}'

  - job_name: 'livesegmenter'
    http_sd_configs:
      - url: 'http://scanner-host:8080/targets.json?port=8080'
        refresh_interval: 30s

Building

# Build for current platform
make build

# Build for Linux (deployment)
make build-linux

# Build for all platforms
make build-all

# Run tests
make test

Deployment

Systemd

# Copy binary
sudo cp bin/prometheus-sd-scanner-linux-amd64 /opt/net-discovery/prometheus-sd-scanner
sudo chmod +x /opt/net-discovery/prometheus-sd-scanner

# Copy service file
sudo cp prometheus-sd-scanner.service /etc/systemd/system/

# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus-sd-scanner

# Check status
sudo systemctl status prometheus-sd-scanner
journalctl -u prometheus-sd-scanner -f

Quick deploy

make deploy  # Deploys to invim01p2

Metrics

The scanner exposes the following Prometheus metrics at /metrics:

Scan Metrics

Metric                        │ Type      │ Description
──────────────────────────────┼───────────┼────────────────────────────────
scanner_scan_duration_seconds │ Histogram │ Duration of network scans
scanner_scans_total           │ Counter   │ Total number of scans performed
scanner_scan_errors_total     │ Counter   │ Total number of scan errors

Target Metrics

Metric                      │ Type  │ Description
────────────────────────────┼───────┼────────────────────────────────
scanner_targets_total{port} │ Gauge │ Discovered targets by port
scanner_targets_discovered  │ Gauge │ Total unique targets discovered

Connection Metrics

Metric                                   │ Type    │ Description
─────────────────────────────────────────┼─────────┼───────────────────────────
scanner_connection_attempts_total{port}  │ Counter │ TCP connection attempts
scanner_connection_successes_total{port} │ Counter │ Successful TCP connections
scanner_connection_timeouts_total{port}  │ Counter │ TCP connection timeouts

DNS Metrics

Metric                          │ Type    │ Description
────────────────────────────────┼─────────┼────────────────────
scanner_dns_lookups_total       │ Counter │ DNS reverse lookups
scanner_dns_cache_hits_total    │ Counter │ DNS cache hits
scanner_dns_lookup_errors_total │ Counter │ DNS lookup errors

HTTP Metrics

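Metric                                      │ Type      │ Description
────────────────────────────────────────────┼───────────┼─────────────────────────────────
scanner_http_requests_total{path,status}    │ Counter   │ HTTP requests by path and status
scanner_http_request_duration_seconds{path} │ Histogram │ Request duration by endpoint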

Troubleshooting

Common Issues

Issue: Health endpoint returns 503
─────────────────────────────────────────────────────────────────────────────
Cause:  Initial scan not yet complete
Fix:    Wait for first scan to finish (check logs)

Issue: Low target count
─────────────────────────────────────────────────────────────────────────────
Causes: Firewall blocking scanner, exporters not running, wrong subnet
Debug:  curl http://scanner:8080/metrics | grep connection_timeouts

Issue: Scan taking too long
─────────────────────────────────────────────────────────────────────────────
Causes: Low workers/rate-limit, network latency, slow DNS
Fix:    Increase -workers and -rate-limit

Issue: Missing hostnames (showing IPs)
─────────────────────────────────────────────────────────────────────────────
Cause:  PTR records not configured for those IPs
Debug:  dig -x <IP_ADDRESS>

Useful Commands

# Check service status
systemctl status prometheus-sd-scanner

# View logs
journalctl -u prometheus-sd-scanner -f

# Test endpoints
curl -s http://localhost:8080/health | jq .
curl -s http://localhost:8080/targets.json | jq length
curl -s http://localhost:8080/metrics | grep scanner_

Project Structure

prometheus-sd-scanner/
├── main.go              # Entry point, lifecycle management
├── config.go            # Configuration loading (CLI + env)
├── scanner.go           # Network scanning logic, worker pool
├── dns.go               # Reverse DNS resolution with caching
├── http.go              # HTTP server and endpoints
├── metrics.go           # Prometheus metrics definitions
├── logger.go            # Structured logging
├── go.mod
├── go.sum
├── Makefile             # Build targets
├── prometheus-sd-scanner.service  # Systemd unit file
└── README.md

License

MIT
