[BUG] query_stats_collector: query_plan_text collection causes unsustainable database growth

### Component

SQL collection scripts

### Performance Monitor Version

PerformanceMonitor v2.0.0

### SQL Server Version

SQL Server 2019 Standard Edition (15.0.2130.3)

### Windows Version

Windows Server 2019

### Describe the Bug

`collect.query_stats_collector` collects execution plans via 
`sys.dm_exec_text_query_plan` for every query in 
`sys.dm_exec_query_stats`. On busy instances with many databases, 
this causes unsustainable storage growth.

The plan XML is already collected separately by `query_store_collector` 
in `collect.query_store_data`, making the `query_plan_text` column 
in `collect.query_stats` redundant.

Additionally, the default collection frequency of 2 minutes combined 
with the Hung Job Monitor threshold of 5 minutes means the collector 
is systematically killed before it can finish on busy instances, 
while still generating massive I/O and tempdb pressure on every attempt.

### Steps to Reproduce

1. Install PerformanceMonitor v2.0.0 on a SQL Server instance 
   with 20+ active databases
2. Let the collection job run at its default schedule (every 2 minutes)
3. Check storage consumption of collect.query_stats after a few hours:

    SELECT 
        COUNT(*) AS nb_lignes,
        SUM(DATALENGTH(query_plan_text)) / 1024 / 1024 AS plans_xml_mb,
        SUM(DATALENGTH(query_text)) / 1024 / 1024 AS query_text_mb,
        SUM(DATALENGTH(query_plan_text) + DATALENGTH(query_text)) 
            / COUNT(*) / 1024 AS avg_kb_per_row
    FROM collect.query_stats;

4. Check collection_log for duration and rows collected per run

### Expected Behavior

Storage growth of collect.query_stats should be proportional to 
query activity and manageable within the configured retention window.

With query_text only (~1.3 KB per row), 7 days of data at 25,000 rows 
per 15-minute collection cycle would represent ~150 MB — 
perfectly manageable.

The collector should complete within its max_duration_minutes 
without being killed by the Hung Job Monitor.

### Actual Behavior

On our environment (20+ active databases, collection every 15 minutes 
after we manually corrected the frequency):

collect.query_stats after ~2 hours of collection:
- 160,566 rows
- 11.4 GB total
- of which query_plan_text alone: 6.5 GB (97%)
- of which query_text: 143 MB (1%)

Average row size: 63 KB — almost entirely due to the XML plan.

At ~25,000 new rows per collection cycle every 15 minutes:
- ~1.5 GB added every 15 minutes
- ~6 GB per hour
- Projected: ~144 GB/day

At the default frequency of 2 minutes the situation is 3x worse.

The PerformanceMonitor database exhausted all available 
disk space within hours of installation.

### Error Messages / Log Output

```text
No explicit errors — the collector runs successfully but silently 
consumes disk space until the drive is full.

Observed in collection_log:

    collector_name          rows_collected  duration_ms
    query_stats_collector       37,311       82,899
    query_stats_collector       37,237      115,746
    query_stats_collector       32,681      110,120
    query_stats_collector       27,874       59,553
    query_stats_collector       25,463       53,427
    query_stats_collector       23,330       71,789

At default frequency (every 2 minutes), the Hung Job Monitor 
kills the job after 5 minutes on every run:

    collector_name    collection_status   error_message
    hung_job_monitor  JOB_HUNG            Job "PerformanceMonitor - Collection" 
                                          hung for 7 minutes (threshold: 5 minutes). 
                                          Stopping job.
```

### Screenshots

_No response_

### Additional Context

**Environment:**
- SQL Server 2019 Standard Edition (15.0.2130.3)
- Windows Server 2019
- PerformanceMonitor v2.0.0
- 20+ active databases

**Root cause #1 — Plan XML storage:**
The OUTER APPLY to sys.dm_exec_text_query_plan fetches and stores 
the full XML execution plan for every collected query:

    -- This single column represents 97% of table size:
    query_plan_text = tqp.query_plan

    OUTER APPLY sys.dm_exec_text_query_plan
    (
        qs.plan_handle,
        qs.statement_start_offset,
        qs.statement_end_offset
    ) AS tqp

**Root cause #2 — Default frequency too aggressive:**
Default schedule for heavy collectors:

    collector_name          frequency_minutes   max_duration_minutes
    query_stats_collector           2                   5
    query_store_collector           2                  10
    procedure_stats_collector       2                  10

Combined with Hung Job Monitor threshold of 5 minutes, these 
collectors are systematically killed before completion on busy instances.

**Proposed fixes:**

Fix #1 — Disable plan XML collection in query_stats_collector:

    -- Remove:
    OUTER APPLY sys.dm_exec_text_query_plan(...) AS tqp

    -- Replace:
    query_plan_text = tqp.query_plan

    -- With:
    query_plan_text = NULL

This reduces average row size from ~63 KB to ~1.3 KB — 
a 98% reduction in storage consumption.

Alternatively, make plan collection opt-in via a parameter:

    @collect_query_plans BIT = 0

Fix #2 — Increase default frequency for heavy collectors 
in installation scripts:

    -- In 04_create_schedule_table.sql or equivalent:
    query_stats_collector:      frequency_minutes = 15, max_duration_minutes = 12
    query_store_collector:      frequency_minutes = 15, max_duration_minutes = 12
    procedure_stats_collector:  frequency_minutes = 15, max_duration_minutes = 12
    plan_cache_stats_collector: frequency_minutes = 10, max_duration_minutes = 8

Fix #3 — Increase Hung Job Monitor default threshold
in 45_create_agent_jobs.sql:

    -- Replace:
    @normal_max_duration_minutes = 5

    -- With:
    @normal_max_duration_minutes = 15

**Workarounds applied on our environment:**
1. Patched collect.query_stats_collector locally to set 
   query_plan_text = NULL and removed OUTER APPLY 
   sys.dm_exec_text_query_plan. After truncating collect.query_stats 
   (11.4 GB) and applying the patch, projected storage growth dropped 
   from ~144 GB/day to ~150 MB/7 days.

2. Updated collection frequencies:

    UPDATE config.collection_schedule
    SET frequency_minutes = 15, max_duration_minutes = 12
    WHERE collector_name IN
    (
        N'query_stats_collector',
        N'query_store_collector',
        N'procedure_stats_collector'
    );

    UPDATE config.collection_schedule
    SET frequency_minutes = 10, max_duration_minutes = 8
    WHERE collector_name = N'plan_cache_stats_collector';

3. Updated Hung Job Monitor threshold:

    EXEC msdb.dbo.sp_update_jobstep
        @job_name = N'PerformanceMonitor - Hung Job Monitor',
        @step_id = 1,
        @command = N'EXECUTE config.check_hung_collector_job
        @job_name = N''PerformanceMonitor - Collection'',
        @normal_max_duration_minutes = 15,
        @first_run_max_duration_minutes = 30,
        @stop_hung_job = 1,
        @debug = 0;';

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] query_stats_collector: query_plan_text collection causes unsustainable database growth #337

Component

Performance Monitor Version

SQL Server Version

Windows Version

Describe the Bug

Steps to Reproduce

Expected Behavior

Actual Behavior

Error Messages / Log Output

Screenshots

Additional Context

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

[BUG] query_stats_collector: query_plan_text collection causes unsustainable database growth #337

Description

Component

Performance Monitor Version

SQL Server Version

Windows Version

Describe the Bug

Steps to Reproduce

Expected Behavior

Actual Behavior

Error Messages / Log Output

Screenshots

Additional Context

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions