Skip to content

[BUG] query_stats_collector: query_plan_text collection causes unsustainable database growth #337

@ydeo-fcourtaud

Description

@ydeo-fcourtaud

Component

SQL collection scripts

Performance Monitor Version

PerformanceMonitor v2.0.0

SQL Server Version

SQL Server 2019 Standard Edition (15.0.2130.3)

Windows Version

Windows Server 2019

Describe the Bug

collect.query_stats_collector collects execution plans via
sys.dm_exec_text_query_plan for every query in
sys.dm_exec_query_stats. On busy instances with many databases,
this causes unsustainable storage growth.

The plan XML is already collected separately by query_store_collector
in collect.query_store_data, making the query_plan_text column
in collect.query_stats redundant.

Additionally, the default collection frequency of 2 minutes combined
with the Hung Job Monitor threshold of 5 minutes means the collector
is systematically killed before it can finish on busy instances,
while still generating massive I/O and tempdb pressure on every attempt.

Steps to Reproduce

  1. Install PerformanceMonitor v2.0.0 on a SQL Server instance
    with 20+ active databases

  2. Let the collection job run at its default schedule (every 2 minutes)

  3. Check storage consumption of collect.query_stats after a few hours:

    SELECT
    COUNT() AS nb_lignes,
    SUM(DATALENGTH(query_plan_text)) / 1024 / 1024 AS plans_xml_mb,
    SUM(DATALENGTH(query_text)) / 1024 / 1024 AS query_text_mb,
    SUM(DATALENGTH(query_plan_text) + DATALENGTH(query_text))
    / COUNT(
    ) / 1024 AS avg_kb_per_row
    FROM collect.query_stats;

  4. Check collection_log for duration and rows collected per run

Expected Behavior

Storage growth of collect.query_stats should be proportional to
query activity and manageable within the configured retention window.

With query_text only (~1.3 KB per row), 7 days of data at 25,000 rows
per 15-minute collection cycle would represent ~150 MB —
perfectly manageable.

The collector should complete within its max_duration_minutes
without being killed by the Hung Job Monitor.

Actual Behavior

On our environment (20+ active databases, collection every 15 minutes
after we manually corrected the frequency):

collect.query_stats after ~2 hours of collection:

  • 160,566 rows
  • 11.4 GB total
  • of which query_plan_text alone: 6.5 GB (97%)
  • of which query_text: 143 MB (1%)

Average row size: 63 KB — almost entirely due to the XML plan.

At ~25,000 new rows per collection cycle every 15 minutes:

  • ~1.5 GB added every 15 minutes
  • ~6 GB per hour
  • Projected: ~144 GB/day

At the default frequency of 2 minutes the situation is 3x worse.

The PerformanceMonitor database exhausted all available
disk space within hours of installation.

Error Messages / Log Output

No explicit errors — the collector runs successfully but silently 
consumes disk space until the drive is full.

Observed in collection_log:

    collector_name          rows_collected  duration_ms
    query_stats_collector       37,311       82,899
    query_stats_collector       37,237      115,746
    query_stats_collector       32,681      110,120
    query_stats_collector       27,874       59,553
    query_stats_collector       25,463       53,427
    query_stats_collector       23,330       71,789

At default frequency (every 2 minutes), the Hung Job Monitor 
kills the job after 5 minutes on every run:

    collector_name    collection_status   error_message
    hung_job_monitor  JOB_HUNG            Job "PerformanceMonitor - Collection" 
                                          hung for 7 minutes (threshold: 5 minutes). 
                                          Stopping job.

Screenshots

No response

Additional Context

Environment:

  • SQL Server 2019 Standard Edition (15.0.2130.3)
  • Windows Server 2019
  • PerformanceMonitor v2.0.0
  • 20+ active databases

Root cause #1 — Plan XML storage:
The OUTER APPLY to sys.dm_exec_text_query_plan fetches and stores
the full XML execution plan for every collected query:

-- This single column represents 97% of table size:
query_plan_text = tqp.query_plan

OUTER APPLY sys.dm_exec_text_query_plan
(
    qs.plan_handle,
    qs.statement_start_offset,
    qs.statement_end_offset
) AS tqp

Root cause #2 — Default frequency too aggressive:
Default schedule for heavy collectors:

collector_name          frequency_minutes   max_duration_minutes
query_stats_collector           2                   5
query_store_collector           2                  10
procedure_stats_collector       2                  10

Combined with Hung Job Monitor threshold of 5 minutes, these
collectors are systematically killed before completion on busy instances.

Proposed fixes:

Fix #1 — Disable plan XML collection in query_stats_collector:

-- Remove:
OUTER APPLY sys.dm_exec_text_query_plan(...) AS tqp

-- Replace:
query_plan_text = tqp.query_plan

-- With:
query_plan_text = NULL

This reduces average row size from ~63 KB to ~1.3 KB —
a 98% reduction in storage consumption.

Alternatively, make plan collection opt-in via a parameter:

@collect_query_plans BIT = 0

Fix #2 — Increase default frequency for heavy collectors
in installation scripts:

-- In 04_create_schedule_table.sql or equivalent:
query_stats_collector:      frequency_minutes = 15, max_duration_minutes = 12
query_store_collector:      frequency_minutes = 15, max_duration_minutes = 12
procedure_stats_collector:  frequency_minutes = 15, max_duration_minutes = 12
plan_cache_stats_collector: frequency_minutes = 10, max_duration_minutes = 8

Fix #3 — Increase Hung Job Monitor default threshold
in 45_create_agent_jobs.sql:

-- Replace:
@normal_max_duration_minutes = 5

-- With:
@normal_max_duration_minutes = 15

Workarounds applied on our environment:

  1. Patched collect.query_stats_collector locally to set
    query_plan_text = NULL and removed OUTER APPLY
    sys.dm_exec_text_query_plan. After truncating collect.query_stats
    (11.4 GB) and applying the patch, projected storage growth dropped
    from ~144 GB/day to ~150 MB/7 days.

  2. Updated collection frequencies:

    UPDATE config.collection_schedule
    SET frequency_minutes = 15, max_duration_minutes = 12
    WHERE collector_name IN
    (
    N'query_stats_collector',
    N'query_store_collector',
    N'procedure_stats_collector'
    );

    UPDATE config.collection_schedule
    SET frequency_minutes = 10, max_duration_minutes = 8
    WHERE collector_name = N'plan_cache_stats_collector';

  3. Updated Hung Job Monitor threshold:

    EXEC msdb.dbo.sp_update_jobstep
    @job_name = N'PerformanceMonitor - Hung Job Monitor',
    @step_id = 1,
    @command = N'EXECUTE config.check_hung_collector_job
    @job_name = N''PerformanceMonitor - Collection'',
    @normal_max_duration_minutes = 15,
    @first_run_max_duration_minutes = 30,
    @stop_hung_job = 1,
    @debug = 0;';

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions