
Feat/vd 4347 outlier detection cleaning #128

Draft
AlexanderPietsch wants to merge 59 commits into dev from feat/VD-4347-OutlierDetectionCleaning

Conversation

@AlexanderPietsch (Contributor) commented Mar 16, 2026

Summary

This branch introduces outlier detection as a first-class data-quality gate and refactors the validation/evaluation foundation to make validation behavior explicit, cache-safe, and dashboard-ready.
It simplifies the validation config shape, persists run-level validation metadata, and prevents evaluation-cache contamination by keying cache entries with an effective per-job validation profile hash.

Changes

  • Validation config simplification & run metadata:

    • validation.data_quality.on_fail is required when data_quality is configured.
    • Computes run-level validation metadata (resolved profile + active/inactive gates).
    • Persists run metadata once per run to the normalized ResultStore.
    • Adds active validation gates to the CLI run summary and summary payload.
  • Storage:

    • ResultStore: new run_metadata table with:
      • validation_profile_json
      • active_gates_json
      • inactive_gates_json
    • EvaluationCache: cache key now includes validation_config_hash.
    • Runner computes effective per-job validation profile hash (global + collection override resolution) and passes it on cache get/set.
  • Documentation/config examples:

    • Updated README validation section to reflect actual optionality/required behavior and continuity diagnostics nuance.
    • Updated config/example.yaml to current validation schema.
  • Tests:

    • Updated config and runner tests for scalar validation fields.
    • Added evaluation-store run metadata round-trip test.
    • Added evaluation-cache validation-hash isolation test.
  • Outlier detection introduced:

    • Adds a validation.data_quality.outlier_detection gate with:
      • max_outlier_pct
      • method (zscore or modified_zscore)
      • zscore_threshold
    • Integrates outlier checks into data-validation reliability reasons and gate decisions.
    • Handles indeterminate modified-zscore cases explicitly (e.g. mad_zero) via a structured rejection reason.
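The gate behavior above can be sketched roughly as follows. This is an illustrative standalone version, not the branch's actual implementation; the function name and return shape are assumptions, while the `mad_zero` indeterminate case and the `zscore`/`modified_zscore` method names come from the PR summary:

```python
import numpy as np

def detect_outliers(values, method="modified_zscore", zscore_threshold=3.5):
    """Return (outlier_pct, reason); reason is non-None for indeterminate cases."""
    x = np.asarray(values, dtype=float)
    if method == "zscore":
        std = x.std()
        if std == 0.0:
            # constant series: z-scores are undefined ("zero_std" is illustrative)
            return 0.0, "zero_std"
        scores = np.abs((x - x.mean()) / std)
    else:  # modified_zscore: robust to extreme values via the median/MAD
        median = np.median(x)
        mad = np.median(np.abs(x - median))
        if mad == 0.0:
            # more than half the points equal the median; scores are undefined
            return 0.0, "mad_zero"
        scores = 0.6745 * np.abs(x - median) / mad
    outlier_pct = 100.0 * float((scores > zscore_threshold).mean())
    return outlier_pct, None
```

A gate would then compare `outlier_pct` against `max_outlier_pct` and treat a non-None reason as a structured rejection.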

Breaking changes:

  • Validation config shape changed:
    • validation.data_quality.min_data_points.min -> validation.data_quality.min_data_points
    • validation.data_quality.kurtosis.max -> validation.data_quality.kurtosis
  • validation.data_quality.on_fail is now required when data_quality exists.
  • Evaluation cache schema/key includes validation_config_hash (old cache rows are not reused under new keying).
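The new cache keying hinges on a stable hash of the effective per-job validation profile. A minimal sketch, assuming JSON-serializable profile dicts and a shallow per-collection override (the branch's real resolution logic is more involved):

```python
import hashlib
import json

def validation_config_hash(global_profile, collection_override=None):
    """Hash the effective validation profile (global merged with an optional
    per-collection override) into a stable cache-key component."""
    effective = {**global_profile, **(collection_override or {})}
    # sort_keys makes the hash independent of dict insertion order
    canonical = json.dumps(effective, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]
```

Because the hash is part of the cache key, any policy change (including a per-collection override) naturally misses old cache rows instead of reusing them.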

How to Test

  • Full suite:
    • make tests
  • Optional end-to-end run:
    • make run
  • Optional dashboard source toggle:
    • EVALUATION_RESULTS_SOURCE=result_store

Relevant config/env:

  • Validation config is under validation.data_quality and validation.optimization.
  • evaluation_mode still defaults to backtest.
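An illustrative config fragment matching the shape described above. Key names follow the PR summary; the values and the comment annotations are made up for illustration:

```yaml
validation:
  data_quality:
    on_fail: skip            # required whenever data_quality is configured
    min_data_points: 250     # scalar (was min_data_points.min)
    kurtosis: 8.0            # scalar (was kurtosis.max)
    outlier_detection:
      max_outlier_pct: 1.0
      method: modified_zscore   # or: zscore
      zscore_threshold: 3.5
  optimization:
    # optimization-policy settings go here
```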

Checklist (KISS)

  • Pre-commit passes locally (pre-commit run --all-files)
  • Tests added/updated where it makes sense (80% cov gate)
  • Docs/README updated if needed
  • No secrets committed; .env values are excluded
  • Backward compatibility considered (configs, CLI flags)

Notes:

  • Backward compatibility was considered, but this branch intentionally introduces config/cache key breakage for correctness and long-term maintainability.
  • make tests passes on this branch.

Related Issues/Links

  • Closes #
  • References #

Note

Medium Risk
Touches core backtest gating, evaluation caching keys, and SQLite schemas; mistakes could invalidate cached results or incorrectly skip or reject jobs/results. The changes are covered by expanded tests but still require attention to migration/backward-compatibility behavior and policy-resolution correctness.

Overview
RCA: Validation behavior was implicitly merged at runtime and evaluation caching was keyed only by mode/data fingerprints, allowing policy changes (or per-collection overrides) to contaminate cache correctness and making gate activation opaque.

The Fix: Restructures validation config into explicit modules (data_quality.continuity, data_quality.outlier_detection, result_consistency.*) and resolves global-vs-collection overrides during load_config via resolve_validation_overrides. The runner now computes a per-job validation_config_hash to key EvaluationCache entries, adds new data-quality outlier checks plus result-consistency gates (trade PnL concentration and execution fill price variance), enriches evaluation stats with trade_meta, and persists run-level validation profiles + active/inactive gate IDs in a new ResultStore.run_metadata table (also surfaced in CLI/dashboard summary JSON).
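The global-vs-collection resolution mentioned above can be sketched as a recursive dict merge. resolve_validation_overrides is the branch's function name; this standalone version only illustrates the idea and is an assumption about its semantics:

```python
def resolve_validation_overrides(global_policy: dict, overrides: dict) -> dict:
    """Merge a per-collection override onto the global policy: override
    leaves win, while untouched global keys are preserved."""
    resolved = dict(global_policy)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(resolved.get(key), dict):
            # recurse so partial overrides don't clobber whole sub-sections
            resolved[key] = resolve_validation_overrides(resolved[key], value)
        else:
            resolved[key] = value
    return resolved
```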

The Proof: Updates/adds unit tests across config parsing, runner gating, evaluator trade metadata, and store/cache behavior (including validation-hash cache isolation and run-metadata round-trip); make tests passes and coverage remains strictly >80%.

Telemetry Added: Run summaries and dashboard payloads now include resolved validation profiles plus active_gates/inactive_gates, and the same metadata is persisted per-run in result_store (run_metadata) for post-run inspection.

Written by Cursor Bugbot for commit f4f32c8. This will update automatically on new commits.

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the backtesting engine by integrating robust data quality validation, including a new outlier detection mechanism. It streamlines the configuration of validation policies and improves the reliability and traceability of evaluation results through refactored data handling and persistent metadata. These changes aim to provide more explicit control over data quality and optimization feasibility, ensuring more dependable and reproducible backtest outcomes.

Highlights

  • Outlier Detection: Introduced a new outlier_detection data quality gate with configurable max_outlier_pct, method (zscore/modified_zscore), and zscore_threshold to identify and handle anomalous data points.
  • Refactored Validation & Evaluation: The core validation and evaluation foundation has been refactored to make validation behavior explicit, ensure cache safety, and prepare for dashboard integration. This includes new data structures for job context, gate decisions, and evaluation outcomes.
  • Validation Configuration Simplification: Simplified the validation configuration shape, making validation.data_quality.on_fail a required field when data quality is configured. Configuration now supports global and per-collection overrides for validation policies.
  • Run-Level Validation Metadata Persistence: Implemented persistence of run-level validation metadata (resolved profile, active/inactive gates) to a new ResultStore table, providing a comprehensive record of validation policies applied during a run.
  • Evaluation Cache Contamination Prevention: Enhanced the evaluation cache key to include a validation_config_hash, preventing contamination by ensuring cache entries are unique to the effective per-job validation profile.
  • Breaking Changes: Introduced breaking changes to the validation config shape, specifically min_data_points and kurtosis fields, and made validation.data_quality.on_fail mandatory. The evaluation cache schema has also changed, meaning old cache rows will not be reused under the new keying.
Changelog
  • AGENTS.md
    • Updated the 'How to run' section to recommend make targets for common operations.
    • Added a new 'Design rules' section outlining cognitive complexity guidelines for functions.
  • DEVELOPMENT.md
    • Added a new document detailing the backtest runner's high-level flow, gate model, evaluation model, and continuity score calendar behavior.
  • Makefile
    • Added new refresh-image, refresh-image-nc, tests, coverage, and precommit-coverage targets.
    • Removed poetry install and poetry run from most run commands for cleaner execution.
    • Updated dashboard port mapping and removed git add poetry.lock from lock commands.
  • README.md
    • Added documentation for the EVALUATION_RESULTS_SOURCE environment variable.
    • Included the --evaluation-mode CLI option and updated run summary metrics to reflect fresh simulation/metric evaluations.
    • Added a detailed 'Validation & Optimization Policy' section explaining the new configuration options for data quality and optimization gates.
  • config/example.yaml
    • Added a new validation section with data_quality and optimization configurations.
    • Introduced outlier_detection parameters within data_quality.
    • Included commented-out examples for per-collection validation overrides.
  • poetry.lock
    • Updated click and typer package versions.
    • Added new dependencies: exchange-calendars, korean-lunar-calendar, pyluach, and toolz.
  • pyproject.toml
    • Updated typer and click dependency versions.
    • Added exchange-calendars as a new dependency.
  • src/backtest/evaluation/__init__.py
    • Added a new initialization file for the evaluation module, exposing its core components.
  • src/backtest/evaluation/adapters.py
    • Added a new module with a utility function normalized_rows_to_legacy_rows for data transformation.
  • src/backtest/evaluation/contracts.py
    • Added a new module defining data contracts for evaluation requests, outcomes, and result records.
  • src/backtest/evaluation/evaluator.py
    • Added a new module introducing Evaluator protocol and BacktestEvaluator for handling simulation and metric evaluation.
  • src/backtest/evaluation/store.py
    • Added a new module implementing EvaluationCache and ResultStore for persistent storage of evaluation results and run metadata using SQLite.
  • src/backtest/results_cache.py
    • Introduced ResultsCacheRecord dataclass for structured cache entries.
    • Modified ResultsCache to include evaluation_mode and mode_config_hash in its primary key, along with migration logic for existing caches.
    • Updated the set method to accept a ResultsCacheRecord object.
  • src/backtest/runner.py
    • Refactored the BacktestRunner to incorporate new data structures for job context, gate decisions, and validation states.
    • Implemented a multi-stage validation pipeline including collection, data fetching, data validation, execution context preparation, strategy plan validation, and strategy result validation.
    • Integrated EvaluationCache and ResultStore for enhanced caching and persistence of evaluation data and run metadata.
    • Extended _bars_per_year to support monthly timeframes and added new methods for continuity score calculation and outlier detection.
    • Removed direct configuration of param_dof_multiplier and param_min_bars, now managed through validation policies.
    • Added methods for serializing and hashing validation profiles to ensure cache isolation.
  • src/config.py
    • Introduced new dataclasses for detailed validation configurations: ValidationCalendarConfig, ValidationDataQualityConfig, ValidationContinuityConfig, ValidationOutlierDetectionConfig, ValidationConfig, and OptimizationPolicyConfig.
    • Removed param_dof_multiplier and param_min_bars from the main Config class.
    • Added evaluation_mode and validation fields to the Config class.
    • Implemented helper functions for merging and parsing validation configurations, including normalize_validation_defaults to apply default values and overrides.
  • src/main.py
    • Defined SUMMARY_JSON_FILENAME constant for consistent file naming.
    • Added an evaluation_mode CLI option to override the configuration's evaluation mode.
    • Updated BacktestRunner instantiation to pass the configured evaluation_mode.
    • Modified dashboard payload generation to optionally use the new ResultStore via an environment variable.
    • Included validation metadata in the run summary and updated console output to display new metrics like fresh_simulation_runs and active validation gates.
  • tests/test_backtest_runner.py
    • Added _StubEvaluationCache for testing purposes.
    • Expanded tests for _bars_per_year and introduced tests for _timeframe_to_timedelta.
    • Added comprehensive tests for compute_continuity_score covering various data scenarios and calendar types.
    • Included tests for evaluation_cache persistence and isolation.
    • Added tests to verify strategy skipping behavior based on validation gate failures, including min data points, continuity, kurtosis, and outlier detection.
    • Introduced tests for collection-level validation overrides and the blocking of jobs within a collection due to validation failures.
    • Added a test to confirm rejection of unimplemented walk-forward evaluation mode.
  • tests/test_config.py
    • Added tests for loading and validating the new evaluation_mode configuration.
    • Included extensive tests for parsing and validating the new validation configurations, covering data_quality (min data points, continuity, on_fail, calendar settings, outlier detection) and optimization policies.
  • tests/test_evaluation_store.py
    • Added a new test file to verify the functionality of EvaluationCache and ResultStore, including tests for mode hash, validation hash, and run metadata round-trip persistence.
  • tests/test_main_cli.py
    • Updated DummyRunner constructor to accommodate new parameters.
    • Added tests to confirm the evaluation_mode CLI option correctly overrides the configuration and handles invalid inputs.
  • tests/test_results_cache.py
    • Added a test to ensure the ResultsCache correctly distinguishes and stores entries based on different evaluation modes.
Activity
  • AlexanderPietsch initiated the pull request to introduce outlier detection and refactor the validation system.
  • Significant code changes were made across multiple modules to implement the new validation framework, including new data contracts, evaluation stores, and runner logic.
  • Configuration files and documentation were updated to reflect the new validation policies and breaking changes.
  • New unit tests were added and existing ones modified to ensure the correctness and robustness of the new features, especially around data quality gates and caching mechanisms.

@AlexanderPietsch AlexanderPietsch changed the base branch from dev to feat/VD-4344-data-collection-reliabilty-checks March 16, 2026 03:16
@gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive validation and optimization policy for the backtesting system, refactoring the BacktestRunner to implement a multi-stage gating system for job and strategy execution. New data structures and SQLite-based caching (EvaluationCache, ResultStore) are added to manage evaluation results and run metadata, alongside updated dependency management. Documentation is expanded to cover the new validation rules and CLI options. A review comment suggests improving resource management in src/backtest/evaluation/store.py by using with statements for sqlite3 connections to ensure automatic closing and transaction handling, even during errors.

I am having trouble creating individual review comments, so my feedback is included below.

src/backtest/evaluation/store.py (150-193)

medium

For improved resource management and to make the code more idiomatic, consider using a with statement for handling sqlite3 connections. This ensures the connection is automatically closed and transactions are committed or rolled back, even if errors occur. This pattern can be applied to all methods in this file that interact with the database (_ensure, get, set in EvaluationCache, and all methods in ResultStore).

        with sqlite3.connect(self.db_path) as con:
            con.execute(
                """
                INSERT OR REPLACE INTO evaluation_cache
                (
                    collection,
                    symbol,
                    timeframe,
                    strategy,
                    params_json,
                    metric_name,
                    metric_value,
                    stats_json,
                    data_fingerprint,
                    fees,
                    slippage,
                    evaluation_mode,
                    mode_config_hash,
                    validation_config_hash,
                    engine_version
                )
                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
                """,
                (
                    collection,
                    symbol,
                    timeframe,
                    strategy,
                    params_json,
                    metric_name,
                    float(metric_value),
                    json.dumps(stats, sort_keys=True),
                    data_fingerprint,
                    fees,
                    slippage,
                    evaluation_mode,
                    mode_config_hash,
                    validation_config_hash,
                    EVALUATION_SCHEMA_VERSION,
                ),
            )

@cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Stale global policy reference used in per-collection merge
    • After normalizing global validation policies, the function now refreshes the global policy references from validation_cfg before per-collection merges.

@cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Autofix Details

Bugbot Autofix resolved 1 of the 3 issues found in the latest run.

  • ✅ Fixed: Stats dict self-references during trade meta construction
    • evaluate now snapshots raw simulation stats and passes that immutable snapshot into _build_trade_meta so evaluator-injected keys are never read in trade meta construction.


@cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


    metric_val = float(cached["metric_value"])
    plan.evaluations += 1
    if not np.isfinite(metric_val):
        return float("-inf")

Cached non-finite evaluations silently lost from metrics tracking

Low Severity

In _apply_cached_evaluation, when the cached metric_value is non-finite (e.g., -inf from a previously invalid evaluation), the method returns early at line 2060 before incrementing result_cache_hits. Since result_cache_misses is only incremented when evaluation_cache.get() returns None, these cached-but-invalid evaluations are counted by neither counter. Previously, all cache hits were always counted. This creates a metrics gap where result_cache_hits + result_cache_misses < total_evaluations, which could mislead monitoring or observability consumers.
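One way to close the counting gap described above is to record the hit before the early return. The names below are taken from the review comment; the surrounding runner code is not shown, so this is only a sketch of the fix:

```python
import math

def apply_cached_evaluation(cached: dict, counters: dict) -> float:
    """Count every cache hit, including cached non-finite metric values,
    so that hits + misses always equals total cache lookups."""
    metric_val = float(cached["metric_value"])
    counters["result_cache_hits"] += 1  # record the hit before any early return
    if not math.isfinite(metric_val):
        return float("-inf")
    return metric_val
```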


Base automatically changed from feat/VD-4344-data-collection-reliabilty-checks to dev March 23, 2026 11:51