Release FOCUS Scrub 0.1.0#8
Conversation
Signed-off-by: Mike Fuller <mike@finops.org>
… types in TagsHandler
Added feature to drop vendor columns.
Added a new CLI option --scrub-tag-keys to allow scrubbing of tag keys in addition to tag values (default behavior only scrubs values).
Added a new CLI option --dates-only to only shift dates while leaving all other data unchanged, useful for re-dating FOCUS datasets without scrubbing sensitive information.
Bulk Import statements
Matt-Cowsert
left a comment
Solid release PR. The new features (SQL output, tag key scrubbing, dates-only mode, column dropping) are well-structured and thoroughly tested. The handler architecture with shared MappingEngine is clean. A few items need attention before merge.
Suggestions not in the diff
S3. .pre-commit-config.yaml lines 26-38 - Hardcoded path won't work for other contributors
Line 37 points to /Users/mfuller/git/github/FOCUS_Scrub/.venv/bin/python. That path exists only on one machine, so pre-commit will fail for every other contributor and in CI.
Suggested replacement for lines 26-38:
- repo: local
  hooks:
    - id: pytest
      name: pytest
      entry: poetry run pytest tests/ -v
      language: system
      pass_filenames: false
      always_run: true

Using poetry run makes it portable across machines.
Comments (non-blocking)
C1. handlers.py - _scramble_string is duplicated between two classes
ResourceIdHandler and TagsHandler both have identical _get_char_mapping and _scramble_string methods. Not urgent, but pulling these into a shared mixin or helper would cut the duplication.
C2. scrub.py line 24 + cli.py line 75 - x_Discounts default lives in two places
The argparse default in cli.py and the fallback in scrub.py both independently default to ["x_Discounts"]. The scrub.py fallback is never reached from the CLI path. If this default ever changes, it's easy to update one and miss the other.
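One way to address C2 would be a single module-level constant that both call sites read. The sketch below is hypothetical (DEFAULT_DROP_COLUMNS and resolve_drop_columns are invented names, not the PR's code), assuming the default stays ["x_Discounts"]:

```python
# Hypothetical refactor: keep the default in one place so cli.py and
# scrub.py cannot drift apart.
DEFAULT_DROP_COLUMNS = ["x_Discounts"]


def resolve_drop_columns(drop_columns=None):
    """Single source of truth: cli.py would pass this as the argparse
    default, and scrub.py would call it for its fallback."""
    if drop_columns is None:
        return list(DEFAULT_DROP_COLUMNS)
    return list(drop_columns)


print(resolve_drop_columns())    # ['x_Discounts']
print(resolve_drop_columns([]))  # []
```

Returning a copy (list(...)) also prevents callers from mutating the shared default.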
C3. .github/workflows/ci.yml - Codecov uploads will silently fail
codecov-action@v4 needs a CODECOV_TOKEN repository secret to upload. Without one, the step passes (since fail_ci_if_error: false) but coverage never reaches Codecov.
    columns = df.columns.tolist()

    # Write CREATE TABLE statement
    f.write(f"CREATE TABLE IF NOT EXISTS {table_name} (\n")
Suggested change:
-    f.write(f"CREATE TABLE IF NOT EXISTS {table_name} (\n")
+    f.write(f'CREATE TABLE IF NOT EXISTS "{table_name}" (\n')
If someone passes --sql-table-name "order" or another SQL reserved word, this statement will fail when loaded into a database. Double-quoting the identifier (ANSI standard) handles reserved words and special characters.
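A small helper keeps the quoting consistent everywhere the identifier appears. This is a sketch, not the PR's code; quote_ident is a hypothetical name, assuming ANSI double-quote semantics (embedded quotes are escaped by doubling):

```python
def quote_ident(name: str) -> str:
    """Quote a SQL identifier ANSI-style: wrap in double quotes and
    escape any embedded double quote by doubling it."""
    return '"' + name.replace('"', '""') + '"'


# A reserved word like "order" becomes a safe quoted identifier.
print(quote_ident("order"))   # "order"
print(quote_ident('we"ird'))  # "we""ird"
```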
    # Write CREATE TABLE statement
    f.write(f"CREATE TABLE IF NOT EXISTS {table_name} (\n")
    column_defs = ["    id BIGINT AUTO_INCREMENT PRIMARY KEY"]
AUTO_INCREMENT is MySQL-only syntax. The README (line 191) says the SQL output works with "PostgreSQL, MySQL, SQLite, etc." but PostgreSQL uses GENERATED ALWAYS AS IDENTITY and SQLite uses AUTOINCREMENT (one word, different position).
Two ways to fix:
- Remove the synthetic id column (suggestion below). Most data loading workflows don't need one.
- Keep it, but update the README to say the SQL targets MySQL specifically.
Up to you on which direction makes more sense.
If option 1:
-    column_defs = ["    id BIGINT AUTO_INCREMENT PRIMARY KEY"]
+    column_defs = []
This would also require removing id from the README SQL examples.
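To illustrate why one DDL string cannot cover all three engines, here is a hypothetical lookup (AUTO_ID_DDL and id_column_ddl are invented names for illustration) showing how each dialect spells an auto-incrementing primary key:

```python
# Each engine uses different syntax for an auto-incrementing id column.
AUTO_ID_DDL = {
    "mysql": "id BIGINT AUTO_INCREMENT PRIMARY KEY",
    "postgresql": "id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY",
    # SQLite requires INTEGER (not BIGINT) for AUTOINCREMENT rowid aliasing.
    "sqlite": "id INTEGER PRIMARY KEY AUTOINCREMENT",
}


def id_column_ddl(dialect: str) -> str:
    return AUTO_ID_DDL[dialect]
```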
    column_defs = ["    id BIGINT AUTO_INCREMENT PRIMARY KEY"]
    for col in columns:
        sql_type = _pandas_dtype_to_sql_type(df[col].dtype)
        column_defs.append(f"    {col} {sql_type}")
Suggested change:
-        column_defs.append(f"    {col} {sql_type}")
+        column_defs.append(f'    "{col}" {sql_type}')
FOCUS column names are PascalCase and unlikely to collide with reserved words, but quoting costs nothing and prevents surprises with custom columns.
        return

    # Column list for INSERT statements
    column_list = ", ".join(f"{col}" for col in columns)
Suggested change:
-    column_list = ", ".join(f"{col}" for col in columns)
+    column_list = ", ".join(f'"{col}"' for col in columns)
Same reasoning as the column definition quoting. This ensures the INSERT column list matches the quoted CREATE TABLE definitions.
        batch_end = min(batch_start + batch_size, len(df))
        batch_df = df.iloc[batch_start:batch_end]

        f.write(f"INSERT INTO {table_name} ({column_list})\n")
Suggested change:
-        f.write(f"INSERT INTO {table_name} ({column_list})\n")
+        f.write(f'INSERT INTO "{table_name}" ({column_list})\n')
Matches the CREATE TABLE quoting. Every reference to the table name should be quoted consistently.
    elif isinstance(value, (int, float)):
        # Check if it's actually NaN
        try:
            is_na_num = pd.isna(value)
            if hasattr(is_na_num, "__len__") and not isinstance(is_na_num, str):
                is_na_num = False
        except (ValueError, TypeError):
            is_na_num = False

        if is_na_num:
            values.append("NULL")
        else:
            values.append(str(value))
The code already checks pd.isna(value) at line 176 and maps NaN values to NULL. This int/float branch then checks pd.isna(value) again, but that second check can never be true: if the value were NaN, the first check would have handled it before reaching this branch.
Suggested change:
-    elif isinstance(value, (int, float)):
-        # Check if it's actually NaN
-        try:
-            is_na_num = pd.isna(value)
-            if hasattr(is_na_num, "__len__") and not isinstance(is_na_num, str):
-                is_na_num = False
-        except (ValueError, TypeError):
-            is_na_num = False
-        if is_na_num:
-            values.append("NULL")
-        else:
-            values.append(str(value))
+    elif isinstance(value, (int, float)):
+        values.append(str(value))
This replaces the 13-line block with the only branch that can actually execute.
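The dead-branch claim can be checked with a minimal sketch of the same control flow (classify is a hypothetical stand-in for the reviewed function, not the PR's code):

```python
import pandas as pd


def classify(value):
    # Mirrors the reviewed control flow: the first pd.isna check maps
    # NaN to NULL, so the int/float branch only ever sees real numbers.
    if pd.isna(value):
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    return repr(value)


print(classify(float("nan")))  # NULL -- caught by the first check
print(classify(3.5))           # 3.5
print(classify(7))             # 7
```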
Summary
This PR prepares the FOCUS Scrub 0.1.0 release by adding CI + coverage enforcement, expanding CLI capabilities (SQL output, column dropping, tag key scrubbing, dates-only mode), improving handler robustness, and significantly expanding test coverage and documentation.
Key changes
New output format: SQL
Adds sql as an output format alongside parquet and csv-gzip.
Generates:
CREATE TABLE IF NOT EXISTS ... DDL with inferred column types
batched bulk INSERT statements (1000 rows per batch)
optional custom table name via --sql-table-name
Implemented in focus_scrub/focus_scrub/io.py and wired through the CLI.
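The batching scheme described above can be sketched as follows. This is a minimal illustration, not the PR's io.py: write_insert_batches is a hypothetical name, and repr-based value formatting is a simplification (real SQL string escaping needs more care):

```python
import pandas as pd


def write_insert_batches(df: pd.DataFrame, table_name: str, batch_size: int = 1000):
    """Slice the DataFrame into batch_size-row chunks and build one
    multi-row INSERT statement per chunk."""
    statements = []
    columns = ", ".join(f'"{c}"' for c in df.columns)
    for start in range(0, len(df), batch_size):
        batch = df.iloc[start:start + batch_size]
        rows = ",\n".join(
            "(" + ", ".join(repr(v) for v in row) + ")"
            for row in batch.itertuples(index=False)
        )
        statements.append(
            f'INSERT INTO "{table_name}" ({columns})\nVALUES\n{rows};'
        )
    return statements


df = pd.DataFrame({"BilledCost": [1.0, 2.0], "ServiceName": ["EC2", "S3"]})
print(write_insert_batches(df, "focus_data")[0])
```

With the default batch_size of 1000, a 2500-row frame would produce three INSERT statements.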
CLI enhancements
Adds new CLI flags in focus_scrub/focus_scrub/cli.py:
--drop-columns [COLUMN ...] (defaults to dropping x_Discounts; pass an empty list to keep it)
--scrub-tag-keys (scrub tag keys as well as values)
--dates-only (only shifts date fields; leaves other scrubbing disabled)
--sql-table-name (custom table name for SQL output)
Scrubbing behavior improvements
TagsHandler now supports:
native list and dict tag structures (not just string-encoded forms)
optional tag key scrubbing (scrub_tag_keys)
Handlers now guard pd.isna() calls to avoid errors on non-scalar / incompatible types.
Dataset handler selection supports dates_only mode (skips non-date handlers when enabled).
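The pd.isna() guard mentioned above matters because pd.isna returns an array for list-like input, which is ambiguous in a boolean context. A sketch of the guard pattern (is_missing is a hypothetical name, not the PR's helper):

```python
import numpy as np
import pandas as pd


def is_missing(value) -> bool:
    # pd.isna on a list-like value returns an element-wise array rather
    # than a single bool; treat non-scalar results as "not missing".
    result = pd.isna(value)
    if isinstance(result, (np.ndarray, pd.Series)):
        return False
    return bool(result)


print(is_missing(float("nan")))  # True
print(is_missing(["a", None]))   # False -- non-scalar input guarded
```

Without the guard, `if pd.isna(value):` on a list-valued tag raises "The truth value of an array is ambiguous".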
Column dropping
DataFrameScrub now supports drop_columns, with a default behavior to drop x_Discounts.
CI + coverage
Adds GitHub Actions workflow .github/workflows/ci.yml to run lint + tests + coverage, and upload coverage to Codecov.
Adds pytest-cov, coverage configuration, and enforces minimum 80% coverage via pytest.ini.
Updates Makefile with make coverage / make coverage-html.
Updates .gitignore to ignore coverage outputs.
Updates .pre-commit-config.yaml (ruff pre-commit hook revision bump and removes --fix arg).
Adds new test suites:
tests/test_cli.py (CLI integration tests for new flags and output formats)
tests/test_io.py (file discovery, IO, and SQL generation coverage)
Expands integration + handler tests for:
drop_columns default/override behavior
tag key scrubbing behavior and consistency
dates-only mode behavior
README expanded
clearer explanation of what is not scrubbed (metrics)
SQL output format docs + examples
--drop-columns, --scrub-tag-keys, and --dates-only usage
development/testing/coverage guidance
Files changed (high level)
CI/Tooling: .github/workflows/ci.yml, .pre-commit-config.yaml, pytest.ini, pyproject.toml, poetry.lock, Makefile, .gitignore
Core code: focus_scrub/focus_scrub/cli.py, focus_scrub/focus_scrub/io.py, focus_scrub/focus_scrub/handlers.py, focus_scrub/focus_scrub/scrub.py
Tests: tests/test_cli.py, tests/test_io.py, plus updates to tests/test_handlers.py, tests/test_integration.py