Skip to content

Release FOCUS Scrub 0.1.0#8

Open
mike-finopsorg wants to merge 12 commits into
mainfrom
dev
Open

Release FOCUS Scrub 0.1.0#8
mike-finopsorg wants to merge 12 commits into
mainfrom
dev

Conversation

@mike-finopsorg
Copy link
Copy Markdown
Contributor

@mike-finopsorg mike-finopsorg commented Mar 12, 2026

Summary

This PR prepares the FOCUS Scrub 0.1.0 release by adding CI + coverage enforcement, expanding CLI capabilities (SQL output, column dropping, tag key scrubbing, dates-only mode), improving handler robustness, and significantly expanding test coverage and documentation.

Key changes

New output format: SQL

Adds sql as an output format alongside parquet and csv-gzip.

Generates:
CREATE TABLE IF NOT EXISTS ... DDL with inferred column types
batched bulk INSERT statements (1000 rows per batch)
optional custom table name via --sql-table-name
Implemented in focus_scrub/focus_scrub/io.py and wired through the CLI.
CLI enhancements

Adds new CLI flags in focus_scrub/focus_scrub/cli.py:

--drop-columns [COLUMN ...] (defaults to dropping x_Discounts; pass an empty list to keep it)
--scrub-tag-keys (scrub tag keys as well as values)
--dates-only (only shifts date fields; leaves other scrubbing disabled)
--sql-table-name (custom table name for SQL output)

Scrubbing behavior improvements

TagsHandler now supports:

native list and dict tag structures (not just string-encoded forms)
optional tag key scrubbing (scrub_tag_keys)
Handlers now guard pd.isna() calls to avoid errors on non-scalar / incompatible types.
Dataset handler selection supports dates_only mode (skips non-date handlers when enabled).
Column dropping

DataFrameScrub now supports drop_columns, with a default behavior to drop x_Discounts.

CI + coverage

Builds and Tests

Adds GitHub Actions workflow .github/workflows/ci.yml to run lint + tests + coverage, and upload coverage to Codecov.
Adds pytest-cov, coverage configuration, and enforces minimum 80% coverage via pytest.ini.
Updates Makefile with make coverage / make coverage-html.
Updates .gitignore to ignore coverage outputs.
Updates .pre-commit-config.yaml (ruff pre-commit hook revision bump and removes --fix arg).
Adds new test suites:
tests/test_cli.py (CLI integration tests for new flags and output formats)
tests/test_io.py (file discovery, IO, and SQL generation coverage)
Expands integration + handler tests for:
drop_columns default/override behavior
tag key scrubbing behavior and consistency
dates-only mode behavior

README expanded

clearer explanation of what is not scrubbed (metrics)
SQL output format docs + examples
--drop-columns, --scrub-tag-keys, and --dates-only usage
development/testing/coverage guidance

Files changed (high level)

CI/Tooling: .github/workflows/ci.yml, .pre-commit-config.yaml, pytest.ini, pyproject.toml, poetry.lock, Makefile, .gitignore
Core code: focus_scrub/focus_scrub/cli.py, focus_scrub/focus_scrub/io.py, focus_scrub/focus_scrub/handlers.py, focus_scrub/focus_scrub/scrub.py
Tests: tests/test_cli.py, tests/test_io.py, plus updates to tests/test_handlers.py, tests/test_integration.py

Signed-off-by: Mike Fuller <mike@finops.org>
Signed-off-by: Mike Fuller <mike@finops.org>
Signed-off-by: Mike Fuller <mike@finops.org>
Added feature to drop vendor columns.
Signed-off-by: Mike Fuller <mike@finops.org>
Added a new CLI option --scrub-tag-keys to allow scrubbing of tag keys in addition to tag values (default behavior only scrubs values).
Signed-off-by: Mike Fuller <mike@finops.org>
Added a new CLI option --dates-only to only shift dates while leaving all other data unchanged, useful for re-dating FOCUS datasets without scrubbing sensitive information.

Signed-off-by: Mike Fuller <mike@finops.org>
Bulk Import statements
Signed-off-by: Mike Fuller <mike@finops.org>
Signed-off-by: Mike Fuller <mike@finops.org>
Signed-off-by: Mike Fuller <mike@finops.org>
@mike-finopsorg mike-finopsorg self-assigned this Mar 12, 2026
Copy link
Copy Markdown
Collaborator

@Matt-Cowsert Matt-Cowsert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid release PR. The new features (SQL output, tag key scrubbing, dates-only mode, column dropping) are well-structured and thoroughly tested. The handler architecture with shared MappingEngine is clean. A few items need attention before merge.

Suggestions not in the diff

S3. .pre-commit-config.yaml lines 26-38 - Hardcoded path won't work for other contributors

Line 37 points to /Users/mfuller/git/github/FOCUS_Scrub/.venv/bin/python. That path only exists on one machine. Other contributors and CI will fail on pre-commit.

Suggested replacement for lines 26-38:

  - repo: local
    hooks:
      - id: pytest
        name: pytest
        entry: poetry run pytest tests/ -v
        language: system
        pass_filenames: false
        always_run: true

Using poetry run makes it portable across machines.

Comments (non-blocking)

C1. handlers.py - _scramble_string is copy-pasted between two classes

ResourceIdHandler and TagsHandler both have identical _get_char_mapping and _scramble_string methods. Not urgent, but pulling these into a shared mixin or helper would cut the duplication.

C2. scrub.py line 24 + cli.py line 75 - x_Discounts default lives in two places

The argparse default in cli.py and the fallback in scrub.py both independently default to ["x_Discounts"]. The scrub.py fallback is never reached from the CLI path. If this default ever changes, it's easy to update one and miss the other.

C3. .github/workflows/ci.yml - Codecov uploads will silently fail

codecov-action@v4 needs a CODECOV_TOKEN repository secret to upload. Without one, the step passes (since fail_ci_if_error: false) but coverage never reaches Codecov.

columns = df.columns.tolist()

# Write CREATE TABLE statement
f.write(f"CREATE TABLE IF NOT EXISTS {table_name} (\n")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
f.write(f"CREATE TABLE IF NOT EXISTS {table_name} (\n")
f.write(f'CREATE TABLE IF NOT EXISTS "{table_name}" (\n')

If someone passes --sql-table-name "order" or another SQL reserved word, this statement will fail when loaded into a database. Double-quoting the identifier (ANSI standard) handles reserved words and special characters.


# Write CREATE TABLE statement
f.write(f"CREATE TABLE IF NOT EXISTS {table_name} (\n")
column_defs = [" id BIGINT AUTO_INCREMENT PRIMARY KEY"]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AUTO_INCREMENT is MySQL-only syntax. The README (line 191) says the SQL output works with "PostgreSQL, MySQL, SQLite, etc." but PostgreSQL uses GENERATED ALWAYS AS IDENTITY and SQLite uses AUTOINCREMENT (one word, different position).

Two ways to fix:

  1. Remove the synthetic id column (suggestion below). Most data loading workflows don't need one.
  2. Keep it, but update the README to say the SQL targets MySQL specifically.

Up to you on which direction makes more sense.

If option 1:

Suggested change
column_defs = [" id BIGINT AUTO_INCREMENT PRIMARY KEY"]
column_defs = []

This would also require removing id from the README SQL examples.

column_defs = [" id BIGINT AUTO_INCREMENT PRIMARY KEY"]
for col in columns:
sql_type = _pandas_dtype_to_sql_type(df[col].dtype)
column_defs.append(f" {col} {sql_type}")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
column_defs.append(f" {col} {sql_type}")
column_defs.append(f' "{col}" {sql_type}')

FOCUS column names are PascalCase and unlikely to collide with reserved words, but quoting costs nothing and prevents surprises with custom columns.

return

# Column list for INSERT statements
column_list = ", ".join(f"{col}" for col in columns)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
column_list = ", ".join(f"{col}" for col in columns)
column_list = ", ".join(f'"{col}"' for col in columns)

Same reasoning as the column definition quoting. This ensures the INSERT column list matches the quoted CREATE TABLE definitions.

batch_end = min(batch_start + batch_size, len(df))
batch_df = df.iloc[batch_start:batch_end]

f.write(f"INSERT INTO {table_name} ({column_list})\n")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
f.write(f"INSERT INTO {table_name} ({column_list})\n")
f.write(f'INSERT INTO "{table_name}" ({column_list})\n')

Matches the CREATE TABLE quoting. Every reference to the table name should be quoted consistently.

Comment on lines +186 to +198
elif isinstance(value, (int, float)):
# Check if it's actually NaN
try:
is_na_num = pd.isna(value)
if hasattr(is_na_num, "__len__") and not isinstance(is_na_num, str):
is_na_num = False
except (ValueError, TypeError):
is_na_num = False

if is_na_num:
values.append("NULL")
else:
values.append(str(value))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code checks pd.isna(value) at line 176 and sends NaN values to NULL. Then in this int/float branch, it checks pd.isna(value) again. That second check can never be true, because if the value were NaN, the first check would have already handled it.

Suggested change
elif isinstance(value, (int, float)):
# Check if it's actually NaN
try:
is_na_num = pd.isna(value)
if hasattr(is_na_num, "__len__") and not isinstance(is_na_num, str):
is_na_num = False
except (ValueError, TypeError):
is_na_num = False
if is_na_num:
values.append("NULL")
else:
values.append(str(value))
elif isinstance(value, (int, float)):
values.append(str(value))

This replaces the 13-line block with the only branch that can actually execute.

Signed-off-by: Mike Fuller <mike@finops.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants