Release FOCUS Scrub 0.1.0#8
Conversation
Signed-off-by: Mike Fuller <mike@finops.org>
… types in TagsHandler
Added feature to drop vendor columns.
Added a new CLI option --scrub-tag-keys to allow scrubbing of tag keys in addition to tag values (default behavior only scrubs values).
Added a new CLI option --dates-only to only shift dates while leaving all other data unchanged, useful for re-dating FOCUS datasets without scrubbing sensitive information.
Bulk Import statements
Matt-Cowsert
left a comment
Solid release PR. The new features (SQL output, tag key scrubbing, dates-only mode, column dropping) are well-structured and thoroughly tested. The handler architecture with shared MappingEngine is clean. A few items need attention before merge.
Suggestions not in the diff
S3. .pre-commit-config.yaml lines 26-38 - Hardcoded path won't work for other contributors
Line 37 points to /Users/mfuller/git/github/FOCUS_Scrub/.venv/bin/python. That path exists only on one machine, so pre-commit will fail for every other contributor and in CI.
Suggested replacement for lines 26-38:
- repo: local
  hooks:
    - id: pytest
      name: pytest
      entry: poetry run pytest tests/ -v
      language: system
      pass_filenames: false
      always_run: true

Using poetry run makes it portable across machines.
Comments (non-blocking)
C1. handlers.py - _scramble_string is duplicated between two classes
ResourceIdHandler and TagsHandler both have identical _get_char_mapping and _scramble_string methods. Not urgent, but pulling these into a shared mixin or helper would cut the duplication.
C2. scrub.py line 24 + cli.py line 75 - x_Discounts default lives in two places
The argparse default in cli.py and the fallback in scrub.py both independently default to ["x_Discounts"]. The scrub.py fallback is never reached from the CLI path. If this default ever changes, it's easy to update one and miss the other.
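One way to address C2 would be a single module-level constant that both call sites read. The sketch below is hypothetical (DEFAULT_DROP_COLUMNS and resolve_drop_columns are invented names, not the PR's code), assuming the default stays ["x_Discounts"]:

```python
# Hypothetical refactor: keep the default in one place so cli.py and
# scrub.py cannot drift apart.
DEFAULT_DROP_COLUMNS = ["x_Discounts"]


def resolve_drop_columns(drop_columns=None):
    """Single source of truth: cli.py would pass this as the argparse
    default, and scrub.py would call it for its fallback."""
    if drop_columns is None:
        return list(DEFAULT_DROP_COLUMNS)
    return list(drop_columns)


print(resolve_drop_columns())    # ['x_Discounts']
print(resolve_drop_columns([]))  # []
```

Returning a copy (list(...)) also prevents callers from mutating the shared default.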
C3. .github/workflows/ci.yml - Codecov uploads will silently fail
codecov-action@v4 needs a CODECOV_TOKEN repository secret to upload. Without one, the step passes (since fail_ci_if_error: false) but coverage never reaches Codecov.
    columns = df.columns.tolist()

    # Write CREATE TABLE statement
    f.write(f"CREATE TABLE IF NOT EXISTS {table_name} (\n")
Suggested change:
-    f.write(f"CREATE TABLE IF NOT EXISTS {table_name} (\n")
+    f.write(f'CREATE TABLE IF NOT EXISTS "{table_name}" (\n')
If someone passes --sql-table-name "order" or another SQL reserved word, this statement will fail when loaded into a database. Double-quoting the identifier (ANSI standard) handles reserved words and special characters.
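A small helper keeps the quoting consistent everywhere the identifier appears. This is a sketch, not the PR's code; quote_ident is a hypothetical name, assuming ANSI double-quote semantics (embedded quotes are escaped by doubling):

```python
def quote_ident(name: str) -> str:
    """Quote a SQL identifier ANSI-style: wrap in double quotes and
    escape any embedded double quote by doubling it."""
    return '"' + name.replace('"', '""') + '"'


# A reserved word like "order" becomes a safe quoted identifier.
print(quote_ident("order"))   # "order"
print(quote_ident('we"ird'))  # "we""ird"
```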
    # Write CREATE TABLE statement
    f.write(f"CREATE TABLE IF NOT EXISTS {table_name} (\n")
    column_defs = ["    id BIGINT AUTO_INCREMENT PRIMARY KEY"]
AUTO_INCREMENT is MySQL-only syntax. The README (line 191) says the SQL output works with "PostgreSQL, MySQL, SQLite, etc." but PostgreSQL uses GENERATED ALWAYS AS IDENTITY and SQLite uses AUTOINCREMENT (one word, different position).
Two ways to fix:
- Remove the synthetic id column (suggestion below). Most data loading workflows don't need one.
- Keep it, but update the README to say the SQL targets MySQL specifically.
Up to you on which direction makes more sense.
If option 1:
-    column_defs = ["    id BIGINT AUTO_INCREMENT PRIMARY KEY"]
+    column_defs = []
This would also require removing id from the README SQL examples.
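To illustrate why one DDL string cannot cover all three engines, here is a hypothetical lookup (AUTO_ID_DDL and id_column_ddl are invented names for illustration) showing how each dialect spells an auto-incrementing primary key:

```python
# Each engine uses different syntax for an auto-incrementing id column.
AUTO_ID_DDL = {
    "mysql": "id BIGINT AUTO_INCREMENT PRIMARY KEY",
    "postgresql": "id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY",
    # SQLite requires INTEGER (not BIGINT) for AUTOINCREMENT rowid aliasing.
    "sqlite": "id INTEGER PRIMARY KEY AUTOINCREMENT",
}


def id_column_ddl(dialect: str) -> str:
    return AUTO_ID_DDL[dialect]
```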
    column_defs = ["    id BIGINT AUTO_INCREMENT PRIMARY KEY"]
    for col in columns:
        sql_type = _pandas_dtype_to_sql_type(df[col].dtype)
        column_defs.append(f"    {col} {sql_type}")
Suggested change:
-        column_defs.append(f"    {col} {sql_type}")
+        column_defs.append(f'    "{col}" {sql_type}')
FOCUS column names are PascalCase and unlikely to collide with reserved words, but quoting costs nothing and prevents surprises with custom columns.
        return

    # Column list for INSERT statements
    column_list = ", ".join(f"{col}" for col in columns)
Suggested change:
-    column_list = ", ".join(f"{col}" for col in columns)
+    column_list = ", ".join(f'"{col}"' for col in columns)
Same reasoning as the column definition quoting. This ensures the INSERT column list matches the quoted CREATE TABLE definitions.
        batch_end = min(batch_start + batch_size, len(df))
        batch_df = df.iloc[batch_start:batch_end]

        f.write(f"INSERT INTO {table_name} ({column_list})\n")
Suggested change:
-        f.write(f"INSERT INTO {table_name} ({column_list})\n")
+        f.write(f'INSERT INTO "{table_name}" ({column_list})\n')
Matches the CREATE TABLE quoting. Every reference to the table name should be quoted consistently.
    elif isinstance(value, (int, float)):
        # Check if it's actually NaN
        try:
            is_na_num = pd.isna(value)
            if hasattr(is_na_num, "__len__") and not isinstance(is_na_num, str):
                is_na_num = False
        except (ValueError, TypeError):
            is_na_num = False

        if is_na_num:
            values.append("NULL")
        else:
            values.append(str(value))
The code already checks pd.isna(value) at line 176 and maps NaN values to NULL. This int/float branch then checks pd.isna(value) again, but that second check can never be true: if the value were NaN, the first check would have handled it before reaching this branch.
Suggested change:
-    elif isinstance(value, (int, float)):
-        # Check if it's actually NaN
-        try:
-            is_na_num = pd.isna(value)
-            if hasattr(is_na_num, "__len__") and not isinstance(is_na_num, str):
-                is_na_num = False
-        except (ValueError, TypeError):
-            is_na_num = False
-        if is_na_num:
-            values.append("NULL")
-        else:
-            values.append(str(value))
+    elif isinstance(value, (int, float)):
+        values.append(str(value))
This replaces the 13-line block with the only branch that can actually execute.
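The dead-branch claim can be checked with a minimal sketch of the same control flow (classify is a hypothetical stand-in for the reviewed function, not the PR's code):

```python
import pandas as pd


def classify(value):
    # Mirrors the reviewed control flow: the first pd.isna check maps
    # NaN to NULL, so the int/float branch only ever sees real numbers.
    if pd.isna(value):
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    return repr(value)


print(classify(float("nan")))  # NULL -- caught by the first check
print(classify(3.5))           # 3.5
print(classify(7))             # 7
```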
Summary
This PR prepares the FOCUS Scrub 0.1.0 release by adding CI + coverage enforcement, expanding CLI capabilities (SQL output, column dropping, tag key scrubbing, dates-only mode), improving handler robustness, and significantly expanding test coverage and documentation.
Key changes
New output format: SQL
Adds sql as an output format alongside parquet and csv-gzip.
Generates:
CREATE TABLE IF NOT EXISTS ... DDL with inferred column types
batched bulk INSERT statements (1000 rows per batch)
optional custom table name via --sql-table-name
Implemented in focus_scrub/focus_scrub/io.py and wired through the CLI.
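The batching scheme described above can be sketched as follows. This is a minimal illustration, not the PR's io.py: write_insert_batches is a hypothetical name, and repr-based value formatting is a simplification (real SQL string escaping needs more care):

```python
import pandas as pd


def write_insert_batches(df: pd.DataFrame, table_name: str, batch_size: int = 1000):
    """Slice the DataFrame into batch_size-row chunks and build one
    multi-row INSERT statement per chunk."""
    statements = []
    columns = ", ".join(f'"{c}"' for c in df.columns)
    for start in range(0, len(df), batch_size):
        batch = df.iloc[start:start + batch_size]
        rows = ",\n".join(
            "(" + ", ".join(repr(v) for v in row) + ")"
            for row in batch.itertuples(index=False)
        )
        statements.append(
            f'INSERT INTO "{table_name}" ({columns})\nVALUES\n{rows};'
        )
    return statements


df = pd.DataFrame({"BilledCost": [1.0, 2.0], "ServiceName": ["EC2", "S3"]})
print(write_insert_batches(df, "focus_data")[0])
```

With the default batch_size of 1000, a 2500-row frame would produce three INSERT statements.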
CLI enhancements
Adds new CLI flags in focus_scrub/focus_scrub/cli.py:
--drop-columns [COLUMN ...] (defaults to dropping x_Discounts; pass an empty list to keep it)
--scrub-tag-keys (scrub tag keys as well as values)
--dates-only (only shifts date fields; leaves other scrubbing disabled)
--sql-table-name (custom table name for SQL output)
Scrubbing behavior improvements
TagsHandler now supports:
native list and dict tag structures (not just string-encoded forms)
optional tag key scrubbing (scrub_tag_keys)
Handlers now guard pd.isna() calls to avoid errors on non-scalar / incompatible types.
Dataset handler selection supports dates_only mode (skips non-date handlers when enabled).
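The pd.isna() guard mentioned above matters because pd.isna returns an array for list-like input, which is ambiguous in a boolean context. A sketch of the guard pattern (is_missing is a hypothetical name, not the PR's helper):

```python
import numpy as np
import pandas as pd


def is_missing(value) -> bool:
    # pd.isna on a list-like value returns an element-wise array rather
    # than a single bool; treat non-scalar results as "not missing".
    result = pd.isna(value)
    if isinstance(result, (np.ndarray, pd.Series)):
        return False
    return bool(result)


print(is_missing(float("nan")))  # True
print(is_missing(["a", None]))   # False -- non-scalar input guarded
```

Without the guard, `if pd.isna(value):` on a list-valued tag raises "The truth value of an array is ambiguous".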
Column dropping
DataFrameScrub now supports drop_columns, with a default behavior to drop x_Discounts.
CI + coverage
Adds GitHub Actions workflow .github/workflows/ci.yml to run lint + tests + coverage, and upload coverage to Codecov.
Adds pytest-cov, coverage configuration, and enforces minimum 80% coverage via pytest.ini.
Updates Makefile with make coverage / make coverage-html.
Updates .gitignore to ignore coverage outputs.
Updates .pre-commit-config.yaml (ruff pre-commit hook revision bump and removes --fix arg).
Adds new test suites:
tests/test_cli.py (CLI integration tests for new flags and output formats)
tests/test_io.py (file discovery, IO, and SQL generation coverage)
Expands integration + handler tests for:
drop_columns default/override behavior
tag key scrubbing behavior and consistency
dates-only mode behavior
README expanded
clearer explanation of what is not scrubbed (metrics)
SQL output format docs + examples
--drop-columns, --scrub-tag-keys, and --dates-only usage
development/testing/coverage guidance
Files changed (high level)
CI/Tooling: .github/workflows/ci.yml, .pre-commit-config.yaml, pytest.ini, pyproject.toml, poetry.lock, Makefile, .gitignore
Core code: focus_scrub/focus_scrub/cli.py, focus_scrub/focus_scrub/io.py, focus_scrub/focus_scrub/handlers.py, focus_scrub/focus_scrub/scrub.py
Tests: tests/test_cli.py, tests/test_io.py, plus updates to tests/test_handlers.py, tests/test_integration.py