Skip to content

fix: always return ISO-8601 from datetime postproc (#484)#512

Merged
johnnygreco merged 6 commits intomainfrom
johnny/fix/484-datetime-postproc-iso8601
Apr 9, 2026
Merged

fix: always return ISO-8601 from datetime postproc (#484)#512
johnnygreco merged 6 commits intomainfrom
johnny/fix/484-datetime-postproc-iso8601

Conversation

@johnnygreco
Copy link
Copy Markdown
Contributor

📋 Summary

The DatetimeFormatMixin.postproc method used data-distribution heuristics to auto-detect output formatting, which silently stripped date/time components for small datasets (e.g., single-record previews) or narrow date ranges. This replaces the heuristics with deterministic ISO-8601 output via vectorized strftime, and corrects the convert_to docstring which omitted datetime-specific strftime usage.

🔗 Related Issue

Fixes #484

🔄 Changes

🐛 Fixed

  • Removed adaptive formatting heuristics from DatetimeFormatMixin.postproc — when convert_to is None, always returns ISO-8601 via vectorized strftime("%Y-%m-%dT%H:%M:%S") instead of row-by-row apply(lambda) (3bd9fd0)
  • Updated SamplerColumnConfig.convert_to docstring and Field description — previously said "must be one of float, int, or str" but datetime/timedelta samplers accept strftime format strings (888cb84)
  • Added docstring to DatetimeFormatMixin documenting the ISO-8601 default behavior (888cb84)

🧪 Tests

  • Updated existing tests (test_datetime_formats, test_timedelta) to expect ISO-8601 output
  • Added unit tests for postproc: single record, same-month, same-day, stdlib datetime.fromisoformat() compatibility, round-trip value preservation
  • Added integration tests: all 6 unit granularities at preview sizes, pd.to_datetime() round-trip, timedelta single record, timedelta hourly units, multiple datetime columns with mixed convert_to, narrow single-day range

🧪 Testing

  • make test passes (656 passed across config + engine)
  • Unit tests added/updated
  • E2E tests added/updated (N/A — sampler-level fix)

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated (N/A)

@johnnygreco johnnygreco requested a review from a team as a code owner April 8, 2026 17:04
@johnnygreco johnnygreco force-pushed the johnny/fix/484-datetime-postproc-iso8601 branch from 888cb84 to 37b2f7c Compare April 8, 2026 17:06
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 8, 2026

Code Review: PR #512 — fix: always return ISO-8601 from datetime postproc (#484)

Summary

This PR replaces non-deterministic, heuristic-based datetime formatting in DatetimeFormatMixin.postproc with a single, deterministic ISO-8601 output path via vectorized strftime. Previously, the method used data-distribution heuristics (nunique, sum) to guess an appropriate format, which silently stripped date/time components for small datasets or narrow ranges — a correctness bug especially visible in single-record previews. The fix also updates the convert_to docstring and Field description to document that datetime/timedelta samplers accept strftime format strings, not just "float"/"int"/"str".

Files changed: 4 (2 source, 2 test), +219 / -14 lines.

Findings

Correctness

  1. Core fix is clean and correct. The replacement of ~8 lines of branching heuristics with a single series.dt.strftime("%Y-%m-%dT%H:%M:%S") call is the right approach. It eliminates the data-dependent formatting that caused Datetime sampler column strips out data when generating small datasets #484 and is both simpler and faster (vectorized vs. row-by-row apply(lambda)). No behavioral regression is possible for users who already set convert_to explicitly, since that branch is unchanged.

  2. ISO-8601 format choice is sound. The format %Y-%m-%dT%H:%M:%S produces strings parseable by both datetime.fromisoformat() (stdlib) and pd.to_datetime(), which the new tests verify. This ensures downstream consumers can round-trip the output reliably.

  3. No timezone handling. The format string omits timezone info (%z/%Z), which is consistent with the existing behavior and the fact that the datetime sampler generates timezone-naive timestamps. No issue here, but worth noting for future timezone-aware sampler work.

Documentation

  1. Docstring and Field description updates are accurate. The convert_to docstring now correctly documents strftime format string support for datetime/timedelta samplers and the ISO-8601 default. The DatetimeFormatMixin class docstring is a welcome addition.

  2. (Nitpick) Minor formatting inconsistency in the docstring. The ISO-8601 representation is written as YYYY-MM-DDTHH:MM:SS in the docstring, where MM is used for both month and minute. This is a display convention (not a strftime directive), so it's technically ambiguous — YYYY-MM-DDTHH:mm:SS or YYYY-MM-DDThh:mm:ss would be more precise per ISO-8601 conventions. Very minor, not blocking.

Tests

  1. Excellent test coverage. The PR adds 7 new unit tests for DatetimeFormatMixin.postproc directly (single record, same-month, same-day, parseable, stdlib fromisoformat, round-trip) and 8 new integration tests through DatasetGenerator (all 6 unit granularities at preview size, pd.to_datetime round-trip, timedelta single record, timedelta hourly, multiple mixed columns, narrow single-day range). Each test targets a specific failure mode of the old heuristic code.

  2. Existing test updates are correct. The two pattern changes in test_datetime_formats and test_timedelta correctly update expectations from partial-date formats to full ISO-8601 patterns (\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}).

  3. (Nitpick) test_timedelta_hourly_units upper bound assertion. The test asserts deltas < pd.Timedelta(hours=4) but the params set dt_max=4. Depending on whether dt_max is inclusive or exclusive in the timedelta sampler, this boundary could be flaky. If dt_max is inclusive, the assertion should be <=. This is a pre-existing concern with the sampler contract, not introduced by this PR, but worth verifying.

Design

  1. No breaking change for explicit convert_to users. The convert_to is not None branch is untouched, so users who already pass a strftime string see identical behavior. Only the implicit/default path changes, which is the correct scope.

  2. Performance improvement as a side effect. Replacing row-by-row apply(lambda) with vectorized dt.strftime is a meaningful performance win for large datasets. Good.

Verdict

LGTM. This is a well-scoped, well-tested bug fix. The core change is a single-line replacement that eliminates a class of data-dependent formatting bugs. Documentation is updated appropriately. Test coverage is thorough and specifically targets the failure modes of the old code. The two nitpicks above are non-blocking.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 8, 2026

Greptile Summary

This PR replaces the heuristic-based DatetimeFormatMixin.postproc (which silently dropped time or date components based on cardinality of months/days/hours) with a single vectorized series.dt.strftime(\"%Y-%m-%dT%H:%M:%S\") call, always emitting ISO-8601. It also corrects the SamplerColumnConfig.convert_to docstring to mention strftime format strings for datetime/timedelta samplers.

Confidence Score: 5/5

Safe to merge — deterministic ISO-8601 fix with no logic errors and thorough test coverage.

All changed files are correct: the heuristic removal is clean, the generator's two-pass design (inject → postproc) ensures TimeDeltaSampler still reads raw datetime64 from its reference column, and the new tests cover single-record, narrow-range, round-trip, and multi-column scenarios. No P0/P1 findings.

No files require special attention.

Vulnerabilities

No security concerns identified.

Important Files Changed

Filename Overview
packages/data-designer-engine/src/data_designer/engine/sampling_gen/data_sources/base.py Core fix: removes 5-branch heuristic from DatetimeFormatMixin.postproc and replaces it with a single vectorized strftime call; logic is correct and the generator's two-pass design (inject then postproc) ensures TimeDeltaSampler still reads raw datetime64 data from reference columns.
packages/data-designer-config/src/data_designer/config/column_configs.py Docstring-only update: convert_to field description now correctly documents strftime format strings for datetime/timedelta samplers and the ISO-8601 default; no logic changes.
packages/data-designer-engine/tests/engine/sampling_gen/data_sources/test_sources.py Adds five targeted unit tests for postproc: single record, same-month, stdlib fromisoformat compatibility, and round-trip preservation; all assertions are correct.
packages/data-designer-engine/tests/engine/sampling_gen/test_generator.py Updates existing datetime/timedelta tests to expect full ISO-8601 and adds nine integration tests covering all unit granularities, narrow ranges, mixed convert_to configs, and pd.to_datetime round-trips; test bounds are consistent with randint's exclusive upper bound.
packages/data-designer/tests/interface/test_data_designer.py Adds a regression test for issue #484 at the DataDesigner interface level: single-record preview must return a full ISO-8601 timestamp, not a bare year string.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[DatetimeFormatMixin.postproc called] --> B{convert_to is not None?}
    B -- Yes --> C["series.dt.strftime(convert_to)\nUser-supplied format string"]
    B -- No --> D["series.dt.strftime('%Y-%m-%dT%H:%M:%S')\nDeterministic ISO-8601"]
    D --> E["e.g. '2024-01-15T09:30:00'"]
    C --> F["e.g. '01/15/2024 09:30'"]

    style D fill:#d4edda,stroke:#28a745
    style E fill:#d4edda,stroke:#28a745
Loading

Reviews (6): Last reviewed commit: "Merge branch 'main' into johnny/fix/484-..." | Re-trigger Greptile

@johnnygreco johnnygreco force-pushed the johnny/fix/484-datetime-postproc-iso8601 branch from 37b2f7c to 0e47db0 Compare April 8, 2026 17:09
andreatgretel
andreatgretel previously approved these changes Apr 9, 2026
Copy link
Copy Markdown
Contributor

@andreatgretel andreatgretel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 🚀 Left a couple minor nits on the test file.

The DatetimeFormatMixin.postproc heuristics inferred output format from
value distribution, silently stripping date/time components for small
datasets or narrow date ranges. Replace with deterministic ISO-8601
output via vectorized strftime. Users who need custom formats can still
set convert_to on the SamplerColumnConfig.
The SamplerColumnConfig.convert_to docstring incorrectly stated that
only "float", "int", or "str" are accepted. Datetime/timedelta samplers
accept strftime format strings. Also document the ISO-8601 default.
Captures the exact reproducer from the issue: a single-record datetime
preview through the public DataDesigner.preview() interface must return
a full ISO-8601 timestamp, not a bare year string.
- Remove postproc_same_day_records (subsumed by same_month + no_convert_to)
- Remove postproc_always_parseable (subsumed by stdlib_fromisoformat)
- Remove all_same_month integration test (subsumed by narrow_range_single_day)
- Update single_record test to use unit="h" matching the issue reproducer
@johnnygreco johnnygreco force-pushed the johnny/fix/484-datetime-postproc-iso8601 branch from 9802f90 to a58efe9 Compare April 9, 2026 14:56
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 9, 2026

Docs preview: https://837d1e89.dd-docs-preview.pages.dev

Notebook tutorials are placeholder-only in previews.

@johnnygreco johnnygreco merged commit 4a28136 into main Apr 9, 2026
48 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Datetime sampler column strips out data when generating small datasets

2 participants