fix: always return ISO-8601 from datetime postproc (#484) #512
johnnygreco merged 6 commits into main
888cb84 to 37b2f7c
Code Review: PR #512, fix: always return ISO-8601 from datetime postproc (#484)
Summary
This PR replaces non-deterministic, heuristic-based datetime formatting in …
Files changed: 4 (2 source, 2 test), +219 / -14 lines.
Findings
Correctness
Documentation
Tests
Design
Verdict
LGTM. This is a well-scoped, well-tested bug fix. The core change is a single-line replacement that eliminates a class of data-dependent formatting bugs. Documentation is updated appropriately. Test coverage is thorough and specifically targets the failure modes of the old code. The two nitpicks above are non-blocking.
Greptile Summary
This PR replaces the heuristic-based …
| Filename | Overview |
|---|---|
| packages/data-designer-engine/src/data_designer/engine/sampling_gen/data_sources/base.py | Core fix: removes 5-branch heuristic from DatetimeFormatMixin.postproc and replaces it with a single vectorized strftime call; logic is correct and the generator's two-pass design (inject then postproc) ensures TimeDeltaSampler still reads raw datetime64 data from reference columns. |
| packages/data-designer-config/src/data_designer/config/column_configs.py | Docstring-only update: convert_to field description now correctly documents strftime format strings for datetime/timedelta samplers and the ISO-8601 default; no logic changes. |
| packages/data-designer-engine/tests/engine/sampling_gen/data_sources/test_sources.py | Adds five targeted unit tests for postproc: single record, same-month, stdlib fromisoformat compatibility, and round-trip preservation; all assertions are correct. |
| packages/data-designer-engine/tests/engine/sampling_gen/test_generator.py | Updates existing datetime/timedelta tests to expect full ISO-8601 and adds nine integration tests covering all unit granularities, narrow ranges, mixed convert_to configs, and pd.to_datetime round-trips; test bounds are consistent with randint's exclusive upper bound. |
| packages/data-designer/tests/interface/test_data_designer.py | Adds a regression test for issue #484 at the DataDesigner interface level: single-record preview must return a full ISO-8601 timestamp, not a bare year string. |
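
The round-trip checks described in the table can be sketched with the standard library alone. This is a minimal illustration, not the project's test code: `format_iso` is a hypothetical helper, and the only detail taken from the PR is the `%Y-%m-%dT%H:%M:%S` format string.

```python
from datetime import datetime

# Format string taken from the PR; everything else here is illustrative.
ISO_8601_FORMAT = "%Y-%m-%dT%H:%M:%S"


def format_iso(values):
    """Hypothetical helper: format every value with the same fixed format."""
    return [value.strftime(ISO_8601_FORMAT) for value in values]


# A narrow range: all values fall on the same day.
narrow_range = [
    datetime(2024, 1, 15, 9, 0, 0),
    datetime(2024, 1, 15, 17, 45, 30),
]

formatted = format_iso(narrow_range)

# Each string must parse with the stdlib and reproduce the original value.
round_tripped = [datetime.fromisoformat(text) for text in formatted]
```

Because the format is fixed, the same assertions hold for a single record or for any date range, which is the property the new tests rely on.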
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[DatetimeFormatMixin.postproc called] --> B{convert_to is not None?}
B -- Yes --> C["series.dt.strftime(convert_to)\nUser-supplied format string"]
B -- No --> D["series.dt.strftime('%Y-%m-%dT%H:%M:%S')\nDeterministic ISO-8601"]
D --> E["e.g. '2024-01-15T09:30:00'"]
C --> F["e.g. '01/15/2024 09:30'"]
style D fill:#d4edda,stroke:#28a745
style E fill:#d4edda,stroke:#28a745
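
The branch in the flowchart can be sketched in a few lines of pandas. This is a minimal sketch under stated assumptions: `postproc_sketch` is a hypothetical stand-in rather than the actual `DatetimeFormatMixin.postproc`, and only the two `strftime` calls and their example outputs come from the flowchart.

```python
from typing import Optional

import pandas as pd

# Default format shown in the flowchart.
ISO_8601_FORMAT = "%Y-%m-%dT%H:%M:%S"


def postproc_sketch(series: pd.Series, convert_to: Optional[str] = None) -> pd.Series:
    """Hypothetical stand-in for the postproc step (not the project's code).

    Uses the caller's strftime format when given, otherwise ISO-8601.
    """
    fmt = convert_to if convert_to is not None else ISO_8601_FORMAT
    # Vectorized formatting: one call for the whole column.
    return series.dt.strftime(fmt)


timestamps = pd.Series(pd.to_datetime(["2024-01-15 09:30:00"]))

default_output = postproc_sketch(timestamps).tolist()
custom_output = postproc_sketch(timestamps, convert_to="%m/%d/%Y %H:%M").tolist()
```

With a single record the default branch still yields a full timestamp, since the format no longer depends on the values in the column.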
Reviews (6): Last reviewed commit: "Merge branch 'main' into johnny/fix/484-..."
37b2f7c to 0e47db0
andreatgretel left a comment
LGTM! 🚀 Left a couple minor nits on the test file.
The DatetimeFormatMixin.postproc heuristics inferred output format from value distribution, silently stripping date/time components for small datasets or narrow date ranges. Replace with deterministic ISO-8601 output via vectorized strftime. Users who need custom formats can still set convert_to on the SamplerColumnConfig.
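
To see why a format inferred from the value distribution is fragile, consider the toy comparison below. It is an illustration only: `toy_heuristic_format` is a hypothetical three-branch heuristic written for this example, not the removed project code, and `deterministic_format` simply applies the fixed ISO-8601 format named in the fix.

```python
from datetime import datetime

ISO_8601_FORMAT = "%Y-%m-%dT%H:%M:%S"


def toy_heuristic_format(values):
    """Hypothetical data-dependent formatter, for illustration only."""
    if len({(v.month, v.day) for v in values}) == 1:
        # Every value shares month and day, so only the year is kept.
        fmt = "%Y"
    elif all((v.hour, v.minute, v.second) == (0, 0, 0) for v in values):
        # Every value is at midnight, so the time is dropped.
        fmt = "%Y-%m-%d"
    else:
        fmt = ISO_8601_FORMAT
    return [v.strftime(fmt) for v in values]


def deterministic_format(values):
    """Fixed format: the output shape never depends on the data."""
    return [v.strftime(ISO_8601_FORMAT) for v in values]


# A single record trivially "shares" its month and day with itself.
single_record = [datetime(2024, 1, 15, 9, 0, 0)]

heuristic_output = toy_heuristic_format(single_record)
deterministic_output = deterministic_format(single_record)
```

In this toy version the single record collapses to a bare year, while the fixed format keeps the full timestamp regardless of how many rows are generated.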
The SamplerColumnConfig.convert_to docstring incorrectly stated that only "float", "int", or "str" are accepted. Datetime/timedelta samplers accept strftime format strings. Also document the ISO-8601 default.
Captures the exact reproducer from the issue: a single-record datetime preview through the public DataDesigner.preview() interface must return a full ISO-8601 timestamp, not a bare year string.
- Remove postproc_same_day_records (subsumed by same_month + no_convert_to)
- Remove postproc_always_parseable (subsumed by stdlib_fromisoformat)
- Remove all_same_month integration test (subsumed by narrow_range_single_day)
- Update single_record test to use unit="h" matching the issue reproducer
… redundant isinstance
9802f90 to a58efe9
Docs preview: https://837d1e89.dd-docs-preview.pages.dev
📋 Summary
The DatetimeFormatMixin.postproc method used data-distribution heuristics to auto-detect output formatting, which silently stripped date/time components for small datasets (e.g., single-record previews) or narrow date ranges. This replaces the heuristics with deterministic ISO-8601 output via vectorized strftime, and corrects the convert_to docstring, which omitted datetime-specific strftime usage.
🔗 Related Issue
Fixes #484
🔄 Changes
🐛 Fixed
- DatetimeFormatMixin.postproc: when convert_to is None, always returns ISO-8601 via vectorized strftime("%Y-%m-%dT%H:%M:%S") instead of row-by-row apply(lambda) (3bd9fd0)
- SamplerColumnConfig.convert_to docstring and Field description: previously said "must be one of float, int, or str", but datetime/timedelta samplers accept strftime format strings (888cb84)
- … DatetimeFormatMixin documenting the ISO-8601 default behavior (888cb84)
🧪 Tests
- … test_datetime_formats, test_timedelta) to expect ISO-8601 output
- … postproc: single record, same-month, same-day, stdlib datetime.fromisoformat() compatibility, round-trip value preservation
- … pd.to_datetime() round-trip, timedelta single record, timedelta hourly units, multiple datetime columns with mixed convert_to, narrow single-day range
🧪 Testing
make test passes (656 passed across config + engine)
✅ Checklist