Skip to content

Special case Duration display in datafusion-cli and sqllogictest #7070

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

In #6832 and #7068 , @tustvold proposes to use Duration consistently for timestamp arithmetic which makes the code simpler and consistent across DataFusion

However, one implication of this change, at the time of this writigng, is shown #6832 (comment)

Specifically, in DataFusion 28.0.0 the output of timestamp - timestamp is a interval and is displayed like this:

select (now() - '2023-01-01'::timestamp);
+-----------------------------------------------------------+
| now() - Utf8("2023-01-01")                                |
+-----------------------------------------------------------+
| 0 years 0 mons 204 days 16 hours 16 mins 8.121077000 secs |
+-----------------------------------------------------------+
1 row in set. Query took 0.004 seconds.

However, without any additional changes after #6832 it is displayed in ISO 8601 duration format

select (now() - '2023-01-01'::timestamp);
+----------------------------+
| now() - Utf8("2023-01-01") |
+----------------------------+
| P198DT72932.972880S        |
+----------------------------+
1 row in set. Query took 0.002 seconds.

I think this is problematic because the output:

  1. Looks wildly different than 28.0.0
  2. Looks wildly different than how intervals are expressed in queries (e.g. select now() + interval '1 year';)
  3. Will likely cause significant confusion as the ISO duration format, (e.g. P198DT72932.972880S), is not widely used for displaying intervals to humans
  4. The casting logic to strings will be different (see below)

Casting logic difference will means that the following query in 28.0.0

select (now() - '2023-01-01'::timestamp)::varchar || ' ago';
+----------------------------------------------------------------+
| now() - Utf8("2023-01-01") || Utf8(" ago")                     |
+----------------------------------------------------------------+
| 0 years 0 mons 204 days 16 hours 34 mins 15.407810000 secs ago |
+----------------------------------------------------------------+

Would produce this in 29.0.0:

+----------------------------------------------------------------+
| now() - Utf8("2023-01-01") || Utf8(" ago")                     |
+----------------------------------------------------------------+
| P198DT72932.972880S ago |
+----------------------------------------------------------------+

Describe the solution you'd like

I propose adding specific code for formatting DataFusion output, used in datafusion-cli, tests, and sqllogictests that formats Durations consistently with how Intervals are displayed in 28.0.0.

This change will unblock the arrow upgrade in #6832 as well as give us a single location to control formatting in DataFusion making it easier to improve the formatting of intervals / durations (or other types) in the future if we desire.

Anyone using DataFusion programmatically can either choose to use this new formatting routine, or can format Durations / Intervals to their specific need using arrow-rs or custom code (as we do in IOx, for example (source))

While this solution still suffers from the Duration --> String casting problem, I think it is ok because I don't think it is very common and there is a workaround (to cast the duration to an interval):

❯ select (now() - '2023-01-01'::timestamp)::interval::varchar || ' ago';
+----------------------------------------------------------------+
| now() - Utf8("2023-01-01") || Utf8(" ago")                     |
+----------------------------------------------------------------+
| 0 years 0 mons 204 days 16 hours 34 mins 15.407810000 secs ago |
+----------------------------------------------------------------+

For those people who want even more control of duration printing, it would probably make sense to add a specific to_char or similar function: https://www.postgresql.org/docs/current/functions-formatting.html

Describe alternatives you've considered

Change the parsing / formattting of Durations in arrow-rs

The idea is to make PRs like apache/arrow-rs#4557 that would change the formatting in arrow-rs

Some challenges with this approach:

  1. It is not clear if there is an "ideal" canonical format is for durations
  2. It would be a backwards compatible change that would likely impact other projects than DataFusion
  3. Existing data that was serialized with Durations (e.g. JSON or CSV) may be impacted

Special case String casting / parsing

One way to avoid issues here would be to just special case the casting and parsing (Duration <==> String) in datafusion and instead of calling the arrow cast function we could call datafusion_cast which would have special handling for some type and then fall back to the arrow cast kernel

The challenges with this approach are:

  1. It would be brittle (it would be hard to ensure that all new code called datafusion_cast rather than cast
  2. It is not clear that it would be entirely consistent (e.g. if there is code in arrow itself that calls cast)

Additional context

No response

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions