Update arrow and support extract on durations and intervals by nrc · Pull Request #1 · pydantic/datafusion

nrc · 2024-08-22T21:50:27Z

No description provided.

* Implement native support StringView for overlay Signed-off-by: Chojan Shang <psiace@apache.org> * Re-write impl of overlay Signed-off-by: Chojan Shang <psiace@apache.org> * Minor update Signed-off-by: Chojan Shang <psiace@apache.org> * Add more tests Signed-off-by: Chojan Shang <psiace@apache.org> --------- Signed-off-by: Chojan Shang <psiace@apache.org>

* feat/11953: Support StringView for TRANSLATE() fn Signed-off-by: Devan <devandbenz@gmail.com> * formatting Signed-off-by: Devan <devandbenz@gmail.com> * fixes internal error for GenericByteArray cast Signed-off-by: Devan <devandbenz@gmail.com> * adds additional TRANSLATE test Signed-off-by: Devan <devandbenz@gmail.com> * adds additional TRANSLATE test Signed-off-by: Devan <devandbenz@gmail.com> * rm unnecessary generic Signed-off-by: Devan <devandbenz@gmail.com> * cleanup + fix typo Signed-off-by: Devan <devandbenz@gmail.com> * cleanup + fix typo Signed-off-by: Devan <devandbenz@gmail.com> * adds some additional testing to sqllogictests for TRANSLATE string_view Signed-off-by: Devan <devandbenz@gmail.com> --------- Signed-off-by: Devan <devandbenz@gmail.com>

…t usage (apache#11999) * fix error Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * use exec err Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> --------- Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* Improve performance of REPEAT functions Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com> * Improve performance of REPEAT functions Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com> * Fix cargo fmt Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com> --------- Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com>

apache#11994) * Minor: improve ParquetExec docs * typo * clippy * fix whitespace so rustdoc does not treat as tests * Apply suggestions from code review Co-authored-by: Oleks V <comphead@users.noreply.github.com> * expound upon column rewriting in the context of schema evolution --------- Co-authored-by: Oleks V <comphead@users.noreply.github.com>

* feat: Add map_extract module and function * chore: Fix fmt * chore: Add tests * chore: Simplify * chore: Simplify * chore: Fix clippy * doc: Add user doc * feat: use Signature::user_defined * chore: Update tests * chore: Fix fmt * chore: Fix clippy * chore * chore: typo * chore: Check args len in return_type * doc: Update doc * chore: Simplify logic * chore: check args earlier * feat: Support UTF8VIEW * chore: Update doc * chore: Fic clippy * refacotr: Use MutableArrayData * chore * refactor: Avoid type conversion * chore: Fix clippy * chore: Follow DuckDB * Update datafusion/functions-nested/src/map_extract.rs Co-authored-by: Jay Zhan <jayzhan211@gmail.com> * chore: Fix fmt --------- Co-authored-by: Jay Zhan <jayzhan211@gmail.com>

…rate (apache#12036) * refactor: Move LimitedDistinctAggregation to physical-optimizer crate * chore: Update cargo.lock * chore: Fix clippy * Update datafusion/physical-optimizer/src/limited_distinct_aggregation.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * chore: Clean import --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

…#12030) * Adds new crate for window functions * Moves `row_number` to window functions crate * Fixes build errors * Regenerates protobuf * Makes `row_number` no-op temporarily * Minor: fixes formatting * Implements `WindowUDF` for `row_number` * Minor: fixes formatting * Adds singleton instance of UDWF: `row_number` * Adds partition evaluator * Registers default window functions * Implements `evaluate_all` * Fixes: allow non-uppercase globals * Minor: prefix underscore for unused variable * Minor: fixes formatting * Uses `row_number_udwf` * Fixes: unparser test for `row_number` * Uses row number to represent functional dependency * Minor: fixes formatting * Removes `row_number` from case-insensitive name test * Deletes wrapper for `row_number` window expression * Fixes: lowercase name in error statement * Fixes: `row_number` fields are not nullable * Fixes: lowercase name in explain output * Updates Cargo.lock * Fixes: lowercase name in explain output * Adds support for result ordering * Minor: add newline between methods * Fixes: re-export crate name in doc comments * Adds doc comment for `WindowUDFImpl::nullable` * Minor: renames variable * Minor: update doc comments * Deletes code * Minor: update doc comments * Minor: adds period * Adds doc comment for `row_number` window UDF * Adds fluent API for creating `row_number` expression * Minor: removes unnecessary path prefix * Adds roundtrip logical plan test case * Updates unit tests for `row_number` * Deletes code * Minor: copy edit doc comments * Minor: deletes comment * Minor: copy edits udwf doc comments

…pache#12000) * fix/11982: resolves projection issue found in with_column window fn usage Signed-off-by: Devan <devandbenz@gmail.com> * fix/11982: resolves projection issue found in with_column window fn usage Signed-off-by: Devan <devandbenz@gmail.com> * fmt Signed-off-by: Devan <devandbenz@gmail.com> * fmt Signed-off-by: Devan <devandbenz@gmail.com> * refactor to get tests working Signed-off-by: Devan <devandbenz@gmail.com> * change test to use test harness Signed-off-by: Devan <devandbenz@gmail.com> * use row_number method and add comment about test Signed-off-by: Devan <devandbenz@gmail.com> * add back import Signed-off-by: Devan <devandbenz@gmail.com> --------- Signed-off-by: Devan <devandbenz@gmail.com>

* Update itertools requirement from 0.12 to 0.13

  Updates the requirements on [itertools](https://github.com/rust-itertools/itertools) to permit the latest version.
  - [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md)
  - [Commits](rust-itertools/itertools@v0.12.0...v0.13.0)

  ---
  updated-dependencies:
  - dependency-name: itertools
    dependency-type: direct:production
  ...

  Signed-off-by: dependabot[bot] <support@github.com>
* Update Cargo.lock
* Avoid deprecated API
* nested-functions: workspace version of itertools

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Eduard Karacharov <eduard.karacharov@gmail.com>

* fix the wildcard expand for filter plan * expand the wildcard for the error message * add the tests * fix recompute_schema * fix clippy * cargo fmt * change the check for having clause * rename the function and moving the tests * fix check * expand the schema for aggregate plan * reduce the time to expand wildcard * clean the testing table after tested * fmt and address review * stop expand wildcard and add more check for group-by and selects * simplify the having check

* Convert LogicalPlanBuilder to use Arc<LogicalPlan>

  Summary: Update the struct to use Arc. Verified that tests pass. Used Arc::clone as much as possible, and unwrap_arc when an owned LogicalPlan is required. Keep pub fn input unchanged as LogicalPlan to limit the change scope. If we change the pub fn, we can also get rid of this pattern:

  ```
  unnest(unwrap_arc(self.plan), ...).map(Self::from)
  ```
* Use self.plan directly without Arc::clone
* Implement From to allow into syntax
* Improve documentation
* Consume self in to_recursive_query

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
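
The commit above stores the plan behind an `Arc` and takes ownership only where an owned value is needed. The following is a minimal, self-contained sketch of that idea, not the code from the commit: `Plan` and `Builder` are hypothetical stand-ins for `LogicalPlan` and `LogicalPlanBuilder`, and the body of `unwrap_arc` here is an assumption about how such a helper can be written with the standard library.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a logical plan node.
#[derive(Debug, Clone, PartialEq)]
struct Plan {
    name: String,
}

/// Take ownership of the value inside an `Arc`.
/// If this is the only reference, no copy is made; otherwise the plan is cloned.
fn unwrap_arc(plan: Arc<Plan>) -> Plan {
    Arc::try_unwrap(plan).unwrap_or_else(|shared| (*shared).clone())
}

/// Hypothetical builder that keeps its plan behind an `Arc`.
struct Builder {
    plan: Arc<Plan>,
}

impl Builder {
    fn new(plan: Plan) -> Self {
        Self { plan: Arc::new(plan) }
    }

    /// Cheap accessor: only increments the reference count.
    fn plan(&self) -> Arc<Plan> {
        Arc::clone(&self.plan)
    }

    /// Consume the builder and return an owned plan.
    fn build(self) -> Plan {
        unwrap_arc(self.plan)
    }
}

/// Allows `Builder::from(arc)` and `arc.into()`.
impl From<Arc<Plan>> for Builder {
    fn from(plan: Arc<Plan>) -> Self {
        Self { plan }
    }
}
```

With this shape, sharing a plan is a reference-count bump, and the clone happens only when an owned plan is requested while other references are still alive.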

* Improve documentation on `StringArrayType` trait * tweaks * Update datafusion/functions/src/string/common.rs Co-authored-by: Oleks V <comphead@users.noreply.github.com> --------- Co-authored-by: Oleks V <comphead@users.noreply.github.com>

…lly (apache#12098) * fix: UDF, UDAF, UDWF with_alias(..) should wrap the inner function fully * revert back to having Arc<Self> * add notes about adding stuff into Aliased impls * fix clippy --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
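
The fix above concerns alias wrappers that must forward every method to the function they wrap. As a rough illustration of why that matters, here is a sketch with a hypothetical, much-reduced trait; `ScalarFn`, `RowId`, and `Aliased` are invented for this example and are not DataFusion's types. The point is that any trait method with a default body is silently replaced by the default unless the wrapper forwards it.

```rust
use std::sync::Arc;

/// Hypothetical function trait. `aliases` and `nullable` have defaults,
/// which is what makes an incomplete wrapper easy to get wrong.
trait ScalarFn {
    fn name(&self) -> &str;
    fn aliases(&self) -> &[String] {
        &[]
    }
    fn nullable(&self) -> bool {
        true
    }
    fn invoke(&self, arg: i64) -> i64;
}

/// Example inner function that overrides the `nullable` default.
struct RowId;

impl ScalarFn for RowId {
    fn name(&self) -> &str {
        "row_id"
    }
    fn nullable(&self) -> bool {
        false
    }
    fn invoke(&self, arg: i64) -> i64 {
        arg + 1
    }
}

/// Wrapper that adds aliases and forwards everything else to `inner`.
struct Aliased {
    inner: Arc<dyn ScalarFn>,
    aliases: Vec<String>,
}

impl Aliased {
    fn new(inner: Arc<dyn ScalarFn>, extra: &[&str]) -> Self {
        // Keep the inner aliases and append the new ones.
        let mut aliases = inner.aliases().to_vec();
        aliases.extend(extra.iter().map(|a| a.to_string()));
        Self { inner, aliases }
    }
}

impl ScalarFn for Aliased {
    fn name(&self) -> &str {
        self.inner.name()
    }
    fn aliases(&self) -> &[String] {
        &self.aliases
    }
    // Without this forward, the wrapper would report the default `true`.
    fn nullable(&self) -> bool {
        self.inner.nullable()
    }
    fn invoke(&self, arg: i64) -> i64 {
        self.inner.invoke(arg)
    }
}
```

When a new method with a default body is added to the trait, the wrapper needs a matching forward, which is presumably what the "add notes about adding stuff into Aliased impls" item refers to.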

* minor: SortExec measure elapsed_compute time when sorting

  Whilst investigating query execution performance I noticed that some SortExec nodes were reporting suspiciously short elapsed_compute times. It appears that the SortExec node wasn't running the elapsed_compute timer while it was doing the actual sorting operation.
* fix: apply review suggestions
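
One common way to make sure a compute metric covers the actual work is a scoped timer that adds the elapsed time when it goes out of scope. The sketch below shows that pattern using only the standard library; `ElapsedCompute`, `TimerGuard`, and `sort_batch` are hypothetical names for illustration and do not reproduce the change made in the commit.

```rust
use std::cell::Cell;
use std::time::{Duration, Instant};

/// Hypothetical metric that accumulates elapsed compute time.
#[derive(Default)]
struct ElapsedCompute {
    total: Cell<Duration>,
}

impl ElapsedCompute {
    /// Start timing; the elapsed time is recorded when the guard is dropped.
    fn timer(&self) -> TimerGuard<'_> {
        TimerGuard { metric: self, start: Instant::now() }
    }

    fn value(&self) -> Duration {
        self.total.get()
    }
}

/// Guard that adds the time since its creation to the metric on drop.
struct TimerGuard<'a> {
    metric: &'a ElapsedCompute,
    start: Instant,
}

impl Drop for TimerGuard<'_> {
    fn drop(&mut self) {
        let updated = self.metric.total.get() + self.start.elapsed();
        self.metric.total.set(updated);
    }
}

/// Sort a batch of values while the timer is running.
fn sort_batch(metric: &ElapsedCompute, mut values: Vec<i64>) -> Vec<i64> {
    // The guard lives until the end of the function, so the sort is timed.
    let _timer = metric.timer();
    values.sort_unstable();
    values
}
```

The failure mode described in the commit corresponds to doing the sort outside the scope of such a guard, so the metric only captures the surrounding bookkeeping.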

* naive impl * calc capacity * cleanup * Update test * simplify coercion logic * write some more tests * Update tests * Improve implementation and do the right thing for null * add ticket reference --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

nrc · 2024-08-23T01:57:48Z

The official Arrow and ecosystem update is at apache#12032 and it is quite a big one this time around

Signed-off-by: Nick Cameron <nrc@ncameron.org>

davidhewitt · 2024-08-29T08:02:04Z

Will pull this commit onto a new branch 👍

davidhewitt · 2024-09-12T08:12:34Z

I pulled it onto https://github.com/pydantic/datafusion/tree/logfire-2024-08-29

…messages (apache#20387)

## Which issue does this PR close?

- Closes apache#20386.

## Rationale for this change

The `memory_limit` configuration (`RuntimeEnvBuilder::new().with_memory_limit()`) uses the `greedy` memory pool by default. However, if `memory_pool` (`RuntimeEnvBuilder::new().with_memory_pool()`) is set, the configured pool, such as `fair`, overrides that default. If neither `memory_limit` nor `memory_pool` is set, the `unbounded` memory pool is used. It is therefore useful to expose the pool that was ultimately selected as part of the `ResourcesExhausted` error message, so that end users know which pool is in use and can switch to another one (`greedy`, `fair`, `unbounded`) if needed.

- Also, [this comparison table](lance-format/lance#3601 (comment)) is an example use case comparing the runtime behavior of the `greedy` and `fair` memory pools. Exposing the pool in use as part of the native logs helps with this kind of comparison.

The following example use cases were produced with `datafusion-cli`:

**Case1**: datafusion-cli result when `memory-limit` and `top-memory-consumers > 0` are set:

```
eren.avsarogullari@AWGNPWVK961 debug % ./datafusion-cli --memory-limit 10M --command 'select * from generate_series(1,500000) as t1(v1) order by v1;' --top-memory-consumers 3
DataFusion CLI v53.0.0
Error: Not enough memory to continue external sort. Consider increasing the memory limit config: 'datafusion.runtime.memory_limit', or decreasing the config: 'datafusion.execution.sort_spill_reservation_bytes'.
caused by
Resources exhausted: Additional allocation failed for ExternalSorter[0] with top memory consumers (across reservations) as: ExternalSorterMerge[0]#2(can spill: false) consumed 10.0 MB, peak 10.0 MB, DataFusion-Cli#0(can spill: false) consumed 0.0 B, peak 0.0 B, ExternalSorter[0]#1(can spill: true) consumed 0.0 B, peak 0.0 B.
Error: Failed to allocate additional 128.0 KB for ExternalSorter[0] with 0.0 B already allocated for this reservation - 0.0 B remain available for the total memory pool: greedy(used: 10.0 MB, pool_size: 10.0 MB)
```

**Case2**: datafusion-cli result when `memory-limit` and `top-memory-consumers = 0` (disabling top memory consumers logging) are set:

```
eren.avsarogullari@AWGNPWVK961 debug % ./datafusion-cli --memory-limit 10M --command 'select * from generate_series(1,500000) as t1(v1) order by v1;' --top-memory-consumers 0
DataFusion CLI v53.0.0
Error: Not enough memory to continue external sort. Consider increasing the memory limit config: 'datafusion.runtime.memory_limit', or decreasing the config: 'datafusion.execution.sort_spill_reservation_bytes'.
caused by
Resources exhausted: Failed to allocate additional 128.0 KB for ExternalSorter[0] with 0.0 B already allocated for this reservation - 0.0 B remain available for the total memory pool: greedy(used: 10.0 MB, pool_size: 10.0 MB)
```

**Case3**: datafusion-cli result when `memory-limit`, `memory-pool` and `top-memory-consumers > 0` are set:

```
eren.avsarogullari@AWGNPWVK961 debug % ./datafusion-cli --memory-limit 10M --mem-pool-type fair --top-memory-consumers 3 --command 'select * from generate_series(1,500000) as t1(v1) order by v1;'
DataFusion CLI v53.0.0
Error: Not enough memory to continue external sort. Consider increasing the memory limit config: 'datafusion.runtime.memory_limit', or decreasing the config: 'datafusion.execution.sort_spill_reservation_bytes'.
caused by
Resources exhausted: Additional allocation failed for ExternalSorter[0] with top memory consumers (across reservations) as: ExternalSorterMerge[0]#2(can spill: false) consumed 10.0 MB, peak 10.0 MB, ExternalSorter[0]#1(can spill: true) consumed 0.0 B, peak 0.0 B, DataFusion-Cli#0(can spill: false) consumed 0.0 B, peak 0.0 B.
Error: Failed to allocate additional 128.0 KB for ExternalSorter[0] with 0.0 B already allocated for this reservation - 0.0 B remain available for the total memory pool: fair(pool_size: 10.0 MB)
```

## What changes are included in this PR?

- Add a name property to MemoryPool instances.
- Expose the MemoryPool in use in Resources Exhausted error messages.

## Are these changes tested?

Yes, and existing test cases are updated.

## Are there any user-facing changes?

Yes, the Resources Exhausted error messages are updated.
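
The selection rule described in the rationale, and the pool suffix seen in the example output, can be modelled with a small standalone sketch. It does not use the DataFusion crate: `PoolKind`, `select_pool`, `human_size`, and `describe_pool` are hypothetical names, the `greedy(...)` and `fair(...)` shapes are copied from the logs above, and the text printed for the unbounded pool is an assumption because the description does not show it.

```rust
/// Hypothetical model of the three pool kinds named in the description.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PoolKind {
    Greedy,
    Fair,
    Unbounded,
}

/// Selection rule from the description: an explicit pool wins, a bare
/// memory limit implies `greedy`, and nothing set implies `unbounded`.
fn select_pool(memory_limit: Option<usize>, requested: Option<PoolKind>) -> PoolKind {
    match (memory_limit, requested) {
        (_, Some(kind)) => kind,
        (Some(_), None) => PoolKind::Greedy,
        (None, None) => PoolKind::Unbounded,
    }
}

/// Format a byte count with one decimal place, using 1024-based units.
fn human_size(bytes: usize) -> String {
    const UNITS: [&str; 4] = ["B", "KB", "MB", "GB"];
    let mut value = bytes as f64;
    let mut unit = 0;
    while value >= 1024.0 && unit < UNITS.len() - 1 {
        value /= 1024.0;
        unit += 1;
    }
    format!("{:.1} {}", value, UNITS[unit])
}

/// Build the pool suffix of the error message.
fn describe_pool(kind: PoolKind, used: usize, pool_size: usize) -> String {
    match kind {
        PoolKind::Greedy => format!(
            "greedy(used: {}, pool_size: {})",
            human_size(used),
            human_size(pool_size)
        ),
        PoolKind::Fair => format!("fair(pool_size: {})", human_size(pool_size)),
        // Assumption: the source does not show the unbounded pool's output.
        PoolKind::Unbounded => "unbounded".to_string(),
    }
}
```

For the 10M limit used in Case1, `describe_pool(select_pool(Some(10 << 20), None), 10 << 20, 10 << 20)` produces the same `greedy(used: 10.0 MB, pool_size: 10.0 MB)` suffix that appears in the log.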

PsiACE and others added 30 commits August 15, 2024 07:23

disable with_create_default_catalog_and_schema if the default catal…

cb3ec77

…og exists (apache#11991)

Use tracked-consumers memory pool be the default. (apache#11949)

4baa901

* feat(11523): set the default memory pool to the tracked-consumer pool * test(11523): update tests for the OOM message including the top consumers * chore(11523): remove duplicate wording from OOM messages

Update REVERSE scalar function to support Utf8View (apache#11973)

06bcf33

Support partial aggregation skip for boolean functions (apache#11847)

41f6dd9

* partial aggr for bool_*() * Use null filter

Update SPLIT_PART scalar function to support Utf8View (apache#11975)

c1fb989

Handle arguments checking of min/max function to avoid crashes (a…

36158b6

…pache#12016) * Handle arguments checking of min/max function to avoid crashes Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com> * Fix code format error --------- Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com>

Fix: support NULL input for regular expression comparison operations (a…

19ad53d

…pache#11985)

Remove physical sort parameters on aggregate window functions (apache…

9d1cf74

…#12009) * Remove order_by on aggregate window functions since that operation is handled by the window function * Add unit test for window functions using udaf with ordering * Resolve clippy warning

Fix: support NULL input for like operations (apache#12025)

58075e2

Minor: Add error tests for min/max with 2 arguments (apache#12024)

57d5e0e

fix: incorrect aggregation result of bool_and (apache#12017)

5db036e

support Utf8View (apache#12019)

300a08c

Minor: Remove wrong comment on Accumulator::evaluate and `Accumulat…

08f6e54

…or::state` (apache#12001) * Remove wrong comment * Remove wrong comment on Accumulator::state * Not call twice comment * Adjust comment order

Minor: cleanup .gitignore (apache#12035)

dc84fa5

catalog.has_header true by default (apache#11919)

b06e8b0

Update to sqlparser-rs v0.50.0 (apache#12014)

7fa7689

* Support HEAD of sqlparser main * special case ID as a non-keyword when unparsing * fix EXTRACT expresssions * TODO REVERT: comment out failing test Making this commit just to let tests progress. * use sqlparser-rs v0.50.0

Minor: make some physical-plan properties public (apache#12022)

186ba4c

* Minor: make some physical-plan properties public * add Default for GroupOrderingFull * make groups and null_expr private again * remove pub label

chore: improve variable naming conventions (apache#12042)

1f90b00

Fix: handle NULL input for regex match operations (apache#12028)

e84f343

Fix compilation, change row_number() expr_fn to 0 args (apache#12043)

cb1e3f0

Minor: Remove warning when building datafusion-cli from Dockerfile (a…

cd9237f

…pache#12018) Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com>

HuSen8891 and others added 18 commits August 21, 2024 13:03

Add test to verify count aggregate function should not be nullable (a…

ad583a8

…pache#12100)

Minor: Extract BatchCoalescer to its own module (apache#12047)

121f330

Add Utf8View support to STRPOS function (apache#12087)

c6be00d

* Add Utf8View support to STRPOS function * fix type inconsistency * fix type inconsistency * refactor tests

fix: ser/der fetch in CoalesceBatchesExec (apache#12107)

4c3b744

Minor: rename dictionary_coercion to `dictionary_comparison_coercio…

902f1c6

…n`, add comments (apache#12102)

feat: use Substrait's PrecisionTimestamp and PrecisionTimestampTz ins…

89cb6a2

…tead of deprecated Timestamp (apache#11597) * bump substrait-rs * consume and produce precisiontimestamps * bump substrait to latest * clippy * deprecate in 42, since we're already on 41

Improve split_part udf by using a GenericStringBuilder (apache#12093)

beb3d5a

* optimize code * optimize code

Fix compialtion on main (apache#12108)

9a1a92d

Update row_number.rs (apache#12110)

cafbb04

Add benchmark for SUBSTR to evaluate improvements using StringView (a…

b8b76bc

…pache#12111) * add a bench file substr.rs * taplo format --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Support HEAD of sqlparser main

9028d2b

adriangb and others added 2 commits August 23, 2024 13:59

Support arrow @ HEAD

620fa70

Support extract on intervals

6140c5f

Signed-off-by: Nick Cameron <nrc@ncameron.org>
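
To make the idea of extracting a field from an interval concrete, here is a small sketch over a month/day/nanosecond representation. It is not the implementation from this commit: `MonthDayNano`, `Part`, and `extract` are hypothetical, and the field semantics (years and months derived from the month count, hours, minutes, and seconds derived from the nanosecond count, with no carrying between components) are an assumption modelled on common SQL behavior.

```rust
/// Hypothetical interval with separate month, day, and nanosecond components.
#[derive(Debug, Clone, Copy)]
struct MonthDayNano {
    months: i32,
    days: i32,
    nanos: i64,
}

/// Fields that can be extracted in this sketch.
#[derive(Debug, Clone, Copy)]
enum Part {
    Year,
    Month,
    Day,
    Hour,
    Minute,
    Second,
}

const NANOS_PER_SECOND: i64 = 1_000_000_000;

/// Extract one field; components are not normalized into each other
/// (for example, days are never folded into months).
fn extract(part: Part, interval: MonthDayNano) -> i64 {
    let total_seconds = interval.nanos / NANOS_PER_SECOND;
    match part {
        Part::Year => i64::from(interval.months / 12),
        Part::Month => i64::from(interval.months % 12),
        Part::Day => i64::from(interval.days),
        Part::Hour => total_seconds / 3600,
        Part::Minute => (total_seconds / 60) % 60,
        Part::Second => total_seconds % 60,
    }
}
```

Under these assumptions, an interval of 14 months, 3 days, and 3661 seconds yields year 1, month 2, day 3, hour 1, minute 1, and second 1.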

nrc force-pushed the logfire-interval-extract branch from 0276b57 to 6140c5f Compare August 23, 2024 02:24

davidhewitt deleted the branch pydantic:logfire August 29, 2024 08:01

davidhewitt closed this Aug 29, 2024

davidhewitt reopened this Aug 29, 2024

davidhewitt closed this Sep 12, 2024

adriangb pushed a commit that referenced this pull request Jan 31, 2025

Merge pull request #1 from pydantic/pr-feedback

190db1f

PR feedback on apache#14057

nrc wants to merge 74 commits into pydantic:logfire from nrc:logfire-interval-extract

20 participants
