
feat: support with_row_id options for spark connector#3084

Closed
SaintBacchus wants to merge 6 commits into lance-format:main from SaintBacchus:SparkSupportRowId

Conversation

@SaintBacchus
Collaborator

  1. Java only has the signed long type for 64-bit integers, and the Spark Arrow reader can't read the _rowid column as uint64.
  2. So this wraps the Arrow batch stream, converting the uint64 values to int64 and changing the _rowid schema type from uint64 to int64.
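A minimal Rust sketch (illustrative, not the PR's code) of the underlying issue: a uint64 row id above i64::MAX has no direct representation in a signed 64-bit long, but a bit-for-bit reinterpretation keeps the value recoverable on the JVM side:

```rust
// Sketch of the signed/unsigned problem behind this PR (not actual PR code).
// Java has no unsigned 64-bit type, so a large _rowid must travel in a signed
// long; reinterpreting the bits preserves the value even though it prints
// as negative on the signed side.
fn main() {
    let row_id: u64 = 1u64 << 63;         // 9223372036854775808, > i64::MAX
    let as_signed = row_id as i64;        // bit-for-bit reinterpretation
    assert_eq!(as_signed, i64::MIN);      // appears negative as a signed long
    assert_eq!(as_signed as u64, row_id); // casting back recovers the original id
}
```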

@SaintBacchus SaintBacchus marked this pull request as draft November 4, 2024 12:04
@github-actions github-actions Bot added labels: enhancement (New feature or request), java on Nov 4, 2024
@github-actions
Contributor

github-actions Bot commented Nov 4, 2024

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@codecov-commenter

codecov-commenter commented Nov 4, 2024

Codecov Report

Attention: Patch coverage is 0% with 67 lines in your changes missing coverage. Please review.

Project coverage is 78.63%. Comparing base (ceaf49c) to head (620314b).
Report is 50 commits behind head on main.

Files with missing lines Patch % Lines
rust/lance-io/src/ffi.rs 0.00% 55 Missing ⚠️
java/core/lance-jni/src/blocking_scanner.rs 0.00% 12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3084      +/-   ##
==========================================
+ Coverage   77.31%   78.63%   +1.31%     
==========================================
  Files         240      243       +3     
  Lines       79322    82873    +3551     
  Branches    79322    82873    +3551     
==========================================
+ Hits        61326    65165    +3839     
- Misses      14821    14921     +100     
+ Partials     3175     2787     -388     
Flag Coverage Δ
unittests 78.63% <0.00%> (+1.31%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@broccoliSpicy broccoliSpicy requested a review from LuQQiu November 5, 2024 22:33
```rust
for field in res.clone().fields() {
    if field.name() == ROW_ID {
        let new_field = match field.data_type() {
            DataType::UInt64 => {
```
Contributor

Trying to understand why this change is needed: Arrow in Java also has DataType::UInt64, but Spark cannot convert UInt64 to a Spark type?

Could we modify the Spark type to Arrow type conversion directly, e.g. with an extension?

@SaintBacchus
Collaborator Author

[image]
@LuQQiu I plan to convert the type like this.

@LuQQiu
Contributor

LuQQiu commented Nov 11, 2024

@SaintBacchus Can we support a generic type conversion?
For example, convert all UInt64 in Rust to BigInt/Long for Java: no special handling for with_row_id, but a more generic answer to how we should process UInt64 in the Java/Spark stack.
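A toy Rust sketch of that generic approach (hypothetical; `DataType` and `Field` here are stand-ins for the arrow-rs types, and this is not the PR's code): rewrite every UInt64 field to Int64 in one schema pass instead of special-casing _rowid:

```rust
// Hypothetical sketch of a generic UInt64 -> Int64 schema pass.
// DataType and Field are toy stand-ins for the arrow-rs types.
#[derive(Clone, Debug, PartialEq)]
enum DataType { UInt64, Int64, Float64 }

#[derive(Clone, Debug, PartialEq)]
struct Field { name: String, data_type: DataType }

fn widen_unsigned(fields: &[Field]) -> Vec<Field> {
    fields.iter().map(|f| match f.data_type {
        // Map every unsigned 64-bit column (not just _rowid) to signed for the JVM.
        DataType::UInt64 => Field { name: f.name.clone(), data_type: DataType::Int64 },
        _ => f.clone(),
    }).collect()
}

fn main() {
    let schema = vec![
        Field { name: "_rowid".to_string(), data_type: DataType::UInt64 },
        Field { name: "score".to_string(), data_type: DataType::Float64 },
    ];
    let widened = widen_unsigned(&schema);
    assert_eq!(widened[0].data_type, DataType::Int64);   // _rowid widened
    assert_eq!(widened[1].data_type, DataType::Float64); // others untouched
}
```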

@broccoliSpicy broccoliSpicy changed the title feat: Support with_row_id options for spark connector feat: support with_row_id options for spark connector Nov 15, 2024
wjones127 pushed a commit that referenced this pull request Dec 6, 2024
As discussed in [PR](#3084), I have implemented the _rowid meta column only in the Java package.

Labels

enhancement New feature or request java

Projects

None yet

Development


4 participants