-
Notifications
You must be signed in to change notification settings - Fork 4.5k
fix(parquetio): handle missing nullable fields in row conversion #35948
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Add null value handling when converting rows to Arrow tables for nullable fields that are missing from input data. This fixes KeyError when writing to Parquet with missing nullable fields, addressing issue apache#35791.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary of Changes
Hello @liferoad, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request resolves a KeyError encountered in Apache Beam's Parquet I/O connector when writing data to Parquet files. Previously, if nullable fields were not present in the input data dictionaries, the write operation would fail. The implemented fix ensures that None is automatically used for such missing nullable fields, aligning with the defined schema and preventing runtime errors during data conversion to Arrow tables. A new test case has been added to validate this behavior.
Highlights
- Parquet Write Logic: Modified the
_ParquetWriteFnto gracefully handle cases where nullable fields are absent from the input row dictionaries. Instead of raising aKeyError, it now insertsNonefor these missing fields, ensuring robust data conversion to Arrow tables. - Test Coverage: Introduced a new test case,
test_write_with_nullable_fields_missing_data, which specifically validates the fix by simulating input data with missing nullable fields and asserting that the Parquet write operation completes successfully withNonevalues inserted as expected.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
|
Assigning reviewers: R: @shunping for label python. Note: If you would like to opt out of this review, comment Available commands:
The PR bot will only process comments in the main thread (not review comments). |
shunping
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* sdks/python: properly make milvus as extra dependency * sdks/python: update image requirements * .github: trigger postcommit python * sdks/python: fix linting issues * sdks/python: fix formatting issues * .github: trigger beam postcommit python * sdks/python: revert milvus version in itests * sdks/python: update image requirements * trigger_files: trigger postcommit python * Bump github.com/docker/go-connections from 0.5.0 to 0.6.0 in /sdks (#35906) Bumps [github.com/docker/go-connections](https://github.com/docker/go-connections) from 0.5.0 to 0.6.0. - [Commits](docker/go-connections@v0.5.0...v0.6.0) --- updated-dependencies: - dependency-name: github.com/docker/go-connections dependency-version: 0.6.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Add the readme link to new YAML examples (#35941) * Bump google.golang.org/api from 0.247.0 to 0.248.0 in /sdks (#35969) * Remove mysql-connector-python dependency (#35932) * Fix typos and update test implementation from #35656 (#35958) * implement lambda name pickling in cloudpickle * add enable_lambda_name to __init__ * fix formatting and lint * fix typo * fix code paths in test * fix tests * fix lint * fix formatting and failing test * fix formatting again * remove cloudpickle implementation to leave only typo fixes and fixing test structure. * fix _make_function typo * revert regex * fix failing tests * fix formatting * update prefix to not hardcode * feat(mongodb): upgrade MongoDB Java driver to version 5.5.0 (#35946) * feat(mongodb): upgrade MongoDB Java driver to version 5.5.0 Update MongoDB Java driver from 3.12.11 to 5.5.0 and refactor code to use new API Add mongo-bson dependency required by new driver version Replace deprecated MongoClient with MongoClients and update GridFS implementation * refactor(mongodb): update MongoDB client usage to modern API Replace deprecated MongoClient with MongoClients.create() and update database drop method * build(dependencies): add mongodb driver core dependency Add mongodb-driver-core to support MongoDB Java driver functionality. Also mark mongo_java_driver as permitUnusedDeclared and add testImplementation. * fix(mongodb): update embedded mongo version and fix split key filtering Update embedded MongoDB test dependency to version 3.5.4 and simplify split key filtering logic by using BsonObjectId for range queries. This ensures proper type handling when filtering MongoDB documents by _id field. * build: add mongodb-driver-core dependency Add mongodb-driver-core version 5.5.0 to support MongoDB Java driver functionality * use version * refactor: simplify mongo client creation logic Remove redundant null check and consolidate uri handling in MongoDbGridFSIO * Bump github.com/aws/aws-sdk-go-v2/credentials in /sdks (#35974) Bumps [github.com/aws/aws-sdk-go-v2/credentials](https://github.com/aws/aws-sdk-go-v2) from 1.18.6 to 1.18.7. - [Release notes](https://github.com/aws/aws-sdk-go-v2/releases) - [Changelog](https://github.com/aws/aws-sdk-go-v2/blob/config/v1.18.7/CHANGELOG.md) - [Commits](aws/aws-sdk-go-v2@config/v1.18.6...config/v1.18.7) --- updated-dependencies: - dependency-name: github.com/aws/aws-sdk-go-v2/credentials dependency-version: 1.18.7 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump google.golang.org/grpc from 1.74.2 to 1.75.0 in /sdks (#35971) Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.74.2 to 1.75.0. - [Release notes](https://github.com/grpc/grpc-go/releases) - [Commits](grpc/grpc-go@v1.74.2...v1.75.0) --- updated-dependencies: - dependency-name: google.golang.org/grpc dependency-version: 1.75.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Override localhost endpoint when a worker is running in docker on mac (#35964) * fix(parquetio): handle missing nullable fields in row conversion (#35948) * fix(parquetio): handle missing nullable fields in row conversion Add null value handling when converting rows to Arrow tables for nullable fields that are missing from input data. This fixes KeyError when writing to Parquet with missing nullable fields, addressing issue #35791. * fix lint * Bump cloud.google.com/go/storage from 1.56.0 to 1.56.1 in /sdks (#35980) Bumps [cloud.google.com/go/storage](https://github.com/googleapis/google-cloud-go) from 1.56.0 to 1.56.1. - [Release notes](https://github.com/googleapis/google-cloud-go/releases) - [Changelog](https://github.com/googleapis/google-cloud-go/blob/main/CHANGES.md) - [Commits](googleapis/google-cloud-go@spanner/v1.56.0...storage/v1.56.1) --- updated-dependencies: - dependency-name: cloud.google.com/go/storage dependency-version: 1.56.1 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * [Prism] Fix segv when docker container self-terminated. (#35977) * Fix segv when docker container is self-terminated * Add some debug logging for docker and process env. * add a jinja % include/import pipeline example to docs (#35931) * add a jinja include pipeline example * update yaml doc with import example * address gemini and other comments * fix table of contents for readme * add link to jinja pipeline examples * Bump github.com/aws/aws-sdk-go-v2/config from 1.31.2 to 1.31.3 in /sdks (#35983) * Add a security GCP log analyzer (#35922) * Add the base log_analyzer * Add github action for security logging * Enhance LogAnalyzer to filter logs by time range and include file names in event summary * Add dry-run option for weekly email report generation in LogAnalyzer * Better error handling for timezones and missing details * Refactor LogAnalyzer to use SinkCls for type consistency and enhance bucket permission management for log sinks * update py containers (#35982) * [YAML]: add import jinja pipeline example (#35945) * add import jinja pipeline example * revert name change * update overall examples readme * fix lint issue * fix gemini small issue * Update sdks/python/apache_beam/yaml/examples/transforms/jinja/import/README.md --------- Co-authored-by: tvalentyn <tvalentyn@users.noreply.github.com> * workflows: capture DinD tests in PreCommit Py Coverage workflow * workflows: temporarily removing `ubuntu-latest` till resolving deps * workflows: add `matrix.os` label to `beam_PreCommit_Python_Coverage` --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Mohamed Awnallah <mohamedmohey2352@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Chamikara Jayalath <chamikaramj@gmail.com> Co-authored-by: Yi Hu <yathu@google.com> Co-authored-by: kristynsmith <kristynsmith@google.com> Co-authored-by: liferoad <huxiangqian@gmail.com> Co-authored-by: Shunping Huang <shunping@google.com> Co-authored-by: Derrick Williams <derrickaw@google.com> Co-authored-by: Enrique Calderon <71863693+ksobrenat32@users.noreply.github.com> Co-authored-by: Ahmed Abualsaud <65791736+ahmedabu98@users.noreply.github.com> Co-authored-by: tvalentyn <tvalentyn@users.noreply.github.com>
Add null value handling when converting rows to Arrow tables for nullable fields that are missing from input data. This fixes KeyError when writing to Parquet with missing nullable fields.
Fixes #35791.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>instead.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.