Merged
69 commits
aa5a136
[FEAT]register table using iceberg metadata file via pyiceberg (#711)
MehulBatra May 22, 2024
5537cb4
modify doc(backward compatibility) typo (#757)
SeungyeopShin May 23, 2024
e917660
Bump requests from 2.32.1 to 2.32.2 (#759)
dependabot[bot] May 23, 2024
7083b2e
Bump griffe from 0.45.0 to 0.45.1 (#760)
dependabot[bot] May 23, 2024
03a0d65
Bump mypy-boto3-glue from 1.34.88 to 1.34.110 (#761)
dependabot[bot] May 23, 2024
996afd0
Bump mkdocstrings-python from 1.10.2 to 1.10.3 (#762)
dependabot[bot] May 23, 2024
eba4bee
Initial implementation of the manifest table (#717)
geruh May 23, 2024
42afc43
Fix: Table-Exists if Server returns 204 (#739)
c-thiel May 23, 2024
959718a
Bump duckdb from 0.10.2 to 0.10.3 (#764)
dependabot[bot] May 25, 2024
ed83e84
Bump griffe from 0.45.1 to 0.45.2 (#765)
dependabot[bot] May 25, 2024
b8023d2
Bump typing-extensions from 4.11.0 to 4.12.0 (#767)
dependabot[bot] May 25, 2024
a132be1
Bump mkdocs-material from 9.5.24 to 9.5.25 (#770)
dependabot[bot] May 28, 2024
8968996
Add azure configuration variables (#745)
kevinzwang May 28, 2024
ee2a7c5
Bump moto from 5.0.7 to 5.0.8 (#771)
dependabot[bot] May 28, 2024
54aacb4
Bump coverage from 7.5.1 to 7.5.2 (#772)
dependabot[bot] May 28, 2024
756ae62
Introduce hierarchical namespaces into SqlCatalog (#591)
cccs-eric May 28, 2024
4fb8ba2
Bump coverage from 7.5.2 to 7.5.3 (#776)
dependabot[bot] May 29, 2024
ec8d7dc
Bump pydantic from 2.7.1 to 2.7.2 (#775)
dependabot[bot] May 29, 2024
7552e03
Bump requests from 2.32.2 to 2.32.3 (#778)
dependabot[bot] May 30, 2024
e08cc9d
Bump getdaft from 0.2.24 to 0.2.25 (#779)
dependabot[bot] May 30, 2024
d3ad61c
Remove `record_fields` from the `Record` class (#580)
Fokko May 30, 2024
cf3bf8a
Unify to double quotes using Ruff (#781)
HonahX May 30, 2024
91973f2
Bump moto from 5.0.8 to 5.0.9 (#783)
dependabot[bot] May 31, 2024
0339e7f
Support CreateTableTransaction for SqlCatalog (#684)
HonahX May 31, 2024
84a2c04
Support CreateTableTransaction for HiveCatalog (#683)
HonahX May 31, 2024
8d79664
Support viewfs scheme along side with hdfs (#777)
yothinix May 31, 2024
20f6afd
Update `fsspec.py`to respect `s3.signer.uri property` (#741)
c-thiel May 31, 2024
65a03d2
Support Appends with TimeTransform Partitions (#784)
sungwy May 31, 2024
31c6c23
Bump mypy-boto3-glue from 1.34.110 to 1.34.115 (#780)
dependabot[bot] Jun 2, 2024
e61ef57
Add `include_field_ids` flag in `schema_to_pyarrow` (#789)
sungwy Jun 3, 2024
18448fd
Support getting snapshot at or right before the given timestamp (#748)
chinmay-bhat Jun 3, 2024
a09b04c
Bump duckdb from 0.10.3 to 1.0.0 (#793)
dependabot[bot] Jun 4, 2024
3585778
Bump typing-extensions from 4.12.0 to 4.12.1 (#794)
dependabot[bot] Jun 4, 2024
a110368
Bump pydantic from 2.7.2 to 2.7.3 (#795)
dependabot[bot] Jun 4, 2024
9acad24
Bump mkdocs-material from 9.5.25 to 9.5.26 (#798)
dependabot[bot] Jun 6, 2024
0155405
Bump mypy-boto3-glue from 1.34.115 to 1.34.121 (#799)
dependabot[bot] Jun 6, 2024
33a0018
Bump typing-extensions from 4.12.1 to 4.12.2 (#802)
dependabot[bot] Jun 9, 2024
94e8a98
Bump getdaft from 0.2.25 to 0.2.27 (#801)
dependabot[bot] Jun 9, 2024
1b3673c
Set `AssertTableUUID` by default on a transaction (#804)
sungwy Jun 10, 2024
a6858f7
Bump pypa/cibuildwheel from 2.18.1 to 2.19.0 (#805)
dependabot[bot] Jun 11, 2024
df69165
Bump griffe from 0.45.2 to 0.45.3 (#806)
dependabot[bot] Jun 11, 2024
d01a7b5
Bump msal from 1.26.0 to 1.28.0 (#812)
dependabot[bot] Jun 12, 2024
de2b299
Bump azure-identity from 1.15.0 to 1.16.1 (#811)
dependabot[bot] Jun 12, 2024
194e2ef
Bump pydantic from 2.7.3 to 2.7.4 (#816)
dependabot[bot] Jun 14, 2024
2407a3c
Bump pypa/cibuildwheel from 2.19.0 to 2.19.1 (#814)
dependabot[bot] Jun 14, 2024
d4a4eed
Cast PyArrow schema to `large_*` types (#807)
sungwy Jun 14, 2024
c579e9f
Bump mypy-boto3-glue from 1.34.121 to 1.34.126 (#815)
dependabot[bot] Jun 14, 2024
1dde51a
Support snapshot management operations like creating tags by adding `…
chinmay-bhat Jun 15, 2024
772faad
Bump mkdocs-material from 9.5.26 to 9.5.27 (#826)
dependabot[bot] Jun 18, 2024
a32bd7b
Bump mypy-boto3-glue from 1.34.126 to 1.34.128 (#825)
dependabot[bot] Jun 18, 2024
f1e3107
Bump griffe from 0.45.3 to 0.46.1 (#824)
dependabot[bot] Jun 18, 2024
5c9fa7e
Bump urllib3 from 1.26.18 to 1.26.19 (#823)
dependabot[bot] Jun 18, 2024
a29491a
Remove recursive call from `ancestors_of` (#821)
ndrluis Jun 18, 2024
4c0d218
Bump mkdocstrings-python from 1.10.3 to 1.10.5 (#839)
dependabot[bot] Jun 20, 2024
25d5186
Bump griffe from 0.46.1 to 0.47.0 (#831)
dependabot[bot] Jun 20, 2024
a537d2a
Bump getdaft from 0.2.27 to 0.2.28 (#834)
dependabot[bot] Jun 20, 2024
4767d1d
Bump tenacity from 8.3.0 to 8.4.1 (#833)
dependabot[bot] Jun 20, 2024
a94463a
Bump sqlalchemy from 2.0.30 to 2.0.31 (#842)
dependabot[bot] Jun 20, 2024
34d08fd
Bump mypy-boto3-glue from 1.34.128 to 1.34.131 (#844)
dependabot[bot] Jun 21, 2024
2182060
Bump python-snappy from 0.7.1 to 0.7.2 (#843)
dependabot[bot] Jun 21, 2024
b8c5bb7
Support `Table.to_arrow_batch_reader` (#786)
sungwy Jun 21, 2024
e581b40
Github: Add 0.6.1 to issue template (#841)
Fokko Jun 24, 2024
8cdf4ab
🐛 Write fields instead of spec object (#846)
Fokko Jun 24, 2024
a6cd0cf
Bump tenacity from 8.4.1 to 8.4.2 (#852)
dependabot[bot] Jun 25, 2024
9cb3cd5
Metadata Log Entries metadata table (#667)
kevinjqliu Jun 26, 2024
132208a
Bump coverage from 7.5.3 to 7.5.4 (#854)
dependabot[bot] Jun 26, 2024
4049971
Add mkdocs toc config section (#858)
uatach Jun 26, 2024
0e381fa
Add history inspect table (#828)
ndrluis Jun 26, 2024
60458ab
Merge branch 'metadata-files-table' into merge-metadata-conflicts
Gowthami03B Jun 27, 2024
3 changes: 2 additions & 1 deletion .github/ISSUE_TEMPLATE/iceberg_bug_report.yml
@@ -9,7 +9,8 @@ body:
description: What Apache Iceberg version are you using?
multiple: false
options:
- "0.6.0 (latest release)"
- "0.6.1 (latest release)"
- "0.6.0"
- "0.5.0"
- "0.4.0"
- "0.3.0"
2 changes: 1 addition & 1 deletion .github/workflows/python-release.yml
@@ -59,7 +59,7 @@ jobs:
if: startsWith(matrix.os, 'ubuntu')

- name: Build wheels
uses: pypa/cibuildwheel@v2.18.1
uses: pypa/cibuildwheel@v2.19.1
with:
output-dir: wheelhouse
config-file: "pyproject.toml"
47 changes: 47 additions & 0 deletions dev/provision.py
@@ -342,3 +342,50 @@
(array(), map(), array(struct(1)))
"""
)

spark.sql(
f"""
CREATE OR REPLACE TABLE {catalog_name}.default.test_table_snapshot_operations (
number integer
)
USING iceberg
TBLPROPERTIES (
'format-version'='2'
);
"""
)

spark.sql(
f"""
INSERT INTO {catalog_name}.default.test_table_snapshot_operations
VALUES (1)
"""
)

spark.sql(
f"""
INSERT INTO {catalog_name}.default.test_table_snapshot_operations
VALUES (2)
"""
)

spark.sql(
f"""
DELETE FROM {catalog_name}.default.test_table_snapshot_operations
WHERE number = 2
"""
)

spark.sql(
f"""
INSERT INTO {catalog_name}.default.test_table_snapshot_operations
VALUES (3)
"""
)

spark.sql(
f"""
INSERT INTO {catalog_name}.default.test_table_snapshot_operations
VALUES (4)
"""
)
125 changes: 125 additions & 0 deletions mkdocs/docs/api.md
@@ -606,6 +606,100 @@ min_snapshots_to_keep: [[null,10]]
max_snapshot_age_in_ms: [[null,604800000]]
```

### Manifests

To show a table's current file manifests:

```python
table.inspect.manifests()
```

```
pyarrow.Table
content: int8 not null
path: string not null
length: int64 not null
partition_spec_id: int32 not null
added_snapshot_id: int64 not null
added_data_files_count: int32 not null
existing_data_files_count: int32 not null
deleted_data_files_count: int32 not null
added_delete_files_count: int32 not null
existing_delete_files_count: int32 not null
deleted_delete_files_count: int32 not null
partition_summaries: list<item: struct<contains_null: bool not null, contains_nan: bool, lower_bound: string, upper_bound: string>> not null
child 0, item: struct<contains_null: bool not null, contains_nan: bool, lower_bound: string, upper_bound: string>
child 0, contains_null: bool not null
child 1, contains_nan: bool
child 2, lower_bound: string
child 3, upper_bound: string
----
content: [[0]]
path: [["s3://warehouse/default/table_metadata_manifests/metadata/3bf5b4c6-a7a4-4b43-a6ce-ca2b4887945a-m0.avro"]]
length: [[6886]]
partition_spec_id: [[0]]
added_snapshot_id: [[3815834705531553721]]
added_data_files_count: [[1]]
existing_data_files_count: [[0]]
deleted_data_files_count: [[0]]
added_delete_files_count: [[0]]
existing_delete_files_count: [[0]]
deleted_delete_files_count: [[0]]
partition_summaries: [[ -- is_valid: all not null
-- child 0 type: bool
[false]
-- child 1 type: bool
[false]
-- child 2 type: string
["test"]
-- child 3 type: string
["test"]]]
```
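
As a rough illustration of reading this output, the sketch below tallies data and delete manifests from rows shaped like the ones above. It assumes the rows have already been converted to plain dictionaries (for instance with PyArrow's `Table.to_pylist()`) and that `content` follows the Iceberg convention of `0` for data manifests and `1` for delete manifests. The `summarize_manifests` helper and the sample rows are hypothetical.

```python
from collections import Counter

# Iceberg manifest content codes: 0 = data manifest, 1 = delete manifest.
CONTENT_NAMES = {0: "data", 1: "deletes"}


def summarize_manifests(rows):
    """Hypothetical helper: count manifests by kind and total the added data files."""
    kinds = Counter(CONTENT_NAMES.get(row["content"], "unknown") for row in rows)
    added = sum(row["added_data_files_count"] for row in rows)
    return {"manifests_by_kind": dict(kinds), "added_data_files": added}


# Sample rows mirroring two of the columns shown above (values are illustrative only).
rows = [
    {"content": 0, "added_data_files_count": 1},
    {"content": 0, "added_data_files_count": 2},
    {"content": 1, "added_data_files_count": 0},
]

summary = summarize_manifests(rows)
print(summary)
```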

### Metadata Log Entries

To show table metadata log entries:

```python
table.inspect.metadata_log_entries()
```

```
pyarrow.Table
timestamp: timestamp[ms] not null
file: string not null
latest_snapshot_id: int64
latest_schema_id: int32
latest_sequence_number: int64
----
timestamp: [[2024-04-28 17:03:00.214,2024-04-28 17:03:00.352,2024-04-28 17:03:00.445,2024-04-28 17:03:00.498]]
file: [["s3://warehouse/default/table_metadata_log_entries/metadata/00000-0b3b643b-0f3a-4787-83ad-601ba57b7319.metadata.json","s3://warehouse/default/table_metadata_log_entries/metadata/00001-f74e4b2c-0f89-4f55-822d-23d099fd7d54.metadata.json","s3://warehouse/default/table_metadata_log_entries/metadata/00002-97e31507-e4d9-4438-aff1-3c0c5304d271.metadata.json","s3://warehouse/default/table_metadata_log_entries/metadata/00003-6c8b7033-6ad8-4fe4-b64d-d70381aeaddc.metadata.json"]]
latest_snapshot_id: [[null,3958871664825505738,1289234307021405706,7640277914614648349]]
latest_schema_id: [[null,0,0,0]]
latest_sequence_number: [[null,0,0,0]]
```

### History

To show a table's history:

```python
table.inspect.history()
```

```
pyarrow.Table
made_current_at: timestamp[ms] not null
snapshot_id: int64 not null
parent_id: int64
is_current_ancestor: bool not null
----
made_current_at: [[2024-06-18 16:17:48.768,2024-06-18 16:17:49.240,2024-06-18 16:17:49.343,2024-06-18 16:17:49.511]]
snapshot_id: [[4358109269873137077,3380769165026943338,4358109269873137077,3089420140651211776]]
parent_id: [[null,4358109269873137077,null,4358109269873137077]]
is_current_ancestor: [[true,false,true,true]]
```
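
To make `is_current_ancestor` concrete, here is a minimal sketch that recomputes the flag for the sample output above by walking parent links back from the current snapshot. It assumes the last row is the table's current snapshot and uses a hypothetical `current_ancestors` helper; it is not PyIceberg's own implementation.

```python
def current_ancestors(current_id, parent_of):
    """Hypothetical helper: follow parent links iteratively from the current snapshot."""
    ancestors = set()
    snapshot_id = current_id
    while snapshot_id is not None and snapshot_id not in ancestors:
        ancestors.add(snapshot_id)
        snapshot_id = parent_of.get(snapshot_id)
    return ancestors


# Values copied from the sample output above.
snapshot_ids = [4358109269873137077, 3380769165026943338, 4358109269873137077, 3089420140651211776]
parent_ids = [None, 4358109269873137077, None, 4358109269873137077]

parent_of = dict(zip(snapshot_ids, parent_ids))
# Assumption: the most recent history entry is the table's current snapshot.
ancestors = current_ancestors(snapshot_ids[-1], parent_of)
flags = [snapshot_id in ancestors for snapshot_id in snapshot_ids]
print(flags)
```

The second row is flagged `false` because its snapshot is not reachable from the current snapshot through parent links.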

### Files

Inspect the data files in the current snapshot of the table:
@@ -994,6 +1088,28 @@ tbl.overwrite(df, snapshot_properties={"abc": "def"})
assert tbl.metadata.snapshots[-1].summary["abc"] == "def"
```

## Snapshot Management

Manage snapshots with operations through the `Table` API:

```python
# To run a specific operation
table.manage_snapshots().create_tag(snapshot_id, "tag123").commit()
# To run multiple operations
(
    table.manage_snapshots()
    .create_tag(snapshot_id1, "tag123")
    .create_tag(snapshot_id2, "tag456")
    .commit()
)
# Operations are applied on commit.
```

You can also use context managers to make more changes:

```python
with table.manage_snapshots() as ms:
    ms.create_branch(snapshot_id1, "Branch_A").create_tag(snapshot_id2, "tag789")
```
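
The deferred-commit behaviour can be pictured with a toy stand-in. The sketch below is not PyIceberg code: `ToyManageSnapshots` is a hypothetical class that only queues operations, applies them on `commit()`, and commits automatically when used as a context manager, which is the pattern the examples above rely on.

```python
class ToyManageSnapshots:
    """Hypothetical stand-in that mimics queue-then-commit chaining."""

    def __init__(self, refs):
        self._refs = refs      # committed refs: name -> (kind, snapshot_id)
        self._pending = []     # operations queued until commit()

    def create_tag(self, snapshot_id, name):
        self._pending.append((name, ("tag", snapshot_id)))
        return self            # returning self enables chaining

    def create_branch(self, snapshot_id, name):
        self._pending.append((name, ("branch", snapshot_id)))
        return self

    def commit(self):
        self._refs.update(self._pending)
        self._pending.clear()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Commit only if the block finished without an exception.
        if exc_type is None:
            self.commit()
        return False


refs = {}
pending = ToyManageSnapshots(refs).create_tag(1, "tag123")
refs_before_commit = dict(refs)   # nothing has been applied yet
pending.commit()

with ToyManageSnapshots(refs) as ms:
    ms.create_branch(1, "Branch_A").create_tag(2, "tag789")

print(refs_before_commit, refs)
```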

## Query the data

To query a table, a table scan is needed. A table scan accepts a filter, columns, optionally a limit and a snapshot ID:
@@ -1062,6 +1178,15 @@ tpep_dropoff_datetime: [[2021-04-01 00:47:59.000000,...,2021-05-01 00:14:47.0000

This will only pull in the files that might contain matching rows.

One can also return a PyArrow RecordBatchReader, if reading one record batch at a time is preferred:

```python
table.scan(
    row_filter=GreaterThanOrEqual("trip_distance", 10.0),
    selected_fields=("VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime"),
).to_arrow_batch_reader()
```

### Pandas

<!-- prettier-ignore-start -->
3 changes: 2 additions & 1 deletion mkdocs/docs/configuration.md
@@ -89,6 +89,7 @@ For the FileIO there are several configuration options available:
| s3.access-key-id | admin | Configure the static access key ID used to access the FileIO. |
| s3.secret-access-key | password | Configure the static secret access key used to access the FileIO. |
| s3.signer | bearer | Configure the signature version of the FileIO. |
| s3.signer.uri | http://my.signer:8080/s3 | Configure the remote signing URI if it differs from the catalog URI. Remote signing is only implemented for `FsspecFileIO`. The final request is sent to `<s3.signer.uri>/v1/aws/s3/sign`. |
| s3.region | us-west-2 | Sets the region of the bucket |
| s3.proxy-uri | http://my.proxy.com:8080 | Configure the proxy server to be used by the FileIO. |
| s3.connect-timeout | 60.0 | Configure socket connection timeout, in seconds. |
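
To illustrate how the signing endpoint described for `s3.signer.uri` is formed, a minimal sketch follows. The `signer_endpoint` helper and the `http://my.catalog:8181` address are hypothetical, and falling back to a `uri` property for the catalog URI is an assumption made for the example.

```python
def signer_endpoint(properties):
    """Hypothetical helper: build the remote signing URL from FileIO properties."""
    # Assumption: the catalog URI is stored under "uri" and is used when
    # "s3.signer.uri" is not set.
    base = properties.get("s3.signer.uri", properties["uri"])
    return f"{base.rstrip('/')}/v1/aws/s3/sign"


with_signer = signer_endpoint(
    {"uri": "http://my.catalog:8181", "s3.signer.uri": "http://my.signer:8080/s3"}
)
without_signer = signer_endpoint({"uri": "http://my.catalog:8181/"})
print(with_signer)
print(without_signer)
```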
@@ -298,4 +299,4 @@ PyIceberg uses multiple threads to parallelize operations. The number of workers

# Backward Compatibility

Previous versions of Java (`<1.4.0`) implementations incorrectly assume the optional attribute `current-snapshot-id` to be a required attribute in TableMetadata. This means that if `current-snapshot-id` is missing in the metadata file (e.g. on table creation), the application will throw an exception without being able to load the table. This assumption has been corrected in more recent Iceberg versions. However, it is possible to force PyIceberg to create a table with a metadata file that will be compatible with previous versions. This can be configured by setting the `legacy-current-snapshot-id` entry as "True" in the configuration file, or by setting the `LEGACY_CURRENT_SNAPSHOT_ID` environment variable. Refer to the [PR discussion](https://github.com/apache/iceberg-python/pull/473) for more details on the issue
Previous versions of Java (`<1.4.0`) implementations incorrectly assume the optional attribute `current-snapshot-id` to be a required attribute in TableMetadata. This means that if `current-snapshot-id` is missing in the metadata file (e.g. on table creation), the application will throw an exception without being able to load the table. This assumption has been corrected in more recent Iceberg versions. However, it is possible to force PyIceberg to create a table with a metadata file that will be compatible with previous versions. This can be configured by setting the `legacy-current-snapshot-id` entry as "True" in the configuration file, or by setting the `PYICEBERG_LEGACY_CURRENT_SNAPSHOT_ID` environment variable. Refer to the [PR discussion](https://github.com/apache/iceberg-python/pull/473) for more details on the issue
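
As a small illustration of the naming used here, the sketch below derives the environment variable from the configuration key and sets it for the current process. The `env_var_for` helper is hypothetical, and the rule it applies (prefix `PYICEBERG_`, upper-case, dashes to underscores) is an assumption inferred from the two names in this paragraph.

```python
import os


def env_var_for(config_key):
    """Hypothetical helper: map a top-level config key to its environment variable."""
    return "PYICEBERG_" + config_key.upper().replace("-", "_")


name = env_var_for("legacy-current-snapshot-id")
# Set the variable for the current process before the table is created.
os.environ[name] = "True"
print(name, os.environ[name])
```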
4 changes: 4 additions & 0 deletions mkdocs/docs/how-to-release.md
@@ -214,3 +214,7 @@ Thanks to everyone for contributing!
## Release the docs

A committer triggers the [`Python Docs` GitHub Actions](https://github.com/apache/iceberg-python/actions/workflows/python-ci-docs.yml) workflow through the UI by selecting the branch that has just been released. This will publish the new docs.

## Update the GitHub template

Make sure to create a PR to update the [GitHub issues template](https://github.com/apache/iceberg-python/blob/main/.github/ISSUE_TEMPLATE/iceberg_bug_report.yml) with the latest version.
3 changes: 3 additions & 0 deletions mkdocs/mkdocs.yml
@@ -53,8 +53,11 @@ theme:
toggle:
icon: material/brightness-4
name: Switch to light mode

markdown_extensions:
  - admonition
  - pymdownx.highlight:
      anchor_linenums: true
  - pymdownx.superfences
  - toc:
      permalink: true
6 changes: 3 additions & 3 deletions mkdocs/requirements.txt
@@ -16,13 +16,13 @@
# under the License.

mkdocs==1.6.0
griffe==0.45.0
griffe==0.47.0
jinja2==3.1.4
mkdocstrings==0.25.1
mkdocstrings-python==1.10.2
mkdocstrings-python==1.10.5
mkdocs-literate-nav==0.6.1
mkdocs-autorefs==1.0.1
mkdocs-gen-files==0.5.0
mkdocs-material==9.5.24
mkdocs-material==9.5.27
mkdocs-material-extensions==1.3.1
mkdocs-section-index==0.3.9