Skip to content

Conversation

@KnightChess
Copy link
Contributor

@github-actions github-actions bot added the docs label Nov 6, 2021
### All Data Files

To show a table's valid data files and each file's metadata, run:
> tip: a valid data file is one that is readable from any snapshot currently tracked by the table.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

notes in the doc are in the form of !!! Note, example: https://github.com/apache/iceberg/blob/master/site/docs/spark-writes.md?plain=1#L298-L301

Also we should add the warning that this table may return duplicate rows.

### All Manifests

To show a table's valid file manifests and each file's metadata, run:
> tip: a valid manifest file is one that is referenced from any snapshot currently tracked by the table.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar comment as above, this table may return duplicate rows.

### All Entries

To show a table's valid manifest entries as rows, for both delete and data files, run:
> tip: a valid manifest file is one that is referenced from any snapshot currently tracked by the table.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar comment as above, this table may return duplicate rows.

@KnightChess
Copy link
Contributor Author

I have modify it

```

Note:
!!! Note
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this create an empty note?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I will resole it like this:
image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be in a note callout block. The previous formatting was just a paragraph, "Note:" with a list is better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KnightChess, can you fix this?

a valid manifest file is one that is referenced from any snapshot currently tracked by the table.

!!! WARNING
this table may return **duplicate** rows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation should use correct capitalization. In this case, sentences should start with a capital letter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there are several cases of sentences that don't start with capitals, so be sure to fix all of them, please!

+----------------+------------+----------+
```

### Entries
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we support this, I'm not sure that we want this in the standard documentation for Spark users. The entries table will show all entries in manifests, for both deleted and live files in the table. I think interpreting this table is more of an advanced use.

What do you think @samredai?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 I totally agree. I can't see a reason for a typical user to use this table and having it in this section of the docs could be a signal that they should be using it. There's also an instinct of data engineers coming from Hive tables to want to find the location of the data files and start messing around there so it's beneficial if we strengthen that abstraction by example in the docs (maybe even explicitly call it out)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I am late. I personally use this and have pointed others to build on this, because it's currently the only way to track when is the last time data is added, or how much data is added, to a partition.

But that being said, do understand it's a bit too complicated to document in the standard guide here, though just wondering what way advanced users can ever figure this out.

+-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+------------------+-----------------+-----------------+--------------+---------------+
```

### All Data Files
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to have a separate section for the "all" tables. These aren't needed nearly as often and require warnings, as you have in these sections.


### All Data Files

To show a table's valid data files and each file's metadata, run:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is a good idea to end a paragraph with "run:" and then put two callouts between it and the SQL that it is referencing.

| 0|s3:.../dt=20210102/xxx.parquet| PARQUET|{20210102}| 14| 2444|{1 -> 94, 2 -> 17}|{1 -> 14, 2 -> 14}| {1 -> 0, 2 -> 0}| {}|{1 -> 1, 2 -> 20210102}|{1 -> 2, 2 -> 20210102}| null| [4]| null| 0|
| 0|s3:.../dt=20210103/xxx.parquet| PARQUET|{20210103}| 14| 2444|{1 -> 94, 2 -> 17}|{1 -> 14, 2 -> 14}| {1 -> 0, 2 -> 0}| {}|{1 -> 1, 2 -> 20210103}|{1 -> 3, 2 -> 20210103}| null| [4]| null| 0|
| 0|s3:.../dt=20210104/xxx.parquet| PARQUET|{20210104}| 14| 2444|{1 -> 94, 2 -> 17}|{1 -> 14, 2 -> 14}| {1 -> 0, 2 -> 0}| {}|{1 -> 1, 2 -> 20210104}|{1 -> 3, 2 -> 20210104}| null| [4]| null| 0|
+-------+------------------------------+-----------+----------+------------+------------------+------------------+------------------+------------------+----------------+-----------------------+-----------------------+------------+-------------+------------+-------------+
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you fix the formatting of this table? You can see my recommendations on how to get a nicely formatted table here: #3422 (comment)

@KnightChess
Copy link
Contributor Author

@rdblue

  1. I have update all table format
  2. use three-level title all metadata to split it
  3. other fix

The metadata is readable from any snapshot currently tracked by the table

!!! WARNING
The table may return **duplicate** rows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "the table"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my mistake, the table's metadata

@KnightChess KnightChess force-pushed the doc-for-system-tables branch from b5e6fc2 to 369f55c Compare November 10, 2021 05:04
Comment on lines 147 to 159
SELECT * FROM prod.db.table.history
```
```text
+-------------------------+---------------------+---------------------+---------------------+
| made_current_at | snapshot_id | parent_id | is_current_ancestor |
+-------------------------+---------------------+---------------------+---------------------+

<div class="markdown-table-container" markdown="block">
| made_current_at | snapshot_id | parent_id | is_current_ancestor |
| -- | -- | -- | -- |
| 2019-02-08 03:29:51.215 | 5781947118336215154 | NULL | true |
| 2019-02-08 03:47:55.948 | 5179299526185056830 | 5781947118336215154 | true |
| 2019-02-09 16:24:30.13 | 296410040247533544 | 5179299526185056830 | false |
| 2019-02-09 16:32:47.336 | 2999875608062437330 | 5179299526185056830 | true |
| 2019-02-09 19:42:03.919 | 8924558786060583479 | 2999875608062437330 | true |
| 2019-02-09 19:49:16.343 | 6536733823181975045 | 8924558786060583479 | true |
+-------------------------+---------------------+---------------------+---------------------+
```
</div>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@szehon-ho here, and I use the css test in my local branch

.markdown-table-container {
  width: 780px;
  overflow-x: scroll;
}

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out @KnightChess ! If we're going to use the container class for all markdown tables, we can use overflow-x: auto; instead and it will only show the bar when there's overflow.

Copy link
Contributor

@samredai samredai Nov 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To prevent a merge conflict, I'll mention this over in PR #3422 so the css can be updated there.

@KnightChess KnightChess force-pushed the doc-for-system-tables branch from 19e3035 to f0b198c Compare November 17, 2021 02:46
@jackye1995 jackye1995 added this to the Iceberg 0.13.0 Release milestone Nov 18, 2021
</div>

!!! Note
Diffrent from metadata table [files](#files), This table exposes internal details, like files that have been deleted.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this table (lower case)

</div>

!!! Note
Like metadata table [entries](#entries), contain delete data files too, different from [all_data_files](#all-data-files)
Copy link
Contributor

@jackye1995 jackye1995 Nov 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sentence sounds a bit awkward, what about:

Like metadata table [entries](#entries), this metadata table is different from [all_data_files](#all-data-files) and contains deleted data files.

@jackye1995
Copy link
Contributor

@KnightChess any updates?

@KnightChess
Copy link
Contributor Author

@KnightChess any updates?

sorry, my fault. There was no change in the email I received so I did not confirm on github. I will do it today

@KnightChess
Copy link
Contributor Author

@jackye1995 cc

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the quick update, looks good to me

@jackye1995 jackye1995 requested review from rdblue and samredai December 7, 2021 02:34
!!! Note
Like metadata table [entries](#entries), this metadata table is different from [all_data_files](#all-data-files) and contains deleted data files.

!!! WARNING
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be "Warning" rather than all caps.

@KnightChess KnightChess force-pushed the doc-for-system-tables branch from 3814a2b to aa6570b Compare December 24, 2021 08:12
@KnightChess
Copy link
Contributor Author

update and fix conflict from #3422 , the style for markdown table look good.
cc: @rdblue @samredai

### All Metadata

!!! Note
The metadata is readable from any snapshot currently tracked by the table
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't make sense to me.

| {20211002, 10}| 1| 1|
</div>

### All Metadata
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be "All Metadata Tables" (note plural tables)

And this should have a paragraph explaining that these tables are unions of the metadata tables that are specific to a snapshot.

+----------------+------------+----------+
```

### Entries
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I am late. I personally use this and have pointed others to build on this, because it's currently the only way to track when is the last time data is added, or how much data is added, to a partition.

But that being said, do understand it's a bit too complicated to document in the standard guide here, though just wondering what way advanced users can ever figure this out.

Copy link
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @KnightChess , if you still have time, could you rebase and address the comments? I would like to move forward on it, I think it's a valuable change.

@ajantha-bhat
Copy link
Member

Hi @KnightChess , if you still have time, could you rebase and address the comments? I would like to move forward on it, I think it's a valuable change.

Agree, I thought of adding this doc few months ago, but because this PR was already there, I was waiting for it to getting merged.

@KnightChess : If you don't have time, you can let us know. I can add you as a coauthor and take this PR forward.

@KnightChess
Copy link
Contributor Author

@szehon-ho @ajantha-bhat sorry for the late reply, I will take this PR forward in the next few days.

@ajantha-bhat
Copy link
Member

@szehon-ho @ajantha-bhat sorry for the late reply, I will take this PR forward in the next few days.

Awesome. Looking forward to it.

Copy link
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @KnightChess for continuing it, replied to some of the comments, and look forward to the changes. I'll be out next week, but can look if i have time, or after that.

@KnightChess KnightChess force-pushed the doc-for-system-tables branch from b821ae0 to 998f271 Compare March 26, 2022 09:57
@KnightChess
Copy link
Contributor Author

@szehon-ho update other except Advanced Usage, and adapt new doc styple for hugo

Copy link
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for delay, I just got back from vacation. Thanks @KnightChess for rebase/fix, and others for confirming the earlier comment.

I made another round of review and some smaller comments, can you please take a look? The change is getting close, and should be good soon. Thanks again.

Copy link
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @KnightChess I built and found just one broken link in the new doc, should be good after this.

@szehon-ho szehon-ho merged commit 49dad0c into apache:master Apr 8, 2022
@szehon-ho
Copy link
Member

Merged, thanks @KnightChess for your patience and contribution

@KnightChess
Copy link
Contributor Author

thanks @szehon-ho @rdblue @jackye1995 @samredai a lot for instruct patiently

@samredai
Copy link
Contributor

samredai commented Apr 9, 2022

Thanks @KnightChess!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants