-
Notifications
You must be signed in to change notification settings - Fork 3k
spark: supplement spark-queries.md doc #3482
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
site/docs/spark-queries.md
Outdated
| ### All Data Files | ||
|
|
||
| To show a table's valid data files and each file's metadata, run: | ||
| > tip: a valid data file is one that is readable from any snapshot currently tracked by the table. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
notes in the doc are in the form of !!! Note, example: https://github.com/apache/iceberg/blob/master/site/docs/spark-writes.md?plain=1#L298-L301
Also we should add the warning that this table may return duplicate rows.
site/docs/spark-queries.md
Outdated
| ### All Manifests | ||
|
|
||
| To show a table's valid file manifests and each file's metadata, run: | ||
| > tip: a valid manifest file is one that is referenced from any snapshot currently tracked by the table. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar comment as above, this table may return duplicate rows.
site/docs/spark-queries.md
Outdated
| ### All Entries | ||
|
|
||
| To show a table's valid manifest entries as rows, for both delete and data files, run: | ||
| > tip: a valid manifest file is one that is referenced from any snapshot currently tracked by the table. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar comment as above, this table may return duplicate rows.
|
I have modify it |
site/docs/spark-queries.md
Outdated
| ``` | ||
|
|
||
| Note: | ||
| !!! Note |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this create an empty note?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should be in a note callout block. The previous formatting was just a paragraph, "Note:" with a list is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KnightChess, can you fix this?
site/docs/spark-queries.md
Outdated
| a valid manifest file is one that is referenced from any snapshot currently tracked by the table. | ||
|
|
||
| !!! WARNING | ||
| this table may return **duplicate** rows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documentation should use correct capitalization. In this case, sentences should start with a capital letter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there are several cases of sentences that don't start with capitals, so be sure to fix all of them, please!
site/docs/spark-queries.md
Outdated
| +----------------+------------+----------+ | ||
| ``` | ||
|
|
||
| ### Entries |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we support this, I'm not sure that we want this in the standard documentation for Spark users. The entries table will show all entries in manifests, for both deleted and live files in the table. I think interpreting this table is more of an advanced use.
What do you think @samredai?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 I totally agree. I can't see a reason for a typical user to use this table and having it in this section of the docs could be a signal that they should be using it. There's also an instinct of data engineers coming from Hive tables to want to find the location of the data files and start messing around there so it's beneficial if we strengthen that abstraction by example in the docs (maybe even explicitly call it out)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I am late. I personally use this and have pointed others to build on this, because it's currently the only way to track when is the last time data is added, or how much data is added, to a partition.
But that being said, do understand it's a bit too complicated to document in the standard guide here, though just wondering what way advanced users can ever figure this out.
site/docs/spark-queries.md
Outdated
| +-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+------------------+-----------------+-----------------+--------------+---------------+ | ||
| ``` | ||
|
|
||
| ### All Data Files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to have a separate section for the "all" tables. These aren't needed nearly as often and require warnings, as you have in these sections.
site/docs/spark-queries.md
Outdated
|
|
||
| ### All Data Files | ||
|
|
||
| To show a table's valid data files and each file's metadata, run: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is a good idea to end a paragraph with "run:" and then put two callouts between it and the SQL that it is referencing.
site/docs/spark-queries.md
Outdated
| | 0|s3:.../dt=20210102/xxx.parquet| PARQUET|{20210102}| 14| 2444|{1 -> 94, 2 -> 17}|{1 -> 14, 2 -> 14}| {1 -> 0, 2 -> 0}| {}|{1 -> 1, 2 -> 20210102}|{1 -> 2, 2 -> 20210102}| null| [4]| null| 0| | ||
| | 0|s3:.../dt=20210103/xxx.parquet| PARQUET|{20210103}| 14| 2444|{1 -> 94, 2 -> 17}|{1 -> 14, 2 -> 14}| {1 -> 0, 2 -> 0}| {}|{1 -> 1, 2 -> 20210103}|{1 -> 3, 2 -> 20210103}| null| [4]| null| 0| | ||
| | 0|s3:.../dt=20210104/xxx.parquet| PARQUET|{20210104}| 14| 2444|{1 -> 94, 2 -> 17}|{1 -> 14, 2 -> 14}| {1 -> 0, 2 -> 0}| {}|{1 -> 1, 2 -> 20210104}|{1 -> 3, 2 -> 20210104}| null| [4]| null| 0| | ||
| +-------+------------------------------+-----------+----------+------------+------------------+------------------+------------------+------------------+----------------+-----------------------+-----------------------+------------+-------------+------------+-------------+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you fix the formatting of this table? You can see my recommendations on how to get a nicely formatted table here: #3422 (comment)
|
site/docs/spark-queries.md
Outdated
| The metadata is readable from any snapshot currently tracked by the table | ||
|
|
||
| !!! WARNING | ||
| The table may return **duplicate** rows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is "the table"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my mistake, the table's metadata
b5e6fc2 to
369f55c
Compare
site/docs/spark-queries.md
Outdated
| SELECT * FROM prod.db.table.history | ||
| ``` | ||
| ```text | ||
| +-------------------------+---------------------+---------------------+---------------------+ | ||
| | made_current_at | snapshot_id | parent_id | is_current_ancestor | | ||
| +-------------------------+---------------------+---------------------+---------------------+ | ||
|
|
||
| <div class="markdown-table-container" markdown="block"> | ||
| | made_current_at | snapshot_id | parent_id | is_current_ancestor | | ||
| | -- | -- | -- | -- | | ||
| | 2019-02-08 03:29:51.215 | 5781947118336215154 | NULL | true | | ||
| | 2019-02-08 03:47:55.948 | 5179299526185056830 | 5781947118336215154 | true | | ||
| | 2019-02-09 16:24:30.13 | 296410040247533544 | 5179299526185056830 | false | | ||
| | 2019-02-09 16:32:47.336 | 2999875608062437330 | 5179299526185056830 | true | | ||
| | 2019-02-09 19:42:03.919 | 8924558786060583479 | 2999875608062437330 | true | | ||
| | 2019-02-09 19:49:16.343 | 6536733823181975045 | 8924558786060583479 | true | | ||
| +-------------------------+---------------------+---------------------+---------------------+ | ||
| ``` | ||
| </div> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@szehon-ho here, and I use the css test in my local branch
.markdown-table-container {
width: 780px;
overflow-x: scroll;
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out @KnightChess ! If we're going to use the container class for all markdown tables, we can use overflow-x: auto; instead and it will only show the bar when there's overflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To prevent a merge conflict, I'll mention this over in PR #3422 so the css can be updated there.
19e3035 to
f0b198c
Compare
site/docs/spark-queries.md
Outdated
| </div> | ||
|
|
||
| !!! Note | ||
| Diffrent from metadata table [files](#files), This table exposes internal details, like files that have been deleted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this table (lower case)
site/docs/spark-queries.md
Outdated
| </div> | ||
|
|
||
| !!! Note | ||
| Like metadata table [entries](#entries), contain delete data files too, different from [all_data_files](#all-data-files) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this sentence sounds a bit awkward, what about:
Like metadata table [entries](#entries), this metadata table is different from [all_data_files](#all-data-files) and contains deleted data files.
|
@KnightChess any updates? |
sorry, my fault. There was no change in the email I received so I did not confirm on github. I will do it today |
|
@jackye1995 cc |
jackye1995
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the quick update, looks good to me
site/docs/spark-queries.md
Outdated
| !!! Note | ||
| Like metadata table [entries](#entries), this metadata table is different from [all_data_files](#all-data-files) and contains deleted data files. | ||
|
|
||
| !!! WARNING |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be "Warning" rather than all caps.
3814a2b to
aa6570b
Compare
site/docs/spark-queries.md
Outdated
| ### All Metadata | ||
|
|
||
| !!! Note | ||
| The metadata is readable from any snapshot currently tracked by the table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't make sense to me.
site/docs/spark-queries.md
Outdated
| | {20211002, 10}| 1| 1| | ||
| </div> | ||
|
|
||
| ### All Metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be "All Metadata Tables" (note plural tables)
And this should have a paragraph explaining that these tables are unions of the metadata tables that are specific to a snapshot.
site/docs/spark-queries.md
Outdated
| +----------------+------------+----------+ | ||
| ``` | ||
|
|
||
| ### Entries |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I am late. I personally use this and have pointed others to build on this, because it's currently the only way to track when is the last time data is added, or how much data is added, to a partition.
But that being said, do understand it's a bit too complicated to document in the standard guide here, though just wondering what way advanced users can ever figure this out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @KnightChess , if you still have time, could you rebase and address the comments? I would like to move forward on it, I think it's a valuable change.
Agree, I thought of adding this doc few months ago, but because this PR was already there, I was waiting for it to getting merged. @KnightChess : If you don't have time, you can let us know. I can add you as a coauthor and take this PR forward. |
|
@szehon-ho @ajantha-bhat sorry for the late reply, I will take this PR forward in the next few days. |
Awesome. Looking forward to it. |
szehon-ho
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @KnightChess for continuing it, replied to some of the comments, and look forward to the changes. I'll be out next week, but can look if i have time, or after that.
b821ae0 to
998f271
Compare
|
@szehon-ho update other except Advanced Usage, and adapt new doc styple for hugo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for delay, I just got back from vacation. Thanks @KnightChess for rebase/fix, and others for confirming the earlier comment.
I made another round of review and some smaller comments, can you please take a look? The change is getting close, and should be good soon. Thanks again.
szehon-ho
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @KnightChess I built and found just one broken link in the new doc, should be good after this.
|
Merged, thanks @KnightChess for your patience and contribution |
|
thanks @szehon-ho @rdblue @jackye1995 @samredai a lot for instruct patiently |
|
Thanks @KnightChess! |


#3465
@jackye1995