(AWS) Docs: List all AWS S3 properties from all language impl.#11383
Neuw84 wants to merge 1 commit into apache:main
Added Amazon MSK Connect as an option. Added HTTP client advice for high-throughput scenarios. Added specific configs for data prefetching on EMR 7.1.0.
For versions after 7.1.0 there is a specific config that can be used to enable the data prefetch optimization. You just need to add the following property to your Spark config.

```shell
spark.sql.iceberg.data-prefetch.enabled=true
```
I don't believe this is an Iceberg property. If this is specific to EMR, I don't believe it should be included here.
Yes, this is specific to EMR (its internal Iceberg runtime); however, we are on the "aws" docs page.
Isn't it good to know that you can add this parameter to improve the performance of Iceberg workloads on EMR?
**Note that for workloads with exceptionally high throughput against tables on S3, where you will likely need to increase retries, you will also want to increase the number of connections for the HTTP client.**

```shell
spark.sql.catalog.my_catalog.http-client.apache.max-connections=200
```
This doesn't look like an Iceberg setting from what I can tell. If this is EMR specific, it should not be included here.
It comes from the AWS SDK and Spark (it is not specific to EMR). If you use Spark on your laptop to write to S3 and you hit this high-throughput write scenario, you will likely tune this parameter.
Any Spark runtime will use this (maybe the Photon runtime uses another S3 client, but I don't have that info :) ).
On the previous case, I agree with you that it is very specific to EMR and may or may not belong in the AWS "docs".
I mean, this page is about AWS docs (the parameter is quite specific to the S3 client of the AWS SDK).
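To make the laptop scenario concrete, here is a minimal sketch of how the property from the diff could be passed to a Spark job as a `--conf` flag. The catalog name `my_catalog` and the value `200` are simply the ones from the diff; the helper `catalog_conf_args` is hypothetical and exists only for this illustration, it is not part of Spark or Iceberg.

```python
# Sketch: render catalog-scoped properties as spark-submit --conf flags.
# Assumptions: catalog name "my_catalog" and value "200" are copied from the
# diff above; `catalog_conf_args` is a hypothetical helper, not a real API.

def catalog_conf_args(catalog: str, props: dict[str, str]) -> list[str]:
    """Return ['--conf', 'spark.sql.catalog.<catalog>.<key>=<value>', ...]."""
    args: list[str] = []
    for key, value in props.items():
        # Each property is prefixed with the Spark catalog namespace.
        args += ["--conf", f"spark.sql.catalog.{catalog}.{key}={value}"]
    return args


args = catalog_conf_args(
    "my_catalog",
    {"http-client.apache.max-connections": "200"},
)

# Print the command line that would carry the setting to the job.
print(" ".join(["spark-submit", *args]))
```

The right connection count depends on the workload, so 200 should be read as a starting point to tune rather than a recommendation.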
@Neuw84 it looks like we're duplicating what should be EMR documentation here. We already link off to the EMR docs, so I don't feel this is the right place for putting specific configuration info.
@danielcweeks let me know your thoughts on the comments (I agree on the EMR-specific one, although for me it does not hurt, as we are on the AWS docs page). What are your thoughts about the S3 client info?
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
As @hsiang-c made another pull request building a table here, I didn't want to collide.
Fixes List all AWS S3 properties in the docs #10674
Therefore, I added:
In my personal opinion, when using the AWS SDKs most of the properties shouldn't be there (there is a standard way to configure them, prioritize them, etc.). However, it is clear that using 3rd-party libraries in different languages requires info like the tables @hsiang-c has built.
The problem with this is that different libraries will have different configs (even in the same language).
In my opinion, instead of dividing by language, maybe divide by library (though here just adding a link to the corresponding doc page may be enough)? And have a separate section/table for the configs supported by the AWS SDKs (anything using the official libraries will have the same config, no matter the language)?
Thanks!