msq: add multi-stage-query docs #12983
Conversation
add back theta sketches tutorial; change filename; fix filename; fix link; fix headings
573436e to 748ab34
@2bethere @gianm @vogievetsky please review the OSS docs as a whole for MSQ. @cryptoe please review the known issues list to see if there's anything to be added or removed: https://github.com/apache/druid/pull/12983/files#diff-e83f8a116cd5e34642cfe3474e915dd586d4679547ac393a6dea654a78c9bbec
### PARTITIONED BY
INSERT and REPLACE queries require the PARTITIONED BY clause, which determines how time-based partitioning is done. In Druid, data is split into segments, one or more per time chunk defined by the PARTITIONED BY granularity. A good general rule is to adjust the granularity so that each segment contains about five million rows. Choose a granularity based on your ingestion rate. For example, if you ingest a million rows per day, PARTITIONED BY DAY is good. If you ingest a million rows an hour, choose PARTITIONED BY HOUR instead.
Should we make 5 million bold?
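For reference, a minimal sketch of the clause being discussed; the datasource name, source URI, and columns below are hypothetical, not from the PR:

```sql
-- Hypothetical example: ingesting ~1M rows/day, so PARTITIONED BY DAY
-- keeps segments near the suggested five-million-row target.
INSERT INTO "wikipedia_events"
SELECT
  TIME_PARSE("timestamp") AS __time,
  "channel",
  "page"
FROM TABLE(
  EXTERN(
    '{"type": "http", "uris": ["https://example.com/wikipedia.json.gz"]}',
    '{"type": "json"}',
    '[{"name": "timestamp", "type": "string"}, {"name": "channel", "type": "string"}, {"name": "page", "type": "string"}]'
  )
)
PARTITIONED BY DAY
```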
- EXTERN does not accept `druid` input sources.
## Missing guardrails
Maximum number of input files. No guardrail today means the controller can potentially run out of memory tracking them all.
> SQL-based ingestion using the multi-stage query task engine is our recommended solution starting in Druid 24.0. Alternative ingestion solutions, such as native batch and Hadoop-based ingestion systems, will still be supported. We recommend you read all [known issues](./msq-known-issues.md) and test the feature in a development environment before rolling it out in production. Using the multi-stage query task engine with `SELECT` statements that do not write to a datasource is experimental.
By default, the multi-stage query task engine (MSQ task engine) uses the local storage of a node to store data from intermediate steps when executing a query. Although this method provides better speed when executing a query, the data is lost if the node encounters an issue. When you enable durable storage, intermediate data is stored in Amazon S3 instead. Using this feature can improve the reliability of queries that use more than 20 workers. In essence, you trade some performance for better reliability. This is especially useful for long-running queries.
Where does the "20 workers" language come from?
3. A row signature, as a JSON-encoded array of column descriptors. Each column descriptor must have a `name` and a `type`. The type can be `string`, `long`, `double`, or `float`. This row signature is used to map the external data into the SQL layer.
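For illustration, a row signature with one column of each description style might look like the following (the column names are hypothetical):

```json
[
  {"name": "timestamp", "type": "string"},
  {"name": "user_id", "type": "long"},
  {"name": "score", "type": "double"}
]
```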
### INSERT
I think somewhere here we should call out a note about how the MSQ INSERT syntax (REPLACE also, but this has more impact for INSERT) deviates from the SQL standard in that the columns are mapped by name, not positionally.
Maybe something like:
Please note that unlike standard SQL, data is inserted according to column name, not position. This means it is important for the output column names of subsequent inserts to match the table's column names, rather than simply relying on their positions within the SELECT clause.
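A quick sketch of the by-name behavior described in that comment; the table, source, and column names are hypothetical:

```sql
-- Target datasource columns: __time, city, sessions.
-- MSQ matches columns by NAME, not position, so the aliases below
-- determine the mapping; the order within SELECT is irrelevant.
INSERT INTO "site_metrics"
SELECT
  "num_sessions"   AS "sessions",  -- matched to "sessions" by name
  "location"       AS "city",      -- matched to "city" by name
  TIME_PARSE("ts") AS __time
FROM "raw_site_data"
PARTITIONED BY DAY
```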
|Field|Description|
|-----|-----------|
| taskId | Controller task ID. You can use Druid's standard [task APIs](../operations/api-reference.md#overlord) to interact with this controller task.|
Is there a strange indent here, or is the GitHub UI buggy?
Currently, the MSQ task engine ignores the provided values of `resultFormat`, `header`, `typesHeader`, and `sqlTypesHeader`. SQL SELECT queries always behave as if `resultFormat` is an array, `header` is true, `typesHeader` is true, and `sqlTypesHeader` is true.
I would change
SQL SELECT queries always behave as if `resultFormat` is an array, `header` is
true, `typesHeader` is true, and `sqlTypesHeader` is true.
To
SQL SELECT queries write out their results into the task report (in the `multiStageQuery.payload.results.results` key) formatted as if `resultFormat` is an `array`.
finish sentence
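For context, a rough sketch of where those results would land in the task report, inferred only from the `multiStageQuery.payload.results.results` key path mentioned above; the surrounding fields and row values are illustrative, not a definitive schema:

```json
{
  "multiStageQuery": {
    "payload": {
      "results": {
        "results": [
          ["2016-06-27T00:00:00.000Z", "en", 123]
        ]
      }
    }
  }
}
```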
To interact with a query through the Overlord API, you need the following permissions:
- INSERT or REPLACE queries: You must have READ DATASOURCE permission on the output datasource.
- SELECT queries: You must have read permissions on the `__query_select` datasource, which is a stub datasource that gets created.
This should read ... that does not get created.
| "ingestion/tasks", | ||
| "ingestion/faq" | ||
| ], | ||
| "SQL-based ingestion": [ |
This needs two spaces of leading indent.
@317brian - You need to add the copyright header in text files. |
* msq: add multi-stage-query docs
* add screenshots; add back theta sketches tutorial; change filename; fix filename; fix link; fix headings
* fixes
* fixes
* fix spelling issues and update spell file
* address feedback from karan
* add missing guardrail to known issues
* update blurb
* fix typo
* remove durable storage info
* update titles
* Restore en.json
* Update query view
* address comments from vad
* Update docs/multi-stage-query/msq-known-issues.md: finish sentence
* add apache license to docs
* add apache license to docs

Co-authored-by: Katya Macedo <katya.macedo@imply.io>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>

This PR adds all the documentation updates for the multi-stage query architecture and the MSQ task engine.
Note that the screenshot on the Druid console page is outdated and will get updated in a subsequent PR.
This PR has: