Add reasoning for choosing shardSpec to the MSQ report #16175
cryptoe merged 7 commits into apache:master
  {
    if (mayHaveMultiValuedClusterByFields) {
      // DimensionRangeShardSpec cannot handle multi-valued fields.
      return Pair.of(Collections.emptyList(), "Cannot use RangeShardSpec, the fields in the CLUSTER BY clause contains a multivalues. Using NumberedShardSpec instead.");
nit: grammar
Also, if it's possible to pinpoint the multi-value fields without much refactoring, we can mention them here.
- return Pair.of(Collections.emptyList(), "Cannot use RangeShardSpec, the fields in the CLUSTER BY clause contains a multivalues. Using NumberedShardSpec instead.");
+ return Pair.of(Collections.emptyList(), "Cannot use RangeShardSpec, the fields in the CLUSTERED BY clause contains multivalues in column [%s]. Using NumberedShardSpec instead.");
I don't think we have the column name at this point; we only store a boolean, mayContainMultivalues. Updated the message a bit.
  // DimensionRangeShardSpec only handles columns that appear as-is in the output.
  if (outputColumns.isEmpty()) {
- return Collections.emptyList();
+ return Pair.of(Collections.emptyList(), "Cannot use RangeShardSpec, RangeShardSpec only supports columns that appear as-is in the output. Using NumberedShardSpec instead.");
Changed the message to "Could not find output column name for column [%s]" to include the column name. I'm not sure what conditions would cause the output column to not be found here.
  if (numShardColumns == 0) {
- return Collections.emptyList();
+ return Pair.of(Collections.emptyList(), "Cannot use RangeShardSpec, as there are no shardColumns. Using NumberedShardSpec instead.");
What happens if the user doesn't supply a CLUSTERED BY clause? In that case, the reason doesn't seem necessary, or it can be reworded.
- return Pair.of(Collections.emptyList(), "Cannot use RangeShardSpec, as there are no shardColumns. Using NumberedShardSpec instead.");
+ return Pair.of(Collections.emptyList(), "Using NumberedShardSpec as no columns are supplied in the 'CLUSTERED BY' clause.");
I missed it before, but we should also add MSQTests where these cases are getting tripped, and assert the reason in the report.
cc @vogievetsky for the web console changes.
MSQ chooses the shard spec based on certain criteria. However, these criteria are not very transparent to the user. The only way to find out which shard spec was chosen is to look up a segment in the segment UI after the ingestion has finished.
This PR logs the shard spec type chosen and the reason for the choice. It also adds both to the query report, to be displayed in the UI.
This PR adds a new section to the reports, segmentReport. It contains the segment type created if the query is an ingestion, and is null otherwise. The shardSpec field gives the type of shard spec generated. MSQ prefers to use a range shard spec when possible. For INSERT and REPLACE queries, the default shard spec is NumberedShardSpec and DimensionRangeShardSpec, respectively. If a range shard spec cannot be chosen for a REPLACE query, the details field contains the reason why it could not be used.
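The selection and the resulting report entry can be pictured with a minimal, self-contained sketch. It is not the code in this PR: the class `ShardSpecReasonSketch`, the `SegmentReport` holder, the `choose` method and its parameters are hypothetical names, and the INSERT and success messages are assumed wording. Only the order of the checks and the two fallback messages are modeled on the snippets discussed in the review.

```java
import java.util.Collections;
import java.util.List;

public class ShardSpecReasonSketch {

    /** Hypothetical stand-in for a segmentReport entry: shard spec type plus the reason. */
    public static final class SegmentReport {
        public final String shardSpec;
        public final String details;

        public SegmentReport(String shardSpec, String details) {
            this.shardSpec = shardSpec;
            this.details = details;
        }
    }

    /** Picks a shard spec type and records why; a simplified model, not the real MSQ logic. */
    public static SegmentReport choose(
            boolean isReplace,
            List<String> clusterByColumns,
            boolean mayHaveMultiValuedClusterByFields) {
        if (!isReplace) {
            // Assumed wording: the PR description only says INSERT defaults to NumberedShardSpec.
            return new SegmentReport("NumberedShardSpec", "Default shard spec for INSERT queries.");
        }
        if (mayHaveMultiValuedClusterByFields) {
            // A range shard spec cannot handle multi-valued fields.
            return new SegmentReport(
                "NumberedShardSpec",
                "Cannot use RangeShardSpec, the CLUSTERED BY fields may contain multi-valued columns. "
                    + "Using NumberedShardSpec instead.");
        }
        if (clusterByColumns.isEmpty()) {
            return new SegmentReport(
                "NumberedShardSpec",
                "Using NumberedShardSpec as no columns are supplied in the 'CLUSTERED BY' clause.");
        }
        // Assumed wording for the case where the range shard spec can be used.
        return new SegmentReport(
            "DimensionRangeShardSpec",
            "Using RangeShardSpec on columns " + clusterByColumns + ".");
    }

    public static void main(String[] args) {
        SegmentReport report = choose(true, Collections.emptyList(), false);
        System.out.println(report.shardSpec + ": " + report.details);
    }
}
```

Returning the reason together with the choice, rather than logging it separately, mirrors the `Pair.of(...)` change in the diff above and makes the reason easy to assert on in tests.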
This PR has: