Add export capabilities to MSQ with SQL syntax #15689
Conversation
if (exportTask.getState().isFailure()) {
  Assert.fail(StringUtils.format(
      "Unable to start the task successfully.\nPossible exception: %s",
      exportTask.getError()
Check notice — Code scanning / CodeQL: Use of default toString()
vogievetsky
left a comment
Thank you for incorporating my feedback
317brian
left a comment
I made some copyedits to the docs as suggestions. They can either be merged as part of this PR, or I can open a followup PR with the changes.
This variation of EXTERN requires one argument, the details of the destination as specified below.
This variation additionally requires an `AS` clause to specify the format of the exported rows.
Suggested change:
- This variation of EXTERN requires one argument, the details of the destination as specified below.
- This variation additionally requires an `AS` clause to specify the format of the exported rows.
+ This variation of EXTERN has two required parts: an argument that details the destination and an `AS` clause to specify the format of the exported rows.
The AS clause would not be an argument to EXTERN; it's present elsewhere in the query. Would it be confusing to call it an argument?
How about the change I just made?
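For context, a minimal sketch of the shape being discussed, assuming the S3 destination function and its `bucket`/`prefix` parameters described in this PR (the bucket and prefix values here are hypothetical). The destination spec is the single argument to EXTERN, while the output format lives in the separate `AS` clause:

```sql
-- The destination spec is EXTERN's argument...
INSERT INTO EXTERN(S3(bucket => 'my-bucket', prefix => 'exports/'))
AS CSV  -- ...while the format is a separate AS clause
SELECT * FROM wikipedia
```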
cryptoe
left a comment
Left some comments around removal of dead code and error messages.
} else {
  throw new ISE("Unsupported destination [%s]", querySpec.getDestination());
  shuffleSpecFactory = querySpec.getDestination()
      .getShuffleSpecFactory(MultiStageQueryContext.getRowsPerPage(querySpec.getQuery().context()));
Thanks for the refactor. It's much cleaner now.
We should add a comment saying all select partitions are controlled by the context value `rowsPerPage`.
Do you mean a comment everywhere the function is being called? We don't pass the whole context to getShuffleSpecFactory(), just the integer, so would this need to be specifically mentioned somewhere?
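For reference, the `rowsPerPage` value being discussed is read from the query context. A sketch of how a caller would set it in a SQL query payload (the surrounding request shape is the standard Druid SQL API body; the value 100000 is just an illustrative choice):

```json
{
  "query": "SELECT * FROM my_table",
  "context": {
    "rowsPerPage": 100000
  }
}
```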
if (Intervals.ONLY_ETERNITY.equals(exportMSQDestination.getReplaceTimeChunks())) {
  StorageConnector storageConnector = storageConnectorProvider.get();
  try {
    storageConnector.deleteRecursively("");
Also, code-flow-wise, I think make query definition may not be the correct place to delete the files.
Maybe it can be done after we create the query definition object (clear files if needed).
      return;
    }
  }
  throw DruidException.forPersona(DruidException.Persona.USER)
This is a user-facing error. Please mention that the user should reach out to the cluster admin for the export paths. The paths are controlled via the xxx property.
Should the error be addressed to `Persona.ADMIN` then?
The error is more likely due to a user error in specifying the path than a permission issue, so keeping the persona as USER makes sense.
* Add test
* Parser changes to support export statements
* Fix builds
* Address comments
* Add frame processor
* Address review comments
* Fix builds
* Update syntax
* Webconsole workaround
* Refactor
* Refactor
* Change export file path
* Update docs
* Remove webconsole changes
* Fix spelling mistake
* Parser changes, add tests
* Parser changes, resolve build warnings
* Fix failing test
* Fix failing test
* Fix IT tests
* Add tests
* Cleanup
* Fix unparse
* Fix forbidden API
* Update docs
* Update docs
* Address review comments
* Address review comments
* Fix tests
* Address review comments
* Fix insert unparse
* Add external write resource action
* Fix tests
* Add resource check to overlord resource
* Fix tests
* Add IT
* Update syntax
* Update tests
* Update permission
* Address review comments
* Address review comments
* Address review comments
* Add tests
* Add check for runtime parameter for bucket and path
* Add check for runtime parameter for bucket and path
* Add tests
* Update docs
* Fix NPE
* Update docs, remove deadcode
* Fix formatting
Support for exporting MSQ results to a GCS bucket. This essentially copies the logic of the S3 export for GCS, originally done by @adarshsanjeev in #15689.
Problem
Druid currently does not allow export of tables in a programmatic manner. While it is possible to download results from a SELECT query, this relies on writing the results to a single query report, which cannot support large datasets. An export syntax which writes the results in a desired format directly to an external location (such as S3 or HDFS) would be useful.
For example, a statement to export all rows from a table into S3 as CSV files would look like:
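A hedged sketch of such a statement, assuming the S3 destination function and its `bucket`/`prefix` parameters introduced by this PR (the bucket, prefix, and table names are hypothetical):

```sql
INSERT INTO
  EXTERN(S3(bucket => 'my-bucket', prefix => 'table-export'))
AS CSV
SELECT *
FROM my_table
```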
Initially, only CSV is supported as an export format, but this can be expanded to support other formats easily.
Release note
Key changed/added classes in this PR
- `sql/src/main/codegen/includes/common.ftl`
- `sql/src/main/codegen/includes/replace.ftl`
- `IngestHandler`

This PR has: