Add export capabilities to MSQ with SQL syntax #15689
Conversation
if (exportTask.getState().isFailure()) {
  Assert.fail(StringUtils.format(
      "Unable to start the task successfully.\nPossible exception: %s",
      exportTask.getError()
Check notice — Code scanning / CodeQL: Use of default toString()
vogievetsky
left a comment
Thank you for incorporating my feedback
317brian
left a comment
I made some copyedits to the docs as suggestions. They can either be merged as part of this PR, or I can open a followup PR with the changes.
This variation of EXTERN requires one argument, the details of the destination as specified below.
This variation additionally requires an `AS` clause to specify the format of the exported rows.
Suggested change:
- This variation of EXTERN requires one argument, the details of the destination as specified below.
- This variation additionally requires an `AS` clause to specify the format of the exported rows.
+ This variation of EXTERN has two required parts: an argument that details the destination and an `AS` clause to specify the format of the exported rows.
The AS clause would not be an argument to EXTERN; it's present elsewhere in the query. Would it be confusing to call it an argument?
How about the change I just made?
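For context, a minimal sketch of the shape being discussed, assuming the S3 destination function and its `bucket`/`prefix` parameters described in this PR (the bucket and prefix values here are hypothetical). The destination spec is the single argument to EXTERN, while the output format lives in the separate `AS` clause:

```sql
-- The destination spec is EXTERN's argument...
INSERT INTO EXTERN(S3(bucket => 'my-bucket', prefix => 'exports/'))
AS CSV  -- ...while the format is a separate AS clause
SELECT * FROM wikipedia
```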
cryptoe
left a comment
Left some comments around removal of dead code and error messages.
} else {
  throw new ISE("Unsupported destination [%s]", querySpec.getDestination());
  shuffleSpecFactory = querySpec.getDestination()
      .getShuffleSpecFactory(MultiStageQueryContext.getRowsPerPage(querySpec.getQuery().context()));
Thanks for the refactor. It's much cleaner now.
We should add a comment saying all select partitions are controlled by the context value `rowsPerPage`.
Do you mean a comment everywhere the function is being called? We don't pass the whole context to getShuffleSpecFactory(), just the integer, so would this need to be specifically mentioned somewhere?
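For reference, the `rowsPerPage` value being discussed is read from the query context. A sketch of how a caller would set it in a SQL query payload (the surrounding request shape is the standard Druid SQL API body; the value 100000 is just an illustrative choice):

```json
{
  "query": "SELECT * FROM my_table",
  "context": {
    "rowsPerPage": 100000
  }
}
```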
if (Intervals.ONLY_ETERNITY.equals(exportMSQDestination.getReplaceTimeChunks())) {
  StorageConnector storageConnector = storageConnectorProvider.get();
  try {
    storageConnector.deleteRecursively("");
Also, code-flow-wise, I think make query definition may not be the correct place to delete the files.
Maybe it can be done after we create the query definition object (clear files if needed).
      return;
    }
  }
  throw DruidException.forPersona(DruidException.Persona.USER)
This is a user-facing error. Please mention that the user should reach out to the cluster admin for the export paths. The paths are controlled via the xxx property.
Should the error be addressed to `Persona.ADMIN` then?
The error is more likely due to a user error in specifying the path than a permission issue, so keeping the persona as USER makes sense.
* Add test
* Parser changes to support export statements
* Fix builds
* Address comments
* Add frame processor
* Address review comments
* Fix builds
* Update syntax
* Webconsole workaround
* Refactor
* Refactor
* Change export file path
* Update docs
* Remove webconsole changes
* Fix spelling mistake
* Parser changes, add tests
* Parser changes, resolve build warnings
* Fix failing test
* Fix failing test
* Fix IT tests
* Add tests
* Cleanup
* Fix unparse
* Fix forbidden API
* Update docs
* Update docs
* Address review comments
* Address review comments
* Fix tests
* Address review comments
* Fix insert unparse
* Add external write resource action
* Fix tests
* Add resource check to overlord resource
* Fix tests
* Add IT
* Update syntax
* Update tests
* Update permission
* Address review comments
* Address review comments
* Address review comments
* Add tests
* Add check for runtime parameter for bucket and path
* Add check for runtime parameter for bucket and path
* Add tests
* Update docs
* Fix NPE
* Update docs, remove deadcode
* Fix formatting
Support for exporting MSQ results to a GCS bucket. This essentially copies the logic of the S3 export for GCS, originally done by @adarshsanjeev in #15689.
Problem
Druid currently does not allow export of tables in a programmatic manner. While it is possible to download results from a SELECT query, this relies on writing the results to a single query report, which cannot support large datasets. An export syntax which writes the results in a desired format directly to an external location (such as S3 or HDFS) would be useful.
For example, a statement to export all rows from a table into S3 as CSV files would look like:
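A hedged sketch of such a statement, assuming the S3 destination function and its `bucket`/`prefix` parameters introduced by this PR (the bucket, prefix, and table names are hypothetical):

```sql
INSERT INTO
  EXTERN(S3(bucket => 'my-bucket', prefix => 'table-export'))
AS CSV
SELECT *
FROM my_table
```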
Initially, only CSV is supported as an export format, but this can be expanded to support other formats easily.
Release note
Key changed/added classes in this PR
- `sql/src/main/codegen/includes/common.ftl`
- `sql/src/main/codegen/includes/replace.ftl`
- `IngestHandler`

This PR has: