From fc582b873886f2de27ac4ad547c9c78ec79ffc76 Mon Sep 17 00:00:00 2001
From: Steve Hetland
Date: Thu, 22 Oct 2020 16:47:59 -0700
Subject: [PATCH 1/5] cleaning up and fixing links

---
 docs/configuration/index.md                   | 44 +++----
 docs/configuration/logging.md                 |  2 +-
 docs/dependencies/metadata-storage.md         |  2 +-
 docs/design/architecture.md                   | 12 +-
 docs/design/broker.md                         |  4 +-
 docs/design/coordinator.md                    | 12 +-
 docs/design/historical.md                     |  4 +-
 docs/design/index.md                          |  4 +-
 docs/design/indexer.md                        |  8 +-
 docs/design/indexing-service.md               |  2 +-
 docs/design/middlemanager.md                  |  4 +-
 docs/design/overlord.md                       |  6 +-
 docs/design/peons.md                          |  4 +-
 docs/design/processes.md                      |  2 +-
 docs/design/router.md                         |  4 +-
 .../extensions-contrib/materialized-view.md   |  2 +-
 .../moving-average-query.md                   |  4 +-
 .../extensions-contrib/redis-cache.md         |  6 +-
 docs/development/extensions-contrib/statsd.md |  2 +-
 .../extensions-core/bloom-filter.md           |  2 +-
 .../extensions-core/datasketches-extension.md |  8 +-
 .../extensions-core/datasketches-quantiles.md |  4 +-
 .../extensions-core/datasketches-theta.md     |  2 +-
 .../extensions-core/datasketches-tuple.md     |  6 +-
 .../extensions-core/druid-basic-security.md   |  4 +-
 .../extensions-core/druid-pac4j.md            |  2 +-
 .../extensions-core/druid-ranger-security.md  |  4 +-
 docs/development/extensions-core/hdfs.md      |  6 +-
 .../extensions-core/kafka-ingestion.md        |  6 +-
 .../extensions-core/kinesis-ingestion.md      |  6 +-
 .../extensions-core/lookups-cached-global.md  |  4 +-
 docs/development/extensions-core/s3.md        | 10 +-
 .../simple-client-sslcontext.md               |  6 +-
 docs/development/javascript.md                | 16 +--
 docs/development/overview.md                  |  2 +-
 docs/ingestion/data-formats.md                |  8 +-
 docs/ingestion/data-management.md             | 12 +-
 docs/ingestion/hadoop.md                      | 12 +-
 docs/ingestion/index.md                       | 30 ++---
 docs/ingestion/native-batch.md                |  4 +-
 docs/ingestion/schema-design.md               | 12 +-
 docs/ingestion/standalone-realtime.md         |  4 +-
 docs/ingestion/tasks.md                       |  8 +-
 docs/misc/math-expr.md                        |  8 +-
 docs/operations/api-reference.md              | 20 ++--
 docs/operations/basic-cluster-tuning.md       |  2 +-
 docs/operations/deep-storage-migration.md     |  6 +-
 docs/operations/druid-console.md              |  6 +-
 docs/operations/high-availability.md          |  4 +-
 docs/operations/metadata-migration.md         |  2 +-
 docs/operations/metrics.md                    |  6 +-
 docs/operations/other-hadoop.md               |  2 +-
 docs/operations/segment-optimization.md       |  8 +-
 docs/operations/single-server.md              |  2 +-
 docs/operations/tls-support.md                |  6 +-
 docs/querying/aggregations.md                 |  2 +-
 docs/querying/caching.md                      | 16 +--
 docs/querying/datasource.md                   | 24 ++--
 docs/querying/dimensionspecs.md               | 10 +-
 docs/querying/filters.md                      |  6 +-
 docs/querying/groupbyquery.md                 | 16 +--
 docs/querying/having.md                       |  2 +-
 docs/querying/lookups.md                      |  4 +-
 docs/querying/multi-value-dimensions.md       | 16 +--
 docs/querying/multitenancy.md                 |  2 +-
 docs/querying/query-context.md                | 16 +--
 docs/querying/querying.md                     |  3 +-
 docs/querying/sorting-orders.md               |  2 +-
 docs/querying/sql.md                          | 108 +++++++++---------
 docs/querying/topnquery.md                    |  4 +-
 docs/tutorials/cluster.md                     |  4 +-
 docs/tutorials/tutorial-batch-hadoop.md       |  4 +-
 docs/tutorials/tutorial-compaction.md         |  2 +-
 docs/tutorials/tutorial-delete-data.md        |  4 +-
 docs/tutorials/tutorial-ingestion-spec.md     |  2 +-
 docs/tutorials/tutorial-kafka.md              |  4 +-
 docs/tutorials/tutorial-retention.md          |  2 +-
 docs/tutorials/tutorial-rollup.md             |  2 +-
 docs/tutorials/tutorial-transform-spec.md     |  2 +-
 docs/tutorials/tutorial-update-data.md        |  2 +-
 80 files changed, 307 insertions(+), 330 deletions(-)

diff --git a/docs/configuration/index.md b/docs/configuration/index.md
index 756debda0022..5893d3401f51 100644
--- a/docs/configuration/index.md
+++ b/docs/configuration/index.md
@@ -69,7 +69,7 @@ The properties under this section are common configurations that should be share
 
 There are four JVM parameters that we set on all of our processes:
 
-1. `-Duser.timezone=UTC` This sets the default timezone of the JVM to UTC.
We always set this and do not test with other default timezones, so local timezones might work, but they also might uncover weird and interesting bugs. To issue queries in a non-UTC timezone, see [query granularities](../querying/granularities.html#period-granularities) +1. `-Duser.timezone=UTC` This sets the default timezone of the JVM to UTC. We always set this and do not test with other default timezones, so local timezones might work, but they also might uncover weird and interesting bugs. To issue queries in a non-UTC timezone, see [query granularities](../querying/granularities.md#period-granularities) 2. `-Dfile.encoding=UTF-8` This is similar to timezone, we test assuming UTF-8. Local encodings might work, but they also might result in weird and interesting bugs. 3. `-Djava.io.tmpdir=` Various parts of the system that interact with the file system do it via temporary files, and these files can get somewhat large. Many production systems are set up to have small (but fast) `/tmp` directories, which can be problematic with Druid so we recommend pointing the JVM’s tmp directory to something with a little more meat. This directory should not be volatile tmpfs. This directory should also have good read and write speed and hence NFS mount should strongly be avoided. 4. `-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager` This allows log4j2 to handle logs for non-log4j2 components (like jetty) which use standard java logging. @@ -177,9 +177,9 @@ and `druid.tlsPort` properties on each process. Please see `Configuration` secti #### Jetty Server TLS Configuration Druid uses Jetty as an embedded web server. To get familiar with TLS/SSL in general and related concepts like Certificates etc. -reading this [Jetty documentation](http://www.eclipse.org/jetty/documentation/9.4.x/configuring-ssl.html) might be helpful. +reading this [Jetty documentation](http://www.eclipse.org/jetty/documentation/9.4.32.v20200930/configuring-ssl.html) might be helpful. 
To get more in depth knowledge of TLS/SSL support in Java in general, please refer to this [guide](http://docs.oracle.com/javase/8/docs/technotes/guides/security/jsse/JSSERefGuide.html). -The documentation [here](http://www.eclipse.org/jetty/documentation/9.4.x/configuring-ssl.html#configuring-sslcontextfactory) +The documentation [here](http://www.eclipse.org/jetty/documentation/9.4.32.v20200930/configuring-ssl.html#configuring-sslcontextfactory) can help in understanding TLS/SSL configurations listed below. This [document](http://docs.oracle.com/javase/8/docs/technotes/guides/security/StandardNames.html) lists all the possible values for the below mentioned configs among others provided by Java implementation. @@ -312,7 +312,7 @@ For native query, only request logs where query/time is above the threshold are |--------|-----------|-------| |`druid.request.logging.queryTimeThresholdMs`|Threshold value for query/time in milliseconds.|0, i.e., no filtering| |`druid.request.logging.sqlQueryTimeThresholdMs`|Threshold value for sqlQuery/time in milliseconds.|0, i.e., no filtering| -|`druid.request.logging.mutedQueryTypes` | Query requests of these types are not logged. Query types are defined as string objects corresponding to the "queryType" value for the specified query in the Druid's [native JSON query API](http://druid.apache.org/docs/latest/querying/querying.html). Misspelled query types will be ignored. Example to ignore scan and timeBoundary queries: ["scan", "timeBoundary"]| []| +|`druid.request.logging.mutedQueryTypes` | Query requests of these types are not logged. Query types are defined as string objects corresponding to the "queryType" value for the specified query in the Druid's [native JSON query API](http://druid.apache.org/docs/latest/querying/querying). Misspelled query types will be ignored. 
Example to ignore scan and timeBoundary queries: ["scan", "timeBoundary"]| []| |`druid.request.logging.delegate.type`|Type of delegate request logger to log requests.|none| #### Composite Request Logging @@ -398,7 +398,7 @@ The Druid servers [emit various metrics](../operations/metrics.md) and alerts vi #### Http Emitter Module TLS Overrides When emitting events to a TLS-enabled receiver, the Http Emitter will by default use an SSLContext obtained via the -process described at [Druid's internal communication over TLS](../operations/tls-support.html), i.e., the same +process described at [Druid's internal communication over TLS](../operations/tls-support.md), i.e., the same SSLContext that would be used for internal communications between Druid processes. In some use cases it may be desirable to have the Http Emitter use its own separate truststore configuration. For example, there may be organizational policies that prevent the TLS-enabled metrics receiver's certificate from being added to the same truststore used by Druid's internal HTTP client. @@ -489,10 +489,10 @@ The below table shows some important configurations for S3. See [S3 Deep Storage |--------|-----------|-------| |`druid.storage.bucket`|S3 bucket name.|none| |`druid.storage.baseKey`|S3 object key prefix for storage.|none| -|`druid.storage.disableAcl`|Boolean flag for ACL. If this is set to `false`, the full control would be granted to the bucket owner. This may require to set additional permissions. See [S3 permissions settings](../development/extensions-core/s3.html#s3-permissions-settings).|false| +|`druid.storage.disableAcl`|Boolean flag for ACL. If this is set to `false`, the full control would be granted to the bucket owner. This may require to set additional permissions. 
See [S3 permissions settings](../development/extensions-core/s3.md#s3-permissions-settings).|false| |`druid.storage.archiveBucket`|S3 bucket name for archiving when running the *archive task*.|none| |`druid.storage.archiveBaseKey`|S3 object key prefix for archiving.|none| -|`druid.storage.sse.type`|Server-side encryption type. Should be one of `s3`, `kms`, and `custom`. See the below [Server-side encryption section](../development/extensions-core/s3.html#server-side-encryption) for more details.|None| +|`druid.storage.sse.type`|Server-side encryption type. Should be one of `s3`, `kms`, and `custom`. See the below [Server-side encryption section](../development/extensions-core/s3.md#server-side-encryption) for more details.|None| |`druid.storage.sse.kms.keyId`|AWS KMS key ID. This is used only when `druid.storage.sse.type` is `kms` and can be empty to use the default key ID.|None| |`druid.storage.sse.custom.base64EncodedKey`|Base64-encoded key. Should be specified if `druid.storage.sse.type` is `custom`.|None| |`druid.storage.useS3aSchema`|If true, use the "s3a" filesystem when using Hadoop-based ingestion. If false, the "s3n" filesystem will be used. Only affects Hadoop-based ingestion.|false| @@ -660,7 +660,7 @@ All Druid components can communicate with each other over HTTP. ## Master Server -This section contains the configuration options for the processes that reside on Master servers (Coordinators and Overlords) in the suggested [three-server configuration](../design/processes.html#server-types). +This section contains the configuration options for the processes that reside on Master servers (Coordinators and Overlords) in the suggested [three-server configuration](../design/processes.md#server-types). 
### Coordinator @@ -806,7 +806,7 @@ These configuration options control the behavior of the Lookup dynamic configura ##### Compaction Dynamic Configuration Compaction configurations can also be set or updated dynamically using -[Coordinator's API](../operations/api-reference.html#compaction-configuration) without restarting Coordinators. +[Coordinator's API](../operations/api-reference.md#compaction-configuration) without restarting Coordinators. For details about segment compaction, please check [Segment Size Optimization](../operations/segment-optimization.md). @@ -815,12 +815,12 @@ A description of the compaction config is: |Property|Description|Required| |--------|-----------|--------| |`dataSource`|dataSource name to be compacted.|yes| -|`taskPriority`|[Priority](../ingestion/tasks.html#priority) of compaction task.|no (default = 25)| +|`taskPriority`|[Priority](../ingestion/tasks.md#priority) of compaction task.|no (default = 25)| |`inputSegmentSizeBytes`|Maximum number of total segment bytes processed per compaction task. Since a time chunk must be processed in its entirety, if the segments for a particular time chunk have a total size in bytes greater than this parameter, compaction will not run for that time chunk. Because each compaction task runs with a single thread, setting this value too far above 1–2GB will result in compaction tasks taking an excessive amount of time.|no (default = 419430400)| |`maxRowsPerSegment`|Max number of rows per segment after compaction.|no| |`skipOffsetFromLatest`|The offset for searching segments to be compacted. Strongly recommended to set for realtime dataSources. |no (default = "P1D")| |`tuningConfig`|Tuning config for compaction tasks. 
See below [Compaction Task TuningConfig](#compaction-tuningconfig).|no| -|`taskContext`|[Task context](../ingestion/tasks.html#context) for compaction tasks.|no| +|`taskContext`|[Task context](../ingestion/tasks.md#context) for compaction tasks.|no| An example of compaction config is: @@ -893,7 +893,7 @@ These Overlord static configurations can be defined in the `overlord/runtime.pro |`druid.indexer.queue.restartDelay`|Sleep this long when Overlord queue management throws an exception before trying again.|PT30S| |`druid.indexer.queue.storageSyncRate`|Sync Overlord state this often with an underlying task persistence mechanism.|PT1M| -The following configs only apply if the Overlord is running in remote mode. For a description of local vs. remote mode, please see (../design/overlord.html). +The following configs only apply if the Overlord is running in remote mode. For a description of local vs. remote mode, please see the [Overlord documentation](../design/overlord.md). |Property|Description|Default| |--------|-----------|-------| @@ -1153,7 +1153,7 @@ For GCE's properties, please refer to the [gce-extensions](../development/extens ## Data Server -This section contains the configuration options for the processes that reside on Data servers (MiddleManagers/Peons and Historicals) in the suggested [three-server configuration](../design/processes.html#server-types). +This section contains the configuration options for the processes that reside on Data servers (MiddleManagers/Peons and Historicals) in the suggested [three-server configuration](../design/processes.md#server-types). Configuration options for the experimental [Indexer process](../design/indexer.md) are also provided here. @@ -1323,14 +1323,14 @@ Druid uses Jetty to serve HTTP requests. |Property|Description|Default| |--------|-----------|-------| -|`druid.server.http.numThreads`|Number of threads for HTTP requests.
Please see the [Indexer Server HTTP threads](../design/indexer.html#server-http-threads) documentation for more details on how the Indexer uses this configuration.|max(10, (Number of cores * 17) / 16 + 2) + 30| +|`druid.server.http.numThreads`|Number of threads for HTTP requests. Please see the [Indexer Server HTTP threads](../design/indexer.md#server-http-threads) documentation for more details on how the Indexer uses this configuration.|max(10, (Number of cores * 17) / 16 + 2) + 30| |`druid.server.http.queueSize`|Size of the worker queue used by Jetty server to temporarily store incoming client connections. If this value is set and a request is rejected by jetty because queue is full then client would observe request failure with TCP connection being closed immediately with a completely empty response from server.|Unbounded| |`druid.server.http.maxIdleTime`|The Jetty max idle time for a connection.|PT5M| |`druid.server.http.enableRequestLimit`|If enabled, no requests would be queued in jetty queue and "HTTP 429 Too Many Requests" error response would be sent. |false| |`druid.server.http.defaultQueryTimeout`|Query timeout in millis, beyond which unfinished queries will be cancelled|300000| |`druid.server.http.gracefulShutdownTimeout`|The maximum amount of time Jetty waits after receiving shutdown signal. After this timeout the threads will be forcefully shutdown. This allows any queries that are executing to complete.|`PT0S` (do not wait)| |`druid.server.http.unannouncePropagationDelay`|How long to wait for zookeeper unannouncements to propagate before shutting down Jetty. This is a minimum and `druid.server.http.gracefulShutdownTimeout` does not start counting down until after this period elapses.|`PT0S` (do not wait)| -|`druid.server.http.maxQueryTimeout`|Maximum allowed value (in milliseconds) for `timeout` parameter. See [query-context](../querying/query-context.html) to know more about `timeout`. 
Query is rejected if the query context `timeout` is greater than this value. |Long.MAX_VALUE| +|`druid.server.http.maxQueryTimeout`|Maximum allowed value (in milliseconds) for `timeout` parameter. See [query-context](../querying/query-context.md) to know more about `timeout`. Query is rejected if the query context `timeout` is greater than this value. |Long.MAX_VALUE| |`druid.server.http.maxRequestHeaderSize`|Maximum size of a request header in bytes. Larger headers consume more memory and can make a server more vulnerable to denial of service attacks.|8 * 1024| |`druid.server.http.enableForwardedRequestCustomizer`|If enabled, adds Jetty ForwardedRequestCustomizer which reads X-Forwarded-* request headers to manipulate servlet request object when Druid is used behind a proxy.|false| |`druid.server.http.allowedHttpMethods`|List of HTTP methods that should be allowed in addition to the ones required by Druid APIs. Druid APIs require GET, PUT, POST, and DELETE, which are always allowed. This option is not useful unless you have installed an extension that needs these additional HTTP methods or that adds functionality related to CORS. None of Druid's bundled extensions require these methods.|[]| @@ -1478,7 +1478,7 @@ See [cache configuration](#cache-configuration) for how to configure cache setti ## Query Server -This section contains the configuration options for the processes that reside on Query servers (Brokers) in the suggested [three-server configuration](../design/processes.html#server-types). +This section contains the configuration options for the processes that reside on Query servers (Brokers) in the suggested [three-server configuration](../design/processes.md#server-types). Configuration options for the experimental [Router process](../design/router.md) are also provided here. 
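For orientation, the Broker-side `druid.sql.planner.*` properties discussed in this patch can be combined in a Broker `runtime.properties` along these lines. This is a minimal sketch only; the values are illustrative, not recommendations:

```properties
# Illustrative Broker runtime.properties fragment (example values only)
druid.sql.enable=true
druid.sql.planner.useApproximateCountDistinct=true
druid.sql.planner.useApproximateTopN=true
druid.sql.planner.sqlTimeZone=America/Los_Angeles
druid.sql.planner.metadataRefreshPeriod=PT1M
```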
@@ -1649,7 +1649,7 @@ The Druid SQL server is configured through the following properties on the Broke |`druid.sql.planner.maxTopNLimit`|Maximum threshold for a [TopN query](../querying/topnquery.md). Higher limits will be planned as [GroupBy queries](../querying/groupbyquery.md) instead.|100000| |`druid.sql.planner.metadataRefreshPeriod`|Throttle for metadata refreshes.|PT1M| |`druid.sql.planner.useApproximateCountDistinct`|Whether to use an approximate cardinality algorithm for `COUNT(DISTINCT foo)`.|true| -|`druid.sql.planner.useApproximateTopN`|Whether to use approximate [TopN queries](../querying/topnquery.html) when a SQL query could be expressed as such. If false, exact [GroupBy queries](../querying/groupbyquery.html) will be used instead.|true| +|`druid.sql.planner.useApproximateTopN`|Whether to use approximate [TopN queries](../querying/topnquery.md) when a SQL query could be expressed as such. If false, exact [GroupBy queries](../querying/groupbyquery.md) will be used instead.|true| |`druid.sql.planner.requireTimeCondition`|Whether to require SQL to have filter conditions on __time column so that all generated native queries will have user specified intervals. If true, all queries without filter condition on __time column will fail|false| |`druid.sql.planner.sqlTimeZone`|Sets the default time zone for the server, which will affect how time functions and timestamp literals behave. Should be a time zone name like "America/Los_Angeles" or offset like "-08:00".|UTC| |`druid.sql.planner.metadataSegmentCacheEnable`|Whether to keep a cache of published segments in broker. If true, broker polls coordinator in background to get segments from metadata store and maintains a local cache. 
If false, coordinator's REST API will be invoked when broker needs published segments info.|false| @@ -1792,7 +1792,7 @@ This section describes configurations that control behavior of Druid's query typ ### Overriding default query context values -Any [Query Context General Parameter](../querying/query-context.html#general-parameters) default value can be +Any [Query Context General Parameter](../querying/query-context.md#general-parameters) default value can be overridden by setting runtime property in the format of `druid.query.default.context.{query_context_key}`. `druid.query.default.context.{query_context_key}` runtime property prefix applies to all current and future query context keys, the same as how query context parameter passed with the query works. Note that the runtime property @@ -1824,7 +1824,7 @@ context). If query does have `maxQueuedBytes` in the context, then that value is |Property|Description|Default| |--------|-----------|-------| -|`druid.query.topN.minTopNThreshold`|See [TopN Aliasing](../querying/topnquery.html#aliasing) for details.|1000| +|`druid.query.topN.minTopNThreshold`|See [TopN Aliasing](../querying/topnquery.md#aliasing) for details.|1000| ### Search query config @@ -1942,9 +1942,9 @@ Supported query contexts: |`druid.router.tierToBrokerMap`|Queries for a certain tier of data are routed to their appropriate Broker. This value should be an ordered JSON map of tiers to Broker names. The priority of Brokers is based on the ordering.|{"_default_tier": ""}| |`druid.router.defaultRule`|The default rule for all datasources.|"_default"| |`druid.router.pollPeriod`|How often to poll for new rules.|PT1M| -|`druid.router.strategies`|Please see [Router Strategies](../design/router.html#router-strategies) for details.|[{"type":"timeBoundary"},{"type":"priority"}]| -|`druid.router.avatica.balancer.type`|Class to use for balancing Avatica queries across Brokers. 
Please see [Avatica Query Balancing](../design/router.html#avatica-query-balancing).|rendezvousHash| -|`druid.router.managementProxy.enabled`|Enables the Router's [management proxy](../design/router.html#router-as-management-proxy) functionality.|false| +|`druid.router.strategies`|Please see [Router Strategies](../design/router.md#router-strategies) for details.|[{"type":"timeBoundary"},{"type":"priority"}]| +|`druid.router.avatica.balancer.type`|Class to use for balancing Avatica queries across Brokers. Please see [Avatica Query Balancing](../design/router.md#avatica-query-balancing).|rendezvousHash| +|`druid.router.managementProxy.enabled`|Enables the Router's [management proxy](../design/router.md#router-as-management-proxy) functionality.|false| |`druid.router.http.numConnections`|Size of connection pool for the Router to connect to Broker processes. If there are more queries than this number that all need to speak to the same process, then they will queue up.|`20`| |`druid.router.http.readTimeout`|The timeout for data reads from Broker processes.|`PT15M`| |`druid.router.http.numMaxThreads`|Maximum number of worker threads to handle HTTP requests and responses|`max(10, ((number of cores * 17) / 16 + 2) + 30)`| diff --git a/docs/configuration/logging.md b/docs/configuration/logging.md index 47d0e63750ff..3512df5b4b3c 100644 --- a/docs/configuration/logging.md +++ b/docs/configuration/logging.md @@ -23,7 +23,7 @@ title: "Logging" --> -Apache Druid processes will emit logs that are useful for debugging to the console. Druid processes also emit periodic metrics about their state. For more about metrics, see [Configuration](../configuration/index.html#enabling-metrics). Metric logs are printed to the console by default, and can be disabled with `-Ddruid.emitter.logging.logLevel=debug`. +Apache Druid processes will emit logs that are useful for debugging to the console. Druid processes also emit periodic metrics about their state. 
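Pulling together the JVM flags mentioned in this patch (the four common JVM parameters plus the metric-log level override above), a `jvm.config` could look like the following sketch. `var/tmp` is a placeholder tmp path, not a recommendation:

```properties
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Djava.io.tmpdir=var/tmp
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
-Ddruid.emitter.logging.logLevel=debug
```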
For more about metrics, see [Configuration](../configuration/index.md#enabling-metrics). Metric logs are printed to the console by default, and can be disabled with `-Ddruid.emitter.logging.logLevel=debug`. Druid uses [log4j2](http://logging.apache.org/log4j/2.x/) for logging. Logging can be configured with a log4j2.xml file. Add the path to the directory containing the log4j2.xml file (e.g. the _common/ dir) to your classpath if you want to override default Druid log configuration. Note that this directory should be earlier in the classpath than the druid jars. The easiest way to do this is to prefix the classpath with the config dir. diff --git a/docs/dependencies/metadata-storage.md b/docs/dependencies/metadata-storage.md index 51551a1df010..c253ed0e14f0 100644 --- a/docs/dependencies/metadata-storage.md +++ b/docs/dependencies/metadata-storage.md @@ -63,7 +63,7 @@ druid.metadata.storage.connector.dbcp.maxConnLifetimeMillis=1200000 druid.metadata.storage.connector.dbcp.defaultQueryTimeout=30000 ``` -See [BasicDataSource Configuration](https://commons.apache.org/proper/commons-dbcp/configuration.html) for full list. +See [BasicDataSource Configuration](https://commons.apache.org/proper/commons-dbcp/configuration.html) for the full list. ## Metadata storage tables diff --git a/docs/design/architecture.md b/docs/design/architecture.md index 910af84ecc7f..0711f39a8659 100644 --- a/docs/design/architecture.md +++ b/docs/design/architecture.md @@ -118,7 +118,7 @@ to the [metadata store](#metadata-storage). This entry is a self-describing bit things like the schema of the segment, its size, and its location on deep storage. These entries are what the Coordinator uses to know what data *should* be available on the cluster. -For details on the segment file format, please see [segment files](segments.html). +For details on the segment file format, please see [segment files](segments.md).
For details on modeling your data in Druid, see [schema design](../ingestion/schema-design.md). @@ -229,9 +229,9 @@ publish in an all-or-nothing manner: that has not yet been published can be rolled back if ingestion tasks fail. In this case, partially-ingested data is discarded, and Druid will resume ingestion from the last committed set of stream offsets. This ensures exactly-once publishing behavior. -- [Hadoop-based batch ingestion](../ingestion/hadoop.html). Each task publishes all segment metadata in a single +- [Hadoop-based batch ingestion](../ingestion/hadoop.md). Each task publishes all segment metadata in a single transaction. -- [Native batch ingestion](../ingestion/native-batch.html). In parallel mode, the supervisor task publishes all segment +- [Native batch ingestion](../ingestion/native-batch.md). In parallel mode, the supervisor task publishes all segment metadata in a single transaction after the subtasks are finished. In simple (single-task) mode, the single task publishes all segment metadata in a single transaction after it is complete. @@ -244,11 +244,11 @@ ingestion will not cause duplicate data to be ingested: - Supervised "seekable-stream" ingestion methods like [Kafka](../development/extensions-core/kafka-ingestion.md) and [Kinesis](../development/extensions-core/kinesis-ingestion.md) are idempotent due to the fact that stream offsets and segment metadata are stored together and updated in lock-step. -- [Hadoop-based batch ingestion](../ingestion/hadoop.html) is idempotent unless one of your input sources +- [Hadoop-based batch ingestion](../ingestion/hadoop.md) is idempotent unless one of your input sources is the same Druid datasource that you are ingesting into. In this case, running the same task twice is non-idempotent, because you are adding to existing data instead of overwriting it. 
-- [Native batch ingestion](../ingestion/native-batch.html) is idempotent unless -[`appendToExisting`](../ingestion/native-batch.html) is true, or one of your input sources is the same Druid datasource +- [Native batch ingestion](../ingestion/native-batch.md) is idempotent unless +[`appendToExisting`](../ingestion/native-batch.md) is true, or one of your input sources is the same Druid datasource that you are ingesting into. In either of these two cases, running the same task twice is non-idempotent, because you are adding to existing data instead of overwriting it. diff --git a/docs/design/broker.md b/docs/design/broker.md index 741dfc9a7cb7..cc0872c620de 100644 --- a/docs/design/broker.md +++ b/docs/design/broker.md @@ -25,11 +25,11 @@ title: "Broker" ### Configuration -For Apache Druid Broker Process Configuration, see [Broker Configuration](../configuration/index.html#broker). +For Apache Druid Broker Process Configuration, see [Broker Configuration](../configuration/index.md#broker). ### HTTP endpoints -For a list of API endpoints supported by the Broker, see [Broker API](../operations/api-reference.html#broker). +For a list of API endpoints supported by the Broker, see [Broker API](../operations/api-reference.md#broker). ### Overview diff --git a/docs/design/coordinator.md b/docs/design/coordinator.md index faba68bcb37d..d5769a935ce9 100644 --- a/docs/design/coordinator.md +++ b/docs/design/coordinator.md @@ -25,11 +25,11 @@ title: "Coordinator Process" ### Configuration -For Apache Druid Coordinator Process Configuration, see [Coordinator Configuration](../configuration/index.html#coordinator). +For Apache Druid Coordinator Process Configuration, see [Coordinator Configuration](../configuration/index.md#coordinator). ### HTTP endpoints -For a list of API endpoints supported by the Coordinator, see [Coordinator API](../operations/api-reference.html#coordinator). 
+For a list of API endpoints supported by the Coordinator, see [Coordinator API](../operations/api-reference.md#coordinator). ### Overview @@ -89,12 +89,12 @@ Once some segments are found, it issues a [compaction task](../ingestion/tasks.m The maximum number of running compaction tasks is `min(sum of worker capacity * slotRatio, maxSlots)`. Note that even though `min(sum of worker capacity * slotRatio, maxSlots)` = 0, at least one compaction task is always submitted if the compaction is enabled for a dataSource. -See [Compaction Configuration API](../operations/api-reference.html#compaction-configuration) and [Compaction Configuration](../configuration/index.html#compaction-dynamic-configuration) to enable the compaction. +See [Compaction Configuration API](../operations/api-reference.md#compaction-configuration) and [Compaction Configuration](../configuration/index.md#compaction-dynamic-configuration) to enable the compaction. Compaction tasks might fail due to the following reasons. - If the input segments of a compaction task are removed or overshadowed before it starts, that compaction task fails immediately. -- If a task of a higher priority acquires a [time chunk lock](../ingestion/tasks.html#locking) for an interval overlapping with the interval of a compaction task, the compaction task fails. +- If a task of a higher priority acquires a [time chunk lock](../ingestion/tasks.md#locking) for an interval overlapping with the interval of a compaction task, the compaction task fails. Once a compaction task fails, the Coordinator simply checks the segments in the interval of the failed task again, and issues another compaction task in the next run. @@ -127,7 +127,7 @@ If the coordinator has enough task slots for compaction, this policy will contin `bar_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION` and `bar_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION_1`. 
Finally, `foo_2017-09-01T00:00:00.000Z_2017-10-01T00:00:00.000Z_VERSION` will be picked up even though there is only one segment in the time chunk of `2017-09-01T00:00:00.000Z/2017-10-01T00:00:00.000Z`. -The search start point can be changed by setting [skipOffsetFromLatest](../configuration/index.html#compaction-dynamic-configuration). +The search start point can be changed by setting [skipOffsetFromLatest](../configuration/index.md#compaction-dynamic-configuration). If this is set, this policy will ignore the segments falling into the time chunk of (the end time of the most recent segment - `skipOffsetFromLatest`). This is to avoid conflicts between compaction tasks and realtime tasks. Note that realtime tasks have a higher priority than compaction tasks by default. Realtime tasks will revoke the locks of compaction tasks if their intervals overlap, resulting in the termination of the compaction task. @@ -138,7 +138,7 @@ Note that realtime tasks have a higher priority than compaction tasks by default ### The Coordinator console -The Druid Coordinator exposes a web GUI for displaying cluster information and rule configuration. For more details, please see [coordinator console](../operations/management-uis.html#coordinator-consoles). +The Druid Coordinator exposes a web GUI for displaying cluster information and rule configuration. For more details, please see [coordinator console](../operations/management-uis.md#coordinator-consoles). ### FAQ diff --git a/docs/design/historical.md b/docs/design/historical.md index 4a6768691fe4..159b31392277 100644 --- a/docs/design/historical.md +++ b/docs/design/historical.md @@ -25,11 +25,11 @@ title: "Historical Process" ### Configuration -For Apache Druid Historical Process Configuration, see [Historical Configuration](../configuration/index.html#historical). +For Apache Druid Historical Process Configuration, see [Historical Configuration](../configuration/index.md#historical). 
### HTTP endpoints -For a list of API endpoints supported by the Historical, please see the [API reference](../operations/api-reference.html#historical). +For a list of API endpoints supported by the Historical, please see the [API reference](../operations/api-reference.md#historical). ### Running diff --git a/docs/design/index.md b/docs/design/index.md index d803a704abee..3072ce397ff6 100644 --- a/docs/design/index.md +++ b/docs/design/index.md @@ -58,7 +58,7 @@ Druid servers fail, the system will automatically route around the damage until is designed to run 24/7 with no need for planned downtimes for any reason, including configuration changes and software updates. 6. **Cloud-native, fault-tolerant architecture that won't lose data.** Once Druid has ingested your data, a copy is -stored safely in [deep storage](architecture.html#deep-storage) (typically cloud storage, HDFS, or a shared filesystem). +stored safely in [deep storage](architecture.md#deep-storage) (typically cloud storage, HDFS, or a shared filesystem). Your data can be recovered from deep storage even if every single Druid server fails. For more limited failures affecting just a few Druid servers, replication ensures that queries are still possible while the system recovers. 7. **Indexes for quick filtering.** Druid uses [Roaring](https://roaringbitmap.org/) or @@ -77,7 +77,7 @@ summarization partially pre-aggregates your data, and can lead to big costs savi ## When should I use Druid? Druid is used by many companies of various sizes for many different use cases. 
Check out the
-[Powered by Apache Druid](/druid-powered) page
+[Powered by Apache Druid](https://druid.apache.org/druid-powered) page

Druid is likely a good choice if your use case fits a few of the following descriptors:

diff --git a/docs/design/indexer.md b/docs/design/indexer.md
index 791bde1a44b8..ea6594c7c9f5 100644
--- a/docs/design/indexer.md
+++ b/docs/design/indexer.md
@@ -22,7 +22,7 @@ title: "Indexer Process"
 ~ under the License.
 -->

-> The Indexer is an optional and experimental feature.
+> The Indexer is an optional and [experimental](../development/experimental.md) feature.
> Its memory management system is still under development and will be significantly enhanced in later releases.

The Apache Druid Indexer process is an alternative to the MiddleManager + Peon task execution system. Instead of forking a separate JVM process per-task, the Indexer runs tasks as separate threads within a single JVM process.

@@ -31,11 +31,11 @@ The Indexer is designed to be easier to configure and deploy compared to the Mid

### Configuration

-For Apache Druid Indexer Process Configuration, see [Indexer Configuration](../configuration/index.html#indexer).
+For Apache Druid Indexer Process Configuration, see [Indexer Configuration](../configuration/index.md#indexer).

### HTTP endpoints

-The Indexer process shares the same HTTP endpoints as the [MiddleManager](../operations/api-reference.html#middlemanager).
+The Indexer process shares the same HTTP endpoints as the [MiddleManager](../operations/api-reference.md#middlemanager).

### Running

@@ -51,7 +51,7 @@ The following resources are shared across all tasks running inside an Indexer pr
The query processing threads and buffers are shared across all tasks. The Indexer will serve queries from a single endpoint shared by all tasks.

-If [query caching](../configuration/index.html#indexer-caching) is enabled, the query cache is also shared across all tasks.
+If [query caching](../configuration/index.md#indexer-caching) is enabled, the query cache is also shared across all tasks. #### Server HTTP threads diff --git a/docs/design/indexing-service.md b/docs/design/indexing-service.md index d7bc46f89eae..392732b3cebd 100644 --- a/docs/design/indexing-service.md +++ b/docs/design/indexing-service.md @@ -30,7 +30,7 @@ Indexing [tasks](../ingestion/tasks.md) create (and sometimes destroy) Druid [se The indexing service is composed of three main components: a [Peon](../design/peons.md) component that can run a single task, a [Middle Manager](../design/middlemanager.md) component that manages Peons, and an [Overlord](../design/overlord.md) component that manages task distribution to MiddleManagers. Overlords and MiddleManagers may run on the same process or across multiple processes while MiddleManagers and Peons always run on the same process. -Tasks are managed using API endpoints on the Overlord service. Please see [Overlord Task API](../operations/api-reference.html#tasks) for more information. +Tasks are managed using API endpoints on the Overlord service. Please see [Overlord Task API](../operations/api-reference.md#tasks) for more information. ![Indexing Service](../assets/indexing_service.png "Indexing Service") diff --git a/docs/design/middlemanager.md b/docs/design/middlemanager.md index 89301f494d94..6a59debe1339 100644 --- a/docs/design/middlemanager.md +++ b/docs/design/middlemanager.md @@ -25,11 +25,11 @@ title: "MiddleManager Process" ### Configuration -For Apache Druid MiddleManager Process Configuration, see [Indexing Service Configuration](../configuration/index.html#middlemanager-and-peons). +For Apache Druid MiddleManager Process Configuration, see [Indexing Service Configuration](../configuration/index.md#middlemanager-and-peons). ### HTTP endpoints -For a list of API endpoints supported by the MiddleManager, please see the [API reference](../operations/api-reference.html#middlemanager). 
+For a list of API endpoints supported by the MiddleManager, please see the [API reference](../operations/api-reference.md#middlemanager). ### Overview diff --git a/docs/design/overlord.md b/docs/design/overlord.md index b7a4a3f704e4..0abd3e798dc9 100644 --- a/docs/design/overlord.md +++ b/docs/design/overlord.md @@ -25,11 +25,11 @@ title: "Overlord Process" ### Configuration -For Apache Druid Overlord Process Configuration, see [Overlord Configuration](../configuration/index.html#overlord). +For Apache Druid Overlord Process Configuration, see [Overlord Configuration](../configuration/index.md#overlord). ### HTTP endpoints -For a list of API endpoints supported by the Overlord, please see the [API reference](../operations/api-reference.html#overlord). +For a list of API endpoints supported by the Overlord, please see the [API reference](../operations/api-reference.md#overlord). ### Overview @@ -40,7 +40,7 @@ This mode is recommended if you intend to use the indexing service as the single ### Overlord console -The Overlord provides a UI for managing tasks and workers. For more details, please see [overlord console](../operations/management-uis.html#overlord-console). +The Overlord provides a UI for managing tasks and workers. For more details, please see [overlord console](../operations/management-uis.md#overlord-console). ### Blacklisted workers diff --git a/docs/design/peons.md b/docs/design/peons.md index 72eb72e1a78b..86a91612bcdd 100644 --- a/docs/design/peons.md +++ b/docs/design/peons.md @@ -25,11 +25,11 @@ title: "Peons" ### Configuration -For Apache Druid Peon Configuration, see [Peon Query Configuration](../configuration/index.html#peon-query-configuration) and [Additional Peon Configuration](../configuration/index.html#additional-peon-configuration). 
+For Apache Druid Peon Configuration, see [Peon Query Configuration](../configuration/index.md#peon-query-configuration) and [Additional Peon Configuration](../configuration/index.md#additional-peon-configuration). ### HTTP endpoints -For a list of API endpoints supported by the Peon, please see the [Peon API reference](../operations/api-reference.html#peon). +For a list of API endpoints supported by the Peon, please see the [Peon API reference](../operations/api-reference.md#peon). Peons run a single task in a single JVM. MiddleManager is responsible for creating Peons for running tasks. Peons should rarely (if ever for testing purposes) be run on their own. diff --git a/docs/design/processes.md b/docs/design/processes.md index 4c1e46a46a77..fd65338c891a 100644 --- a/docs/design/processes.md +++ b/docs/design/processes.md @@ -134,7 +134,7 @@ In clusters with very high segment counts, it can make sense to separate the Coo The Coordinator and Overlord processes can be run as a single combined process by setting the `druid.coordinator.asOverlord.enabled` property. -Please see [Coordinator Configuration: Operation](../configuration/index.html#coordinator-operation) for details. +Please see [Coordinator Configuration: Operation](../configuration/index.md#coordinator-operation) for details. ### Historicals and MiddleManagers diff --git a/docs/design/router.md b/docs/design/router.md index cc037e229da8..a2ccbd79bfde 100644 --- a/docs/design/router.md +++ b/docs/design/router.md @@ -34,11 +34,11 @@ In addition to query routing, the Router also runs the [Druid Console](../operat ### Configuration -For Apache Druid Router Process Configuration, see [Router Configuration](../configuration/index.html#router). +For Apache Druid Router Process Configuration, see [Router Configuration](../configuration/index.md#router). ### HTTP endpoints -For a list of API endpoints supported by the Router, see [Router API](../operations/api-reference.html#router). 
+For a list of API endpoints supported by the Router, see [Router API](../operations/api-reference.md#router). ### Running diff --git a/docs/development/extensions-contrib/materialized-view.md b/docs/development/extensions-contrib/materialized-view.md index 0b2eb3d571bb..4e7f31ee6d4b 100644 --- a/docs/development/extensions-contrib/materialized-view.md +++ b/docs/development/extensions-contrib/materialized-view.md @@ -73,7 +73,7 @@ A sample derivativeDataSource supervisor spec is shown below: |baseDataSource |The name of base dataSource. This dataSource data should be already stored inside Druid, and the dataSource will be used as input data.|yes| |dimensionsSpec |Specifies the dimensions of the data. These dimensions must be the subset of baseDataSource's dimensions.|yes| |metricsSpec |A list of aggregators. These metrics must be the subset of baseDataSource's metrics. See [aggregations](../../querying/aggregations.md).|yes| -|tuningConfig |TuningConfig must be HadoopTuningConfig. See [Hadoop tuning config](../../ingestion/hadoop.html#tuningconfig).|yes| +|tuningConfig |TuningConfig must be HadoopTuningConfig. See [Hadoop tuning config](../../ingestion/hadoop.md#tuningconfig).|yes| |dataSource |The name of this derived dataSource. |no(default=baseDataSource-hashCode of supervisor)| |hadoopDependencyCoordinates |A JSON array of Hadoop dependency coordinates that Druid will use, this property will override the default Hadoop coordinates. Once specified, Druid will look for those Hadoop dependencies from the location specified by druid.extensions.hadoopDependenciesDir |no| |classpathPrefix |Classpath that will be prepended for the Peon process. 
|no| diff --git a/docs/development/extensions-contrib/moving-average-query.md b/docs/development/extensions-contrib/moving-average-query.md index 3e9bb536ae16..db341be260dd 100644 --- a/docs/development/extensions-contrib/moving-average-query.md +++ b/docs/development/extensions-contrib/moving-average-query.md @@ -34,7 +34,7 @@ Moving Average encapsulates the [groupBy query](../../querying/groupbyquery.md) It runs the query in two main phases: -1. Runs an inner [groupBy](../../querying/groupbyquery.html) or [timeseries](../../querying/timeseriesquery.html) query to compute Aggregators (i.e. daily count of events). +1. Runs an inner [groupBy](../../querying/groupbyquery.md) or [timeseries](../../querying/timeseriesquery.md) query to compute Aggregators (i.e. daily count of events). 2. Passes over aggregated results in Broker, in order to compute Averagers (i.e. moving 7 day average of the daily count). #### Main enhancements provided by this extension: @@ -70,7 +70,7 @@ There are currently no configuration properties specific to Moving Average. 
|dimensions|A JSON list of [DimensionSpec](../../querying/dimensionspecs.md) (Notice that property is optional)|no| |limitSpec|See [LimitSpec](../../querying/limitspec.md)|no| |having|See [Having](../../querying/having.md)|no| -|granularity|A period granularity; See [Period Granularities](../../querying/granularities.html#period-granularities)|yes| +|granularity|A period granularity; See [Period Granularities](../../querying/granularities.md#period-granularities)|yes| |filter|See [Filters](../../querying/filters.md)|no| |aggregations|Aggregations forms the input to Averagers; See [Aggregations](../../querying/aggregations.md)|yes| |postAggregations|Supports only aggregations as input; See [Post Aggregations](../../querying/post-aggregations.md)|no| diff --git a/docs/development/extensions-contrib/redis-cache.md b/docs/development/extensions-contrib/redis-cache.md index 7db423dff766..4bd85e9cc506 100644 --- a/docs/development/extensions-contrib/redis-cache.md +++ b/docs/development/extensions-contrib/redis-cache.md @@ -39,9 +39,9 @@ java -classpath "druid_dir/lib/*" org.apache.druid.cli.Main tools pull-deps -c o To enable this extension after installation, 1. [include](../../development/extensions.md#loading-extensions) this `druid-redis-cache` extension -2. to enable cache on broker nodes, follow [broker caching docs](../../configuration/index.html#broker-caching) to set related properties -3. to enable cache on historical nodes, follow [historical caching docs](../../configuration/index.html#historical-caching) to set related properties -4. to enable cache on middle manager nodes, follow [peon caching docs](../../configuration/index.html#peon-caching) to set related properties +2. to enable cache on broker nodes, follow [broker caching docs](../../configuration/index.md#broker-caching) to set related properties +3. to enable cache on historical nodes, follow [historical caching docs](../../configuration/index.md#historical-caching) to set related properties +4. 
to enable cache on middle manager nodes, follow [peon caching docs](../../configuration/index.md#peon-caching) to set related properties 5. set `druid.cache.type` to `redis` 6. add the following properties diff --git a/docs/development/extensions-contrib/statsd.md b/docs/development/extensions-contrib/statsd.md index 7a6dd6bccd35..144c216da8fb 100644 --- a/docs/development/extensions-contrib/statsd.md +++ b/docs/development/extensions-contrib/statsd.md @@ -47,7 +47,7 @@ All the configuration parameters for the StatsD emitter are under `druid.emitter |`druid.emitter.statsd.dogstatsd`|Flag to enable [DogStatsD](https://docs.datadoghq.com/developers/dogstatsd/) support. Causes dimensions to be included as tags, not as a part of the metric name. `convertRange` fields will be ignored.|no|false| |`druid.emitter.statsd.dogstatsdConstantTags`|If `druid.emitter.statsd.dogstatsd` is true, the tags in the JSON list of strings will be sent with every event.|no|[]| |`druid.emitter.statsd.dogstatsdServiceAsTag`|If `druid.emitter.statsd.dogstatsd` and `druid.emitter.statsd.dogstatsdServiceAsTag` are true, druid service (e.g. `druid/broker`, `druid/coordinator`, etc) is reported as a tag (e.g. `druid_service:druid/broker`) instead of being included in metric name (e.g. `druid.broker.query.time`) and `druid` is used as metric prefix (e.g. 
`druid.query.time`).|no|false| -|`druid.emitter.statsd.dogstatsdEvents`|If `druid.emitter.statsd.dogstatsd` and `druid.emitter.statsd.dogstatsdEvents` are true, [Alert events](../../operations/alerts.html) are reported to DogStatsD.|no|false| +|`druid.emitter.statsd.dogstatsdEvents`|If `druid.emitter.statsd.dogstatsd` and `druid.emitter.statsd.dogstatsdEvents` are true, [Alert events](../../operations/alerts.md) are reported to DogStatsD.|no|false| ### Druid to StatsD Event Converter diff --git a/docs/development/extensions-core/bloom-filter.md b/docs/development/extensions-core/bloom-filter.md index 602f1f231869..42b1fae984c9 100644 --- a/docs/development/extensions-core/bloom-filter.md +++ b/docs/development/extensions-core/bloom-filter.md @@ -76,7 +76,7 @@ This string can then be used in the native or SQL Druid query. |`type` |Filter Type. Should always be `bloom`|yes| |`dimension` |The dimension to filter over. | yes | |`bloomKFilter` |Base64 encoded Binary representation of `org.apache.hive.common.util.BloomKFilter`| yes | -|`extractionFn`|[Extraction function](../../querying/dimensionspecs.html#extraction-functions) to apply to the dimension values |no| +|`extractionFn`|[Extraction function](../../querying/dimensionspecs.md#extraction-functions) to apply to the dimension values |no| ### Serialized Format for BloomKFilter diff --git a/docs/development/extensions-core/datasketches-extension.md b/docs/development/extensions-core/datasketches-extension.md index 07d10214e42a..df581616704e 100644 --- a/docs/development/extensions-core/datasketches-extension.md +++ b/docs/development/extensions-core/datasketches-extension.md @@ -33,7 +33,7 @@ druid.extensions.loadList=["druid-datasketches"] The following modules are available: -* [Theta sketch](datasketches-theta.html) - approximate distinct counting with set operations (union, intersection and set difference). 
-* [Tuple sketch](datasketches-tuple.html) - extension of Theta sketch to support values associated with distinct keys (arrays of numeric values in this specialized implementation). -* [Quantiles sketch](datasketches-quantiles.html) - approximate distribution of comparable values to obtain ranks, quantiles and histograms. This is a specialized implementation for numeric values. -* [HLL sketch](datasketches-hll.html) - approximate distinct counting using very compact HLL sketch. +* [Theta sketch](datasketches-theta.md) - approximate distinct counting with set operations (union, intersection and set difference). +* [Tuple sketch](datasketches-tuple.md) - extension of Theta sketch to support values associated with distinct keys (arrays of numeric values in this specialized implementation). +* [Quantiles sketch](datasketches-quantiles.md) - approximate distribution of comparable values to obtain ranks, quantiles and histograms. This is a specialized implementation for numeric values. +* [HLL sketch](datasketches-hll.md) - approximate distinct counting using very compact HLL sketch. diff --git a/docs/development/extensions-core/datasketches-quantiles.md b/docs/development/extensions-core/datasketches-quantiles.md index 031fb02f8981..31313a226c06 100644 --- a/docs/development/extensions-core/datasketches-quantiles.md +++ b/docs/development/extensions-core/datasketches-quantiles.md @@ -23,7 +23,7 @@ title: "DataSketches Quantiles Sketch module" --> -This module provides Apache Druid aggregators based on numeric quantiles DoublesSketch from [Apache DataSketches](https://datasketches.apache.org/) library. Quantiles sketch is a mergeable streaming algorithm to estimate the distribution of values, and approximately answer queries about the rank of a value, probability mass function of the distribution (PMF) or histogram, cumulative distribution function (CDF), and quantiles (median, min, max, 95th percentile and such). 
See [Quantiles Sketch Overview](https://datasketches.apache.org/docs/Quantiles/QuantilesOverview.html). +This module provides Apache Druid aggregators based on numeric quantiles DoublesSketch from [Apache DataSketches](https://datasketches.apache.org/) library. Quantiles sketch is a mergeable streaming algorithm to estimate the distribution of values, and approximately answer queries about the rank of a value, probability mass function of the distribution (PMF) or histogram, cumulative distribution function (CDF), and quantiles (median, min, max, 95th percentile and such). See [Quantiles Sketch Overview](https://datasketches.apache.org/docs/Quantiles/QuantilesOverview). There are three major modes of operation: @@ -55,7 +55,7 @@ The result of the aggregation is a DoublesSketch that is the union of all sketch |type|This String should always be "quantilesDoublesSketch"|yes| |name|A String for the output (result) name of the calculation.|yes| |fieldName|A String for the name of the input field (can contain sketches or raw numeric values).|yes| -|k|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be a power of 2 from 2 to 32768. See the [Quantiles Accuracy](https://datasketches.apache.org/docs/Quantiles/QuantilesAccuracy.html) for details. |no, defaults to 128| +|k|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be a power of 2 from 2 to 32768. See the [Quantiles Accuracy](https://datasketches.apache.org/docs/Quantiles/QuantilesAccuracy) for details. 
|no, defaults to 128| ### Post Aggregators diff --git a/docs/development/extensions-core/datasketches-theta.md b/docs/development/extensions-core/datasketches-theta.md index 5b5564e714d9..83247e8bedaa 100644 --- a/docs/development/extensions-core/datasketches-theta.md +++ b/docs/development/extensions-core/datasketches-theta.md @@ -51,7 +51,7 @@ druid.extensions.loadList=["druid-datasketches"] |name|A String for the output (result) name of the calculation.|yes| |fieldName|A String for the name of the aggregator used at ingestion time.|yes| |isInputThetaSketch|This should only be used at indexing time if your input data contains theta sketch objects. This would be the case if you use datasketches library outside of Druid, say with Pig/Hive, to produce the data that you are ingesting into Druid |no, defaults to false| -|size|Must be a power of 2. Internally, size refers to the maximum number of entries sketch object will retain. Higher size means higher accuracy but more space to store sketches. Note that after you index with a particular size, druid will persist sketch in segments and you will use size greater or equal to that at query time. See the [DataSketches site](https://datasketches.apache.org/docs/Theta/ThetaSize.html) for details. In general, We recommend just sticking to default size. |no, defaults to 16384| +|size|Must be a power of 2. Internally, size refers to the maximum number of entries sketch object will retain. Higher size means higher accuracy but more space to store sketches. Note that after you index with a particular size, druid will persist sketch in segments and you will use size greater or equal to that at query time. See the [DataSketches site](https://datasketches.apache.org/docs/Theta/ThetaSize) for details. In general, We recommend just sticking to default size. 
|no, defaults to 16384| ### Post Aggregators diff --git a/docs/development/extensions-core/datasketches-tuple.md b/docs/development/extensions-core/datasketches-tuple.md index b1150c5e7c74..13c4590bc7cd 100644 --- a/docs/development/extensions-core/datasketches-tuple.md +++ b/docs/development/extensions-core/datasketches-tuple.md @@ -49,7 +49,7 @@ druid.extensions.loadList=["druid-datasketches"] |type|This String should always be "arrayOfDoublesSketch"|yes| |name|A String for the output (result) name of the calculation.|yes| |fieldName|A String for the name of the input field.|yes| -|nominalEntries|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be a power of 2. See the [Theta sketch accuracy](https://datasketches.apache.org/docs/Theta/ThetaErrorTable.html) for details. |no, defaults to 16384| +|nominalEntries|Parameter that determines the accuracy and size of the sketch. Higher k means higher accuracy but more space to store sketches. Must be a power of 2. See the [Theta sketch accuracy](https://datasketches.apache.org/docs/Theta/ThetaErrorTable) for details. |no, defaults to 16384| |numberOfValues|Number of values associated with each distinct key. |no, defaults to 1| |metricColumns|If building sketches from raw data, an array of names of the input columns containing numeric values to be associated with each distinct key.|no, defaults to empty array| @@ -118,7 +118,7 @@ Returns a list of variance values from a given ArrayOfDoublesSketch. The result #### Quantiles sketch from a column -Returns a quantiles DoublesSketch constructed from a given column of values from a given ArrayOfDoublesSketch using optional parameter k that determines the accuracy and size of the quantiles sketch. 
See [Quantiles Sketch Module](datasketches-quantiles.html) +Returns a quantiles DoublesSketch constructed from a given column of values from a given ArrayOfDoublesSketch using optional parameter k that determines the accuracy and size of the quantiles sketch. See [Quantiles Sketch Module](datasketches-quantiles.md) * The column number is 1-based and is optional (the default is 1). * The parameter k is optional (the default is defined in the sketch library). @@ -151,7 +151,7 @@ Returns a result of a specified set operation on the given array of sketches. Su #### Student's t-test -Performs Student's t-test and returns a list of p-values given two instances of ArrayOfDoublesSketch. The result will be N double values, where N is the number of double values kept in the sketch per key. See [t-test documentation](http://commons.apache.org/proper/commons-math/javadocs/api-3.4/org/apache/commons/math3/stat/inference/TTest.html). +Performs Student's t-test and returns a list of p-values given two instances of ArrayOfDoublesSketch. The result will be N double values, where N is the number of double values kept in the sketch per key. See [t-test documentation](http://commons.apache.org/proper/commons-math/javadocs/api-3.4/org/apache/commons/math3/stat/inference/TTest). ```json { diff --git a/docs/development/extensions-core/druid-basic-security.md b/docs/development/extensions-core/druid-basic-security.md index 892306e4a07b..6b9de538fe84 100644 --- a/docs/development/extensions-core/druid-basic-security.md +++ b/docs/development/extensions-core/druid-basic-security.md @@ -523,11 +523,11 @@ GET requires READ permission, while POST and DELETE require WRITE permission. Queries on Druid datasources require DATASOURCE READ permissions for the specified datasource. 
-Queries on the [INFORMATION_SCHEMA tables](../../querying/sql.html#information-schema) will +Queries on the [INFORMATION_SCHEMA tables](../../querying/sql.md#information-schema) will return information about datasources that the caller has DATASOURCE READ access to. Other datasources will be omitted. -Queries on the [system schema tables](../../querying/sql.html#system-schema) require the following permissions: +Queries on the [system schema tables](../../querying/sql.md#system-schema) require the following permissions: - `segments`: Segments will be filtered based on DATASOURCE READ permissions. - `servers`: The user requires STATE READ permissions. - `server_segments`: The user requires STATE READ permissions and segments will be filtered based on DATASOURCE READ permissions. diff --git a/docs/development/extensions-core/druid-pac4j.md b/docs/development/extensions-core/druid-pac4j.md index 3dea81146714..3855955fc8b2 100644 --- a/docs/development/extensions-core/druid-pac4j.md +++ b/docs/development/extensions-core/druid-pac4j.md @@ -25,7 +25,7 @@ title: "Druid pac4j based Security extension" Apache Druid Extension to enable [OpenID Connect](https://openid.net/connect/) based Authentication for Druid Processes using [pac4j](https://github.com/pac4j/pac4j) as the underlying client library. This can be used with any authentication server that supports same e.g. [Okta](https://developer.okta.com/). -This extension should only be used at the router node to enable a group of users in existing authentication server to interact with Druid cluster, using the [Web Console](../../operations/druid-console.html). This extension does not support JDBC client authentication. +This extension should only be used at the router node to enable a group of users in existing authentication server to interact with Druid cluster, using the [Web Console](../../operations/druid-console.md). This extension does not support JDBC client authentication. 
## Configuration diff --git a/docs/development/extensions-core/druid-ranger-security.md b/docs/development/extensions-core/druid-ranger-security.md index 77f3eb9ce62b..4bc22787d40a 100644 --- a/docs/development/extensions-core/druid-ranger-security.md +++ b/docs/development/extensions-core/druid-ranger-security.md @@ -106,9 +106,9 @@ GET requires READ permission, while POST and DELETE require WRITE permission. Queries on Druid datasources require DATASOURCE READ permissions for the specified datasource. -Queries on the [INFORMATION_SCHEMA tables](../../querying/sql.html#information-schema) will return information about datasources that the caller has DATASOURCE READ access to. Other datasources will be omitted. +Queries on the [INFORMATION_SCHEMA tables](../../querying/sql.md#information-schema) will return information about datasources that the caller has DATASOURCE READ access to. Other datasources will be omitted. -Queries on the [system schema tables](../../querying/sql.html#system-schema) require the following permissions: +Queries on the [system schema tables](../../querying/sql.md#system-schema) require the following permissions: - `segments`: Segments will be filtered based on DATASOURCE READ permissions. - `servers`: The user requires STATE READ permissions. - `server_segments`: The user requires STATE READ permissions and segments will be filtered based on DATASOURCE READ permissions. 
diff --git a/docs/development/extensions-core/hdfs.md b/docs/development/extensions-core/hdfs.md index 711404703b72..bd642c60bfad 100644 --- a/docs/development/extensions-core/hdfs.md +++ b/docs/development/extensions-core/hdfs.md @@ -55,7 +55,7 @@ To use the AWS S3 as the deep storage, you need to configure `druid.storage.stor |`druid.storage.type`|hdfs| |Must be set.| |`druid.storage.storageDirectory`|s3a://bucket/example/directory or s3n://bucket/example/directory|Path to the deep storage|Must be set.| -You also need to include the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html), especially the `hadoop-aws.jar` in the Druid classpath. +You also need to include the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/), especially the `hadoop-aws.jar` in the Druid classpath. Run the below command to install the `hadoop-aws.jar` file under `${DRUID_HOME}/extensions/druid-hdfs-storage` in all nodes. ```bash @@ -64,7 +64,7 @@ cp ${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${H ``` Finally, you need to add the below properties in the `core-site.xml`. -For more configurations, see the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html). +For more configurations, see the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/). ```xml @@ -111,7 +111,7 @@ and authentication properties needed for GCS. You may want to copy the below example properties. Please follow the instructions at [https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md) for more details. 
-For more configurations, [GCS core default](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/conf/gcs-core-default.xml) +For more configurations, [GCS core default](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/v2.0.0/gcs/conf/gcs-core-default.xml) and [GCS core template](https://github.com/GoogleCloudPlatform/bdutil/blob/master/conf/hadoop2/gcs-core-template.xml). ```xml diff --git a/docs/development/extensions-core/kafka-ingestion.md b/docs/development/extensions-core/kafka-ingestion.md index 3eb3499fb9e7..a0a8c13d18f1 100644 --- a/docs/development/extensions-core/kafka-ingestion.md +++ b/docs/development/extensions-core/kafka-ingestion.md @@ -177,7 +177,7 @@ The tuningConfig is optional and default parameters will be used if no tuningCon | `indexSpecForIntermediatePersists`| | Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. This can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. However, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](#indexspec) for possible values. | no (default = same as indexSpec) | | `reportParseExceptions` | Boolean | *DEPRECATED*. If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped. Setting `reportParseExceptions` to true will override existing configurations for `maxParseExceptions` and `maxSavedParseExceptions`, setting `maxParseExceptions` to 0 and limiting `maxSavedParseExceptions` to no more than 1. | no (default == false) | | `handoffConditionTimeout` | Long | Milliseconds to wait for segment handoff. It must be >= 0, where 0 means to wait forever. 
| no (default == 0) | -| `resetOffsetAutomatically` | Boolean | Controls behavior when Druid needs to read Kafka messages that are no longer available (i.e. when OffsetOutOfRangeException is encountered).

If false, the exception will bubble up, which will cause your tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation; potentially using the [Reset Supervisor API](../../operations/api-reference.html#supervisors). This mode is useful for production, since it will make you aware of issues with ingestion.

If true, Druid will automatically reset to the earlier or latest offset available in Kafka, based on the value of the `useEarliestOffset` property (earliest if true, latest if false). Please note that this can lead to data being _DROPPED_ (if `useEarliestOffset` is false) or _DUPLICATED_ (if `useEarliestOffset` is true) without your knowledge. Messages will be logged indicating that a reset has occurred, but ingestion will continue. This mode is useful for non-production situations, since it will make Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data.

This feature behaves similarly to the Kafka `auto.offset.reset` consumer property. | no (default == false) | +| `resetOffsetAutomatically` | Boolean | Controls behavior when Druid needs to read Kafka messages that are no longer available (i.e. when OffsetOutOfRangeException is encountered).

If false, the exception will bubble up, which will cause your tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation; potentially using the [Reset Supervisor API](../../operations/api-reference.md#supervisors). This mode is useful for production, since it will make you aware of issues with ingestion.

If true, Druid will automatically reset to the earlier or latest offset available in Kafka, based on the value of the `useEarliestOffset` property (earliest if true, latest if false). Please note that this can lead to data being _DROPPED_ (if `useEarliestOffset` is false) or _DUPLICATED_ (if `useEarliestOffset` is true) without your knowledge. Messages will be logged indicating that a reset has occurred, but ingestion will continue. This mode is useful for non-production situations, since it will make Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data.

This feature behaves similarly to the Kafka `auto.offset.reset` consumer property. | no (default == false) | | `workerThreads` | Integer | The number of threads that the supervisor uses to handle requests/responses for worker tasks, along with any other internal asynchronous operation. | no (default == min(10, taskCount)) | | `chatThreads` | Integer | The number of threads that will be used for communicating with indexing tasks. | no (default == min(10, taskCount * replicas)) | | `chatRetries` | Integer | The number of times HTTP requests to indexing tasks will be retried before considering tasks unresponsive. | no (default == 8) | @@ -218,12 +218,12 @@ For Concise bitmaps: |Field|Type|Description|Required| |-----|----|-----------|--------| -|`type`|String|See [Additional Peon Configuration: SegmentWriteOutMediumFactory](../../configuration/index.html#segmentwriteoutmediumfactory) for explanation and available options.|yes| +|`type`|String|See [Additional Peon Configuration: SegmentWriteOutMediumFactory](../../configuration/index.md#segmentwriteoutmediumfactory) for explanation and available options.|yes| ## Operations This section gives descriptions of how some supervisor APIs work specifically in Kafka Indexing Service. -For all supervisor APIs, please check [Supervisor APIs](../../operations/api-reference.html#supervisors). +For all supervisor APIs, please check [Supervisor APIs](../../operations/api-reference.md#supervisors). 
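To make the `resetOffsetAutomatically` trade-off concrete, here is a minimal, illustrative `tuningConfig` fragment for a Kafka supervisor spec (field names come from the table above; the values are placeholders, not recommendations):

```json
{
  "type": "kafka",
  "resetOffsetAutomatically": false,
  "handoffConditionTimeout": 0
}
```

With `resetOffsetAutomatically` left at `false`, an `OffsetOutOfRangeException` fails the task and surfaces the problem, which is usually the safer choice for production.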
### Getting Supervisor Status Report diff --git a/docs/development/extensions-core/kinesis-ingestion.md b/docs/development/extensions-core/kinesis-ingestion.md index 18a4c4a38282..83c122a181be 100644 --- a/docs/development/extensions-core/kinesis-ingestion.md +++ b/docs/development/extensions-core/kinesis-ingestion.md @@ -174,7 +174,7 @@ The tuningConfig is optional and default parameters will be used if no tuningCon | `indexSpecForIntermediatePersists` | | Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. This can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. However, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](#indexspec) for possible values. | no (default = same as indexSpec) | | `reportParseExceptions` | Boolean | If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped. | no (default == false) | | `handoffConditionTimeout` | Long | Milliseconds to wait for segment handoff. It must be >= 0, where 0 means to wait forever. | no (default == 0) | -| `resetOffsetAutomatically` | Boolean | Controls behavior when Druid needs to read Kinesis messages that are no longer available.

If false, the exception will bubble up, which will cause your tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation; potentially using the [Reset Supervisor API](../../operations/api-reference.html#supervisors). This mode is useful for production, since it will make you aware of issues with ingestion.

If true, Druid will automatically reset to the earlier or latest sequence number available in Kinesis, based on the value of the `useEarliestSequenceNumber` property (earliest if true, latest if false). Please note that this can lead to data being _DROPPED_ (if `useEarliestSequenceNumber` is false) or _DUPLICATED_ (if `useEarliestSequenceNumber` is true) without your knowledge. Messages will be logged indicating that a reset has occurred, but ingestion will continue. This mode is useful for non-production situations, since it will make Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data. | no (default == false) | +| `resetOffsetAutomatically` | Boolean | Controls behavior when Druid needs to read Kinesis messages that are no longer available.

If false, the exception will bubble up, which will cause your tasks to fail and ingestion to halt. If this occurs, manual intervention is required to correct the situation; potentially using the [Reset Supervisor API](../../operations/api-reference.md#supervisors). This mode is useful for production, since it will make you aware of issues with ingestion.

If true, Druid will automatically reset to the earlier or latest sequence number available in Kinesis, based on the value of the `useEarliestSequenceNumber` property (earliest if true, latest if false). Please note that this can lead to data being _DROPPED_ (if `useEarliestSequenceNumber` is false) or _DUPLICATED_ (if `useEarliestSequenceNumber` is true) without your knowledge. Messages will be logged indicating that a reset has occurred, but ingestion will continue. This mode is useful for non-production situations, since it will make Druid attempt to recover from problems automatically, even if they lead to quiet dropping or duplicating of data. | no (default == false) | | `skipSequenceNumberAvailabilityCheck` | Boolean | Whether to enable checking if the current sequence number is still available in a particular Kinesis shard. If set to false, the indexing task will attempt to reset the current sequence number (or not), depending on the value of `resetOffsetAutomatically`. | no (default == false) | | `workerThreads` | Integer | The number of threads that the supervisor uses to handle requests/responses for worker tasks, along with any other internal asynchronous operation. | no (default == min(10, taskCount)) | | `chatThreads` | Integer | The number of threads that will be used for communicating with indexing tasks. | no (default == min(10, taskCount * replicas)) | @@ -222,12 +222,12 @@ For Concise bitmaps: |Field|Type|Description|Required| |-----|----|-----------|--------| -|`type`|String|See [Additional Peon Configuration: SegmentWriteOutMediumFactory](../../configuration/index.html#segmentwriteoutmediumfactory) for explanation and available options.|yes| +|`type`|String|See [Additional Peon Configuration: SegmentWriteOutMediumFactory](../../configuration/index.md#segmentwriteoutmediumfactory) for explanation and available options.|yes| ## Operations This section gives descriptions of how some supervisor APIs work specifically in Kinesis Indexing Service. 
-For all supervisor APIs, please check [Supervisor APIs](../../operations/api-reference.html#supervisors). +For all supervisor APIs, please check [Supervisor APIs](../../operations/api-reference.md#supervisors). ### AWS Authentication To authenticate with AWS, you must provide your AWS access key and AWS secret key via runtime.properties, for example: diff --git a/docs/development/extensions-core/lookups-cached-global.md b/docs/development/extensions-core/lookups-cached-global.md index 9ce6997c916a..7bef143e80a6 100644 --- a/docs/development/extensions-core/lookups-cached-global.md +++ b/docs/development/extensions-core/lookups-cached-global.md @@ -86,7 +86,7 @@ The parameters are as follows |--------|-----------|--------|-------| |`extractionNamespace`|Specifies how to populate the local cache. See below|Yes|-| |`firstCacheTimeout`|How long to wait (in ms) for the first run of the cache to populate. 0 indicates to not wait|No|`0` (do not wait)| -|`injective`|If the underlying map is [injective](../../querying/lookups.html#query-execution) (keys and values are unique) then optimizations can occur internally by setting this to `true`|No|`false`| +|`injective`|If the underlying map is [injective](../../querying/lookups.md#query-execution) (keys and values are unique) then optimizations can occur internally by setting this to `true`|No|`false`| If `firstCacheTimeout` is set to a non-zero value, it should be less than `druid.manager.lookups.hostUpdateTimeout`. If `firstCacheTimeout` is NOT set, then management is essentially asynchronous and does not know if a lookup succeeded or failed in starting. In such a case logs from the processes using lookups should be monitored for repeated failures. 
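For the Kinesis AWS authentication section above, a sketch of the `runtime.properties` entries (the key values are placeholders; confirm the property names against the full Kinesis extension documentation):

```properties
druid.kinesis.accessKey=YOUR_ACCESS_KEY
druid.kinesis.secretKey=YOUR_SECRET_KEY
```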
@@ -95,7 +95,7 @@ Proper functionality of globally cached lookups requires the following extension ## Example configuration -In a simple case where only one [tier](../../querying/lookups.html#dynamic-configuration) exists (`realtime_customer2`) with one `cachedNamespace` lookup called `country_code`, the resulting configuration JSON looks similar to the following: +In a simple case where only one [tier](../../querying/lookups.md#dynamic-configuration) exists (`realtime_customer2`) with one `cachedNamespace` lookup called `country_code`, the resulting configuration JSON looks similar to the following: ```json { diff --git a/docs/development/extensions-core/s3.md b/docs/development/extensions-core/s3.md index b30ba4ce83ab..1689e1ab9043 100644 --- a/docs/development/extensions-core/s3.md +++ b/docs/development/extensions-core/s3.md @@ -76,7 +76,7 @@ Druid uses the following credentials provider chain to connect to your S3 bucket |6|ECS container credentials|Based on environment variables available on AWS ECS (AWS_CONTAINER_CREDENTIALS_RELATIVE_URI or AWS_CONTAINER_CREDENTIALS_FULL_URI) as described in the [EC2ContainerCredentialsProviderWrapper documentation](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/EC2ContainerCredentialsProviderWrapper.html)| |7|Instance profile information|Based on the instance profile you may have attached to your druid instance| -You can find more information about authentication method [here](https://docs.aws.amazon.com/fr_fr/sdk-for-java/v1/developer-guide/credentials.html)
+You can find more information about authentication methods [here](https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/credentials.html)
**Note :** *Order is important here as it indicates the precedence of authentication methods.
So if you are trying to use Instance profile information, you **must not** set `druid.s3.accessKey` and `druid.s3.secretKey` in your Druid runtime.properties* @@ -118,9 +118,9 @@ As an example, to set the region to 'us-east-1' through system properties: ## Server-side encryption -You can enable [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption.html) by setting +You can enable [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption.html) by setting `druid.storage.sse.type` to a supported type of server-side encryption. The current supported types are: -- s3: [Server-side encryption with S3-managed encryption keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingServerSideEncryption.html) -- kms: [Server-side encryption with AWS KMS–Managed Keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html) -- custom: [Server-side encryption with Customer-Provided Encryption Keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/ServerSideEncryptionCustomerKeys.html) +- s3: [Server-side encryption with S3-managed encryption keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingServerSideEncryption.html) +- kms: [Server-side encryption with AWS KMS–Managed Keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html) +- custom: [Server-side encryption with Customer-Provided Encryption Keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/ServerSideEncryptionCustomerKeys.html) diff --git a/docs/development/extensions-core/simple-client-sslcontext.md b/docs/development/extensions-core/simple-client-sslcontext.md index 7452a7b5bb1b..1096e1fa4a13 100644 --- a/docs/development/extensions-core/simple-client-sslcontext.md +++ b/docs/development/extensions-core/simple-client-sslcontext.md @@ -23,9 +23,9 @@ title: "Simple SSLContext Provider Module" --> -This Apache Druid module contains a simple implementation of
[SSLContext](http://docs.oracle.com/javase/8/docs/api/javax/net/ssl/SSLContext.html) +This Apache Druid module contains a simple implementation of [SSLContext](http://docs.oracle.com/javase/8/docs/api/javax/net/ssl/SSLContext.html) that will be injected to be used with HttpClient that Druid processes use internally to communicate with each other. To learn more about -Java's SSL support, please refer to [this](http://docs.oracle.com/javase/8/docs/technotes/guides/security/jsse/JSSERefGuide.html) guide. +Java's SSL support, please refer to [this](http://docs.oracle.com/javase/8/docs/technotes/guides/security/jsse/JSSERefGuide.html) guide. |Property|Description|Default|Required| @@ -48,5 +48,5 @@ The following table contains optional parameters for supporting client certifica |`druid.client.https.keyManagerPassword`|The [Password Provider](../../operations/password-provider.md) or String password for the Key Manager.|none|no| |`druid.client.https.validateHostnames`|Validate the hostname of the server. This should not be disabled unless you are using [custom TLS certificate checks](../../operations/tls-support.md) and know that standard hostname validation is not needed.|true|no| -This [document](http://docs.oracle.com/javase/8/docs/technotes/guides/security/StandardNames.html) lists all the possible +This [document](http://docs.oracle.com/javase/8/docs/technotes/guides/security/StandardNames.html) lists all the possible values for the above mentioned configs among others provided by Java implementation. diff --git a/docs/development/javascript.md b/docs/development/javascript.md index c05acc9e62b7..ad496595b61b 100644 --- a/docs/development/javascript.md +++ b/docs/development/javascript.md @@ -30,13 +30,13 @@ This page discusses how to use JavaScript to extend Apache Druid.
JavaScript can be used to extend Druid in a variety of ways: -- [Aggregators](../querying/aggregations.html#javascript-aggregator) -- [Extraction functions](../querying/dimensionspecs.html#javascript-extraction-function) -- [Filters](../querying/filters.html#javascript-filter) -- [Post-aggregators](../querying/post-aggregations.html#javascript-post-aggregator) -- [Input parsers](../ingestion/data-formats.html#javascript-parsespec) -- [Router strategy](../design/router.html#javascript) -- [Worker select strategy](../configuration/index.html#javascript-worker-select-strategy) +- [Aggregators](../querying/aggregations.md#javascript-aggregator) +- [Extraction functions](../querying/dimensionspecs.md#javascript-extraction-function) +- [Filters](../querying/filters.md#javascript-filter) +- [Post-aggregators](../querying/post-aggregations.md#javascript-post-aggregator) +- [Input parsers](../ingestion/data-formats.md#javascript-parsespec) +- [Router strategy](../design/router.md#javascript) +- [Worker select strategy](../configuration/index.md#javascript-worker-select-strategy) JavaScript can be injected dynamically at runtime, making it convenient to rapidly prototype new functionality without needing to write and deploy Druid extensions. @@ -48,7 +48,7 @@ Druid uses the Mozilla Rhino engine at optimization level 9 to compile and execu Druid does not execute JavaScript functions in a sandbox, so they have full access to the machine. So JavaScript functions allow users to execute arbitrary code inside druid process. So, by default, JavaScript is disabled. However, on dev/staging environments or secured production environments you can enable those by setting -the [configuration property](../configuration/index.html#javascript) +the [configuration property](../configuration/index.md#javascript) `druid.javascript.enabled = true`. 
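As an illustration of one of the extension points listed above, a sketch of a JavaScript filter as it would appear inside a query spec (the dimension name and predicate are invented for the example, and it only runs with `druid.javascript.enabled = true`):

```json
{
  "type": "javascript",
  "dimension": "countryCode",
  "function": "function(value) { return value === 'US'; }"
}
```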
## Global variables diff --git a/docs/development/overview.md b/docs/development/overview.md index 5ff77af07cf4..b3b5c8e0c751 100644 --- a/docs/development/overview.md +++ b/docs/development/overview.md @@ -72,5 +72,5 @@ At some point in the future, we will likely move the internal UI code out of cor ## Client libraries We welcome contributions for new client libraries to interact with Druid. See the -[Community and third-party libraries](https://druid.apache.org/libraries.html) page for links to existing client +[Community and third-party libraries](https://druid.apache.org/libraries.html) page for links to existing client libraries. diff --git a/docs/ingestion/data-formats.md b/docs/ingestion/data-formats.md index 21adee13a106..2c667817d92e 100644 --- a/docs/ingestion/data-formats.md +++ b/docs/ingestion/data-formats.md @@ -329,7 +329,7 @@ and [Kinesis indexing service](../development/extensions-core/kinesis-ingestion. Consider using the [input format](#input-format) instead for these types of ingestion. This section lists all default and core extension parsers. -For community extension parsers, please see our [community extensions list](../development/extensions.html#community-extensions). +For community extension parsers, please see our [community extensions list](../development/extensions.md#community-extensions). ### String Parser @@ -423,7 +423,7 @@ the set of ingested dimensions, if missing the discovered fields will make up th `timeAndDims` parse spec must specify which fields will be extracted as dimensions through the `dimensionSpec`. -[All column types](https://orc.apache.org/docs/types.html) are supported, with the exception of `union` types. Columns of +[All column types](https://orc.apache.org/docs/types.html) are supported, with the exception of `union` types. Columns of `list` type, if filled with primitives, may be used as a multi-value dimension, or specific elements can be extracted with `flattenSpec` expressions.
Likewise, primitive fields may be extracted from `map` and `struct` types in the same manner. Auto field discovery will automatically create a string dimension for every (non-timestamp) primitive or `list` of @@ -658,7 +658,7 @@ JSON path expressions for all supported types. When the time dimension is a [DateType column](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md), a format should not be supplied. When the format is UTF8 (String), either `auto` or a explicitly defined -[format](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html) is required. +[format](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html) is required. #### Parquet Hadoop Parser vs Parquet Avro Hadoop Parser @@ -808,7 +808,7 @@ Note that the `int96` Parquet value type is not supported with this parser. When the time dimension is a [DateType column](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md), a format should not be supplied. When the format is UTF8 (String), either `auto` or -an explicitly defined [format](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html) is required. +an explicitly defined [format](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html) is required. #### Example diff --git a/docs/ingestion/data-management.md b/docs/ingestion/data-management.md index f4c17926e44d..2d04fb2d856b 100644 --- a/docs/ingestion/data-management.md +++ b/docs/ingestion/data-management.md @@ -139,7 +139,7 @@ An example of compaction task is This compaction task reads _all segments_ of the interval `2017-01-01/2018-01-01` and results in new segments. Since `segmentGranularity` is null, the original segment granularity will be remained and not changed after compaction.
-To control the number of result segments per time chunk, you can set [maxRowsPerSegment](../configuration/index.html#compaction-dynamic-configuration) or [numShards](../ingestion/native-batch.md#tuningconfig). +To control the number of result segments per time chunk, you can set [maxRowsPerSegment](../configuration/index.md#compaction-dynamic-configuration) or [numShards](../ingestion/native-batch.md#tuningconfig). Please note that you can run multiple compactionTasks at the same time. For example, you can run 12 compactionTasks per month instead of running a single task for the entire year. A compaction task internally generates an `index` task spec for performing compaction work with some fixed parameters. @@ -159,8 +159,8 @@ In this case, the dimensions of recent segments precede that of old segments in This is because more recent segments are more likely to have the new desired order and data types. If you want to use your own ordering and types, you can specify a custom `dimensionsSpec` in the compaction task spec. - Roll-up: the output segment is rolled up only when `rollup` is set for all input segments. -See [Roll-up](../ingestion/index.html#rollup) for more details. -You can check that your segments are rolled up or not by using [Segment Metadata Queries](../querying/segmentmetadataquery.html#analysistypes). +See [Roll-up](../ingestion/index.md#rollup) for more details. +You can check that your segments are rolled up or not by using [Segment Metadata Queries](../querying/segmentmetadataquery.md#analysistypes). ### Compaction IOConfig @@ -243,11 +243,11 @@ scenarios dealing with more than 1GB of data. ## Deleting data Druid supports permanent deletion of segments that are in an "unused" state (see the -[Segment lifecycle](../design/architecture.html#segment-lifecycle) section of the Architecture page). +[Segment lifecycle](../design/architecture.md#segment-lifecycle) section of the Architecture page). 
The Kill Task deletes unused segments within a specified interval from metadata storage and deep storage. -For more information, please see [Kill Task](../ingestion/tasks.html#kill). +For more information, please see [Kill Task](../ingestion/tasks.md#kill). Permanent deletion of a segment in Apache Druid has two steps: @@ -257,7 +257,7 @@ Permanent deletion of a segment in Apache Druid has two steps: For documentation on retention rules, please see [Data Retention](../operations/rule-configuration.md). For documentation on disabling segments using the Coordinator API, please see the -[Coordinator Datasources API](../operations/api-reference.html#coordinator-datasources) reference. +[Coordinator Datasources API](../operations/api-reference.md#coordinator-datasources) reference. A data deletion tutorial is available at [Tutorial: Deleting data](../tutorials/tutorial-delete-data.md) diff --git a/docs/ingestion/hadoop.md b/docs/ingestion/hadoop.md index 088bdceee76d..2e81bc04e615 100644 --- a/docs/ingestion/hadoop.md +++ b/docs/ingestion/hadoop.md @@ -28,7 +28,7 @@ instance of a Druid [Overlord](../design/overlord.md). Please refer to our [Hado comparisons between Hadoop-based, native batch (simple), and native batch (parallel) ingestion. To run a Hadoop-based ingestion task, write an ingestion spec as specified below. Then POST it to the -[`/druid/indexer/v1/task`](../operations/api-reference.html#tasks) endpoint on the Overlord, or use the +[`/druid/indexer/v1/task`](../operations/api-reference.md#tasks) endpoint on the Overlord, or use the `bin/post-index-task` script included with Druid. ## Tutorial @@ -111,7 +111,7 @@ A sample task is shown below: |hadoopDependencyCoordinates|A JSON array of Hadoop dependency coordinates that Druid will use, this property will override the default Hadoop coordinates. 
Once specified, Druid will look for those Hadoop dependencies from the location specified by `druid.extensions.hadoopDependenciesDir`|no| |classpathPrefix|Classpath that will be prepended for the Peon process.|no| -Also note that Druid automatically computes the classpath for Hadoop job containers that run in the Hadoop cluster. But in case of conflicts between Hadoop and Druid's dependencies, you can manually specify the classpath by setting `druid.extensions.hadoopContainerDruidClasspath` property. See the extensions config in [base druid configuration](../configuration/index.html#extensions). +Also note that Druid automatically computes the classpath for Hadoop job containers that run in the Hadoop cluster. But in case of conflicts between Hadoop and Druid's dependencies, you can manually specify the classpath by setting `druid.extensions.hadoopContainerDruidClasspath` property. See the extensions config in [base druid configuration](../configuration/index.md#extensions). ## `dataSchema` @@ -150,7 +150,7 @@ For example, using the static input paths: You can also read from cloud storage such as AWS S3 or Google Cloud Storage. To do so, you need to install the necessary library under Druid's classpath in _all MiddleManager or Indexer processes_. -For S3, you can run the below command to install the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html). +For S3, you can run the below command to install the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/). ```bash java -classpath "${DRUID_HOME}lib/*" org.apache.druid.cli.Main tools pull-deps -h "org.apache.hadoop:hadoop-aws:${HADOOP_VERSION}"; @@ -159,7 +159,7 @@ cp ${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${H Once you install the Hadoop AWS module in all MiddleManager and Indexer processes, you can put your S3 paths in the inputSpec with the below job properties. 
-For more configurations, see the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html). +For more configurations, see the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/). ``` "paths" : "s3a://billy-bucket/the/data/is/here/data.gz,s3a://billy-bucket/the/data/is/here/moredata.gz,s3a://billy-bucket/the/data/is/here/evenmoredata.gz" ``` @@ -179,7 +179,7 @@ under `${DRUID_HOME}/hadoop-dependencies` in _all MiddleManager or Indexer proce Once you install the GCS Connector jar in all MiddleManager and Indexer processes, you can put your Google Cloud Storage paths in the inputSpec with the below job properties. For more configurations, see the [instructions to configure Hadoop](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md#configure-hadoop), -[GCS core default](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/conf/gcs-core-default.xml) +[GCS core default](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/v2.0.0/gcs/conf/gcs-core-default.xml) and [GCS core template](https://github.com/GoogleCloudPlatform/bdutil/blob/master/conf/hadoop2/gcs-core-template.xml). ``` @@ -438,7 +438,7 @@ If you are having dependency problems with your version of Hadoop and the versio If your cluster is running on Amazon Web Services, you can use Elastic MapReduce (EMR) to index data from S3. To do this: -- Create a persistent, [long-running cluster](http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-longrunning-transient.html). +- Create a persistent, [long-running cluster](http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-longrunning-transient.html). - When creating your cluster, enter the following configuration.
If you're using the wizard, this should be in advanced mode under "Edit software settings": diff --git a/docs/ingestion/index.md b/docs/ingestion/index.md index 0925675f9200..61ad679007b9 100644 --- a/docs/ingestion/index.md +++ b/docs/ingestion/index.md @@ -81,7 +81,7 @@ use the cluster resource of the existing cluster for batch ingestion. This table compares the three available options: -| **Method** | [Native batch (parallel)](native-batch.html#parallel-task) | [Hadoop-based](hadoop.html) | [Native batch (simple)](native-batch.html#simple-task) | +| **Method** | [Native batch (parallel)](native-batch.md#parallel-task) | [Hadoop-based](hadoop.md) | [Native batch (simple)](native-batch.md#simple-task) | |---|-----|--------------|------------| | **Task type** | `index_parallel` | `index_hadoop` | `index` | | **Parallel?** | Yes, if `inputFormat` is splittable and `maxNumConcurrentSubTasks` > 1 in `tuningConfig`. See [data format documentation](./data-formats.md) for details. | Yes, always. | No. Each task is single-threaded. | @@ -106,7 +106,7 @@ offers a unique data modeling system that bears similarity to both relational an Druid schemas must always include a primary timestamp. The primary timestamp is used for [partitioning and sorting](#partitioning) your data. Druid queries are able to rapidly identify and retrieve data corresponding to time ranges of the primary timestamp column. Druid is also able to use the primary timestamp column -for time-based [data management operations](data-management.html) such as dropping time chunks, overwriting time chunks, +for time-based [data management operations](data-management.md) such as dropping time chunks, overwriting time chunks, and time-based retention rules. The primary timestamp is parsed based on the [`timestampSpec`](#timestampspec). 
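A minimal `timestampSpec` sketch for the primary timestamp discussed above (the column name and format are illustrative):

```json
"timestampSpec": {
  "column": "timestamp",
  "format": "iso"
}
```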
In addition, the @@ -186,7 +186,7 @@ Tips for maximizing rollup: - Generally, the fewer dimensions you have, and the lower the cardinality of your dimensions, the better rollup ratios you will achieve. -- Use [sketches](schema-design.html#sketches) to avoid storing high cardinality dimensions, which harm rollup ratios. +- Use [sketches](schema-design.md#sketches) to avoid storing high cardinality dimensions, which harm rollup ratios. - Adjusting `queryGranularity` at ingestion time (for example, using `PT5M` instead of `PT1M`) increases the likelihood of two rows in Druid having matching timestamps, and can improve your rollup ratios. - It can be beneficial to load the same data into more than one Druid datasource. Some users choose to create a "full" @@ -218,8 +218,8 @@ The following table shows how each method handles rollup: |Method|How it works| |------|------------| -|[Native batch](native-batch.html)|`index_parallel` and `index` type may be either perfect or best-effort, based on configuration.| -|[Hadoop](hadoop.html)|Always perfect.| +|[Native batch](native-batch.md)|`index_parallel` and `index` type may be either perfect or best-effort, based on configuration.| +|[Hadoop](hadoop.md)|Always perfect.| |[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Always best-effort.| |[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Always best-effort.| @@ -258,7 +258,7 @@ storage size decreases - and it also tends to improve query performance as well. Not all ingestion methods support an explicit partitioning configuration, and not all have equivalent levels of flexibility. 
As of current Druid versions, If you are doing initial ingestion through a less-flexible method (like -Kafka) then you can use [reindexing techniques](data-management.html#compaction-and-reindexing) to repartition your data after it +Kafka) then you can use [reindexing techniques](data-management.md#compaction-and-reindexing) to repartition your data after it is initially ingested. This is a powerful technique: you can use it to ensure that any data older than a certain threshold is optimally partitioned, even as you continuously add new data from a stream. @@ -266,10 +266,10 @@ The following table shows how each ingestion method handles partitioning: |Method|How it works| |------|------------| -|[Native batch](native-batch.html)|Configured using [`partitionsSpec`](native-batch.html#partitionsspec) inside the `tuningConfig`.| -|[Hadoop](hadoop.html)|Configured using [`partitionsSpec`](hadoop.html#partitionsspec) inside the `tuningConfig`.| -|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Partitioning in Druid is guided by how your Kafka topic is partitioned. You can also [reindex](data-management.html#compaction-and-reindexing) to repartition after initial ingestion.| -|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Partitioning in Druid is guided by how your Kinesis stream is sharded. You can also [reindex](data-management.html#compaction-and-reindexing) to repartition after initial ingestion.| +|[Native batch](native-batch.md)|Configured using [`partitionsSpec`](native-batch.md#partitionsspec) inside the `tuningConfig`.| +|[Hadoop](hadoop.md)|Configured using [`partitionsSpec`](hadoop.md#partitionsspec) inside the `tuningConfig`.| +|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Partitioning in Druid is guided by how your Kafka topic is partitioned. 
You can also [reindex](data-management.md#compaction-and-reindexing) to repartition after initial ingestion.| +|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Partitioning in Druid is guided by how your Kinesis stream is sharded. You can also [reindex](data-management.md#compaction-and-reindexing) to repartition after initial ingestion.| > Note that, of course, one way to partition data is to load it into separate datasources. This is a perfectly viable > approach and works very well when the number of datasources does not lead to excessive per-datasource overheads. If @@ -283,7 +283,7 @@ The following table shows how each ingestion method handles partitioning: ## Ingestion specs -No matter what ingestion method you use, data is loaded into Druid using either one-time [tasks](tasks.html) or +No matter what ingestion method you use, data is loaded into Druid using either one-time [tasks](tasks.md) or ongoing "supervisors" (which run and supervise a set of tasks over time). In any case, part of the task or supervisor definition is an _ingestion spec_. @@ -359,7 +359,7 @@ You can also load data visually, without the need to write an ingestion spec, us available in Druid's [web console](../operations/druid-console.md). Druid's visual data loader supports [Kafka](../development/extensions-core/kafka-ingestion.md), [Kinesis](../development/extensions-core/kinesis-ingestion.md), and -[native batch](native-batch.html) mode. +[native batch](native-batch.md) mode. ## `dataSchema` @@ -406,7 +406,7 @@ An example `dataSchema` is: ### `dataSource` The `dataSource` is located in `dataSchema` → `dataSource` and is simply the name of the -[datasource](../design/architecture.html#datasources-and-segments) that data will be written to. An example +[datasource](../design/architecture.md#datasources-and-segments) that data will be written to. 
An example `dataSource` is: ``` @@ -526,7 +526,7 @@ An example `metricsSpec` is: The `granularitySpec` is located in `dataSchema` → `granularitySpec` and is responsible for configuring the following operations: -1. Partitioning a datasource into [time chunks](../design/architecture.html#datasources-and-segments) (via `segmentGranularity`). +1. Partitioning a datasource into [time chunks](../design/architecture.md#datasources-and-segments) (via `segmentGranularity`). 2. Truncating the timestamp, if desired (via `queryGranularity`). 3. Specifying which time chunks of segments should be created, for batch ingestion (via `intervals`). 4. Specifying whether ingestion-time [rollup](#rollup) should be used or not (via `rollup`). @@ -551,7 +551,7 @@ A `granularitySpec` can have the following components: | Field | Description | Default | |-------|-------------|---------| | type | Either `uniform` or `arbitrary`. In most cases you want to use `uniform`.| `uniform` | -| segmentGranularity | [Time chunking](../design/architecture.html#datasources-and-segments) granularity for this datasource. Multiple segments can be created per time chunk. For example, when set to `day`, the events of the same day fall into the same time chunk which can be optionally further partitioned into multiple segments based on other configurations and input size. Any [granularity](../querying/granularities.md) can be provided here. Note that all segments in the same time chunk should have the same segment granularity.

Ignored if `type` is set to `arbitrary`.| `day` | +| segmentGranularity | [Time chunking](../design/architecture.md#datasources-and-segments) granularity for this datasource. Multiple segments can be created per time chunk. For example, when set to `day`, the events of the same day fall into the same time chunk which can be optionally further partitioned into multiple segments based on other configurations and input size. Any [granularity](../querying/granularities.md) can be provided here. Note that all segments in the same time chunk should have the same segment granularity.

Ignored if `type` is set to `arbitrary`.| `day` | | queryGranularity | The resolution of timestamp storage within each segment. This must be equal to, or finer, than `segmentGranularity`. This will be the finest granularity that you can query at and still receive sensible results, but note that you can still query at anything coarser than this granularity. E.g., a value of `minute` will mean that records will be stored at minutely granularity, and can be sensibly queried at any multiple of minutes (including minutely, 5-minutely, hourly, etc).

Any [granularity](../querying/granularities.md) can be provided here. Use `none` to store timestamps as-is, without any truncation. Note that `rollup` will be applied if it is set even when the `queryGranularity` is set to `none`. | `none` | | rollup | Whether to use ingestion-time [rollup](#rollup) or not. Note that rollup is still effective even when `queryGranularity` is set to `none`. Your data will be rolled up if they have the exactly same timestamp. | `true` | | intervals | A list of intervals describing what time chunks of segments should be created. If `type` is set to `uniform`, this list will be broken up and rounded-off based on the `segmentGranularity`. If `type` is set to `arbitrary`, this list will be used as-is.

If `null` or not provided, batch ingestion tasks will generally determine which time chunks to output based on what timestamps are found in the input data.

If specified, batch ingestion tasks may be able to skip a determining-partitions phase, which can result in faster ingestion. Batch ingestion tasks may also be able to request all their locks up-front instead of one by one. Batch ingestion tasks will throw away any records with timestamps outside of the specified intervals.

Ignored for any form of streaming ingestion. | `null` | diff --git a/docs/ingestion/native-batch.md b/docs/ingestion/native-batch.md index e1d29304037c..409c9968ad0a 100644 --- a/docs/ingestion/native-batch.md +++ b/docs/ingestion/native-batch.md @@ -1264,7 +1264,7 @@ Sample spec: |property|description|required?| |--------|-----------|---------| |type|This should be "local".|yes| -|filter|A wildcard filter for files. See [here](http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/filefilter/WildcardFileFilter.html) for more information.|yes if `baseDir` is specified| +|filter|A wildcard filter for files. See [here](http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/filefilter/WildcardFileFilter.html) for more information.|yes if `baseDir` is specified| |baseDir|Directory to search recursively for files to be ingested. Empty files under the `baseDir` will be skipped.|At least one of `baseDir` or `files` should be specified| |files|File paths to ingest. Some files can be ignored to avoid ingesting duplicate files if they are located under the specified `baseDir`. Empty files will be skipped.|At least one of `baseDir` or `files` should be specified| @@ -1569,7 +1569,7 @@ A sample local Firehose spec is shown below: |property|description|required?| |--------|-----------|---------| |type|This should be "local".|yes| -|filter|A wildcard filter for files. See [here](http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/filefilter/WildcardFileFilter.html) for more information.|yes| +|filter|A wildcard filter for files. See [here](http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/filefilter/WildcardFileFilter.html) for more information.|yes| |baseDir|directory to search recursively for files to be ingested.
|yes| diff --git a/docs/ingestion/schema-design.md b/docs/ingestion/schema-design.md index 7ecc95a81563..a1a68e49bf6b 100644 --- a/docs/ingestion/schema-design.md +++ b/docs/ingestion/schema-design.md @@ -89,7 +89,7 @@ it is a natural choice for storing timeseries data. Its flexible data model allo non-timeseries data, even in the same datasource. To achieve best-case compression and query performance in Druid for timeseries data, it is important to partition and -sort by metric name, like timeseries databases often do. See [Partitioning and sorting](index.html#partitioning) for more details. +sort by metric name, like timeseries databases often do. See [Partitioning and sorting](index.md#partitioning) for more details. Tips for modeling timeseries data in Druid: @@ -98,12 +98,12 @@ for ingestion and aggregation. - Create a dimension that indicates the name of the series that a data point belongs to. This dimension is often called "metric" or "name". Do not get the dimension named "metric" confused with the concept of Druid metrics. Place this first in the list of dimensions in your "dimensionsSpec" for best performance (this helps because it improves locality; -see [partitioning and sorting](index.html#partitioning) below for details). +see [partitioning and sorting](index.md#partitioning) below for details). - Create other dimensions for attributes attached to your data points. These are often called "tags" in timeseries database systems. - Create [metrics](../querying/aggregations.md) corresponding to the types of aggregations that you want to be able to query. Typically this includes "sum", "min", and "max" (in one of the long, float, or double flavors). If you want to -be able to compute percentiles or quantiles, use Druid's [approximate aggregators](../querying/aggregations.html#approx). +be able to compute percentiles or quantiles, use Druid's [approximate aggregators](../querying/aggregations.md#approx). 
- Consider enabling [rollup](#rollup), which will allow Druid to potentially combine multiple points into one row in your Druid datasource. This can be useful if you want to store data at a different time granularity than it is naturally emitted. It is also useful if you want to combine timeseries and non-timeseries data in the same datasource. @@ -167,7 +167,7 @@ so they can be sorted and the quantile can be computed, Druid instead only needs can reduce data transfer needs to mere kilobytes. For details about the sketches available in Druid, see the -[approximate aggregators](../querying/aggregations.html#approx) page. +[approximate aggregators](../querying/aggregations.md#approx) page. If you prefer videos, take a look at [Not exactly!](https://www.youtube.com/watch?v=Hpd3f_MLdXo), a conference talk about sketches in Druid. @@ -187,7 +187,7 @@ For details about how to configure numeric dimensions, see the [`dimensionsSpec` ### Secondary timestamps Druid schemas must always include a primary timestamp. The primary timestamp is used for -[partitioning and sorting](index.html#partitioning) your data, so it should be the timestamp that you will most often filter on. +[partitioning and sorting](index.md#partitioning) your data, so it should be the timestamp that you will most often filter on. Druid is able to rapidly identify and retrieve data corresponding to time ranges of the primary timestamp column. If your data has more than one timestamp, you can ingest the others as secondary timestamps. The best way to do this @@ -195,7 +195,7 @@ is to ingest them as [long-typed dimensions](index.md#dimensionsspec) in millise If necessary, you can get them into this format using a [`transformSpec`](index.md#transformspec) and [expressions](../misc/math-expr.md) like `timestamp_parse`, which returns millisecond timestamps. 
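For example, a `transformSpec` along these lines could derive a millisecond-valued secondary timestamp at ingestion time (the input field name `updated_at` and output name `updatedAtMillis` are hypothetical):

```json
"transformSpec": {
  "transforms": [
    {
      "type": "expression",
      "name": "updatedAtMillis",
      "expression": "timestamp_parse(\"updated_at\")"
    }
  ]
}
```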
-At query time, you can query secondary timestamps with [SQL time functions](../querying/sql.html#time-functions) +At query time, you can query secondary timestamps with [SQL time functions](../querying/sql.md#time-functions) like `MILLIS_TO_TIMESTAMP`, `TIME_FLOOR`, and others. If you're using native Druid queries, you can use [expressions](../misc/math-expr.md). diff --git a/docs/ingestion/standalone-realtime.md b/docs/ingestion/standalone-realtime.md index 62f31259964b..b8ba92a8211c 100644 --- a/docs/ingestion/standalone-realtime.md +++ b/docs/ingestion/standalone-realtime.md @@ -24,7 +24,7 @@ title: "Realtime Process" Older versions of Apache Druid supported a standalone 'Realtime' process to query and index 'stream pull' modes of real-time ingestion. These processes would periodically build segments for the data they had collected over -some span of time and then set up hand-off to [Historical](../design/historical.html) servers. +some span of time and then set up hand-off to [Historical](../design/historical.md) servers. This processes could be invoked by @@ -40,5 +40,5 @@ suffered from limitations which made it not possible to achieve exactly once ing The extensions `druid-kafka-eight`, `druid-kafka-eight-simpleConsumer`, `druid-rabbitmq`, and `druid-rocketmq` were also removed at this time, since they were built to operate on the realtime nodes. -Please consider using the [Kafka Indexing Service](../development/extensions-core/kafka-ingestion.html) or +Please consider using the [Kafka Indexing Service](../development/extensions-core/kafka-ingestion.md) or [Kinesis Indexing Service](../development/extensions-core/kinesis-ingestion.md) for stream pull ingestion instead. 
diff --git a/docs/ingestion/tasks.md b/docs/ingestion/tasks.md index a03162b19314..4fc21d37ca79 100644 --- a/docs/ingestion/tasks.md +++ b/docs/ingestion/tasks.md @@ -250,7 +250,7 @@ Note that the overshadow relation holds only for the same time chunk and the sam These overshadowed segments are not considered in query processing to filter out stale data. Each segment has a _major_ version and a _minor_ version. The major version is -represented as a timestamp in the format of [`"yyyy-MM-dd'T'hh:mm:ss"`](https://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html) +represented as a timestamp in the format of [`"yyyy-MM-dd'T'hh:mm:ss"`](https://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html) while the minor version is an integer number. These major and minor versions are used to determine the overshadow relation between segments as seen below. @@ -268,7 +268,7 @@ Here are some examples. ## Locking -If you are running two or more [druid tasks](./tasks.html) which generate segments for the same data source and the same time chunk, +If you are running two or more [druid tasks](./tasks.md) which generate segments for the same data source and the same time chunk, the generated segments could potentially overshadow each other, which could lead to incorrect query results. To avoid this problem, tasks will attempt to get locks prior to creating any segment in Druid. @@ -297,7 +297,7 @@ Also, the segment locking is supported by only native indexing tasks and Kafka/K Hadoop indexing tasks and `index_realtime` tasks (used by [Tranquility](tranquility.md)) don't support it yet. `forceTimeChunkLock` in the task context is only applied to individual tasks. -If you want to unset it for all tasks, you would want to set `druid.indexer.tasklock.forceTimeChunkLock` to false in the [overlord configuration](../configuration/index.html#overlord-operations).
+If you want to unset it for all tasks, you would want to set `druid.indexer.tasklock.forceTimeChunkLock` to false in the [overlord configuration](../configuration/index.md#overlord-operations). Lock requests can conflict with each other if two or more tasks try to get locks for the overlapped time chunks of the same data source. Note that the lock conflict can happen between different locks types. @@ -348,7 +348,7 @@ The task context is used for various individual task configuration. The followin |property|default|description| |--------|-------|-----------| |taskLockTimeout|300000|task lock timeout in millisecond. For more details, see [Locking](#locking).| -|forceTimeChunkLock|true|_Setting this to false is still experimental_
Force to always use time chunk lock. If not set, each task automatically chooses a lock type to use. If this set, it will overwrite the `druid.indexer.tasklock.forceTimeChunkLock` [configuration for the overlord](../configuration/index.html#overlord-operations). See [Locking](#locking) for more details.| +|forceTimeChunkLock|true|_Setting this to false is still experimental_
Force to always use time chunk lock. If not set, each task automatically chooses a lock type to use. If this is set, it will overwrite the `druid.indexer.tasklock.forceTimeChunkLock` [configuration for the overlord](../configuration/index.md#overlord-operations). See [Locking](#locking) for more details.| |priority|Different based on task types. See [Priority](#priority).|Task priority| > When a task acquires a lock, it sends a request via HTTP and awaits until it receives a response containing the lock acquisition result. diff --git a/docs/misc/math-expr.md b/docs/misc/math-expr.md index d3ef3b727816..916d9cb827f4 100644 --- a/docs/misc/math-expr.md +++ b/docs/misc/math-expr.md @@ -50,7 +50,7 @@ Expressions can contain variables. Variable names may contain letters, digits, ' For logical operators, a number is true if and only if it is positive (0 or negative value means false). For string type, it's the evaluation result of 'Boolean.valueOf(string)'. -[Multi-value string dimensions](../querying/multi-value-dimensions.html) are supported and may be treated as either scalar or array typed values. When treated as a scalar type, an expression will automatically be transformed to apply the scalar operation across all values of the multi-valued type, to mimic Druid's native behavior. Values that result in arrays will be coerced back into the native Druid string type for aggregation. Druid aggregations on multi-value string dimensions on the individual values, _not_ the 'array', behaving similar to the `UNNEST` operator available in many SQL dialects. However, by using the `array_to_string` function, aggregations may be done on a stringified version of the complete array, allowing the complete row to be preserved. Using `string_to_array` in an expression post-aggregator, allows transforming the stringified dimension back into the true native array type.
+[Multi-value string dimensions](../querying/multi-value-dimensions.md) are supported and may be treated as either scalar or array typed values. When treated as a scalar type, an expression will automatically be transformed to apply the scalar operation across all values of the multi-valued type, to mimic Druid's native behavior. Values that result in arrays will be coerced back into the native Druid string type for aggregation. Druid aggregations on multi-value string dimensions operate on the individual values, _not_ the 'array', behaving similarly to the `UNNEST` operator available in many SQL dialects. However, by using the `array_to_string` function, aggregations may be done on a stringified version of the complete array, allowing the complete row to be preserved. Using `string_to_array` in an expression post-aggregator allows transforming the stringified dimension back into the true native array type. The following built-in functions are available. @@ -72,7 +72,7 @@ The following built-in functions are available. |name|description| |----|-----------| |concat|concat(expr, expr...) concatenate a list of strings| -|format|format(pattern[, args...]) returns a string formatted in the manner of Java's [String.format](https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#format-java.lang.String-java.lang.Object...-).| +|format|format(pattern[, args...]) returns a string formatted in the manner of Java's [String.format](https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#format-java.lang.String-java.lang.Object...-).| |like|like(expr, pattern[, escape]) is equivalent to SQL `expr LIKE pattern`| |lookup|lookup(expr, lookup-name) looks up expr in a registered [query-time lookup](../querying/lookups.md)| |parse_long|parse_long(string[, radix]) parses a string as a long with the given radix, or 10 (decimal) if a radix is not provided.| @@ -106,8 +106,8 @@ The following built-in functions are available.
|timestamp_floor|timestamp_floor(expr, period, \[origin, [timezone\]\]) rounds down a timestamp, returning it as a new timestamp. Period can be any ISO8601 period, like P3M (quarters) or PT12H (half-days). The time zone, if provided, should be a time zone name like "America/Los_Angeles" or offset like "-08:00".| |timestamp_shift|timestamp_shift(expr, period, step, \[timezone\]) shifts a timestamp by a period (step times), returning it as a new timestamp. Period can be any ISO8601 period. Step may be negative. The time zone, if provided, should be a time zone name like "America/Los_Angeles" or offset like "-08:00".| |timestamp_extract|timestamp_extract(expr, unit, \[timezone\]) extracts a time part from expr, returning it as a number. Unit can be EPOCH (number of seconds since 1970-01-01 00:00:00 UTC), SECOND, MINUTE, HOUR, DAY (day of month), DOW (day of week), DOY (day of year), WEEK (week of [week year](https://en.wikipedia.org/wiki/ISO_week_date)), MONTH (1 through 12), QUARTER (1 through 4), or YEAR. The time zone, if provided, should be a time zone name like "America/Los_Angeles" or offset like "-08:00"| -|timestamp_parse|timestamp_parse(string expr, \[pattern, [timezone\]\]) parses a string into a timestamp using a given [Joda DateTimeFormat pattern](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html). If the pattern is not provided, this parses time strings in either ISO8601 or SQL format. The time zone, if provided, should be a time zone name like "America/Los_Angeles" or offset like "-08:00", and will be used as the time zone for strings that do not include a time zone offset. Pattern and time zone must be literals. 
Strings that cannot be parsed as timestamps will be returned as nulls.| -|timestamp_format|timestamp_format(expr, \[pattern, \[timezone\]\]) formats a timestamp as a string with a given [Joda DateTimeFormat pattern](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html), or ISO8601 if the pattern is not provided. The time zone, if provided, should be a time zone name like "America/Los_Angeles" or offset like "-08:00". Pattern and time zone must be literals.| +|timestamp_parse|timestamp_parse(string expr, \[pattern, [timezone\]\]) parses a string into a timestamp using a given [Joda DateTimeFormat pattern](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html). If the pattern is not provided, this parses time strings in either ISO8601 or SQL format. The time zone, if provided, should be a time zone name like "America/Los_Angeles" or offset like "-08:00", and will be used as the time zone for strings that do not include a time zone offset. Pattern and time zone must be literals. Strings that cannot be parsed as timestamps will be returned as nulls.| +|timestamp_format|timestamp_format(expr, \[pattern, \[timezone\]\]) formats a timestamp as a string with a given [Joda DateTimeFormat pattern](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html), or ISO8601 if the pattern is not provided. The time zone, if provided, should be a time zone name like "America/Los_Angeles" or offset like "-08:00".
Pattern and time zone must be literals.| ## Math functions diff --git a/docs/operations/api-reference.md b/docs/operations/api-reference.md index 5c6351fc84a4..49ca83d42bbd 100644 --- a/docs/operations/api-reference.md +++ b/docs/operations/api-reference.md @@ -67,7 +67,7 @@ monitoring checks such as AWS load balancer health checks are not able to look a ## Master Server This section documents the API endpoints for the processes that reside on Master servers (Coordinators and Overlords) -in the suggested [three-server configuration](../design/processes.html#server-types). +in the suggested [three-server configuration](../design/processes.md#server-types). ### Coordinator @@ -461,7 +461,7 @@ will be set for them. * `/druid/coordinator/v1/config/compaction` Creates or updates the compaction config for a dataSource. -See [Compaction Configuration](../configuration/index.html#compaction-dynamic-configuration) for configuration details. +See [Compaction Configuration](../configuration/index.md#compaction-dynamic-configuration) for configuration details. ##### DELETE @@ -584,7 +584,7 @@ Retrieve list of task status objects for list of task id strings in request body Manually clean up pending segments table in metadata storage for `datasource`. Returns a JSON object response with `numDeleted` and count of rows deleted from the pending segments table. This API is used by the -`druid.coordinator.kill.pendingSegments.on` [coordinator setting](../configuration/index.html#coordinator-operation) +`druid.coordinator.kill.pendingSegments.on` [coordinator setting](../configuration/index.md#coordinator-operation) which automates this operation to perform periodically. #### Supervisors @@ -602,8 +602,8 @@ Returns a list of objects of the currently active supervisors. |Field|Type|Description| |---|---|---| |`id`|String|supervisor unique identifier| -|`state`|String|basic state of the supervisor. 
Available states:`UNHEALTHY_SUPERVISOR`, `UNHEALTHY_TASKS`, `PENDING`, `RUNNING`, `SUSPENDED`, `STOPPING`. Check [Kafka Docs](../development/extensions-core/kafka-ingestion.html#operations) for details.| -|`detailedState`|String|supervisor specific state. (See documentation of specific supervisor for details), e.g. [Kafka](../development/extensions-core/kafka-ingestion.html) or [Kinesis](../development/extensions-core/kinesis-ingestion.html))| +|`state`|String|basic state of the supervisor. Available states: `UNHEALTHY_SUPERVISOR`, `UNHEALTHY_TASKS`, `PENDING`, `RUNNING`, `SUSPENDED`, `STOPPING`. Check [Kafka Docs](../development/extensions-core/kafka-ingestion.md#operations) for details.| +|`detailedState`|String|supervisor specific state. (See documentation of the specific supervisor for details, e.g. [Kafka](../development/extensions-core/kafka-ingestion.md) or [Kinesis](../development/extensions-core/kinesis-ingestion.md))| |`healthy`|Boolean|true or false indicator of overall supervisor health| |`spec`|SupervisorSpec|json specification of supervisor (See Supervisor Configuration for details)| @@ -614,8 +614,8 @@ Returns a list of objects of the currently active supervisors and their current |Field|Type|Description| |---|---|---| |`id`|String|supervisor unique identifier| -|`state`|String|basic state of the supervisor. Available states: `UNHEALTHY_SUPERVISOR`, `UNHEALTHY_TASKS`, `PENDING`, `RUNNING`, `SUSPENDED`, `STOPPING`.
Check [Kafka Docs](../development/extensions-core/kafka-ingestion.md#operations) for details.| +|`detailedState`|String|supervisor specific state. (See documentation of the specific supervisor for details, e.g. [Kafka](../development/extensions-core/kafka-ingestion.md) or [Kinesis](../development/extensions-core/kinesis-ingestion.md))| |`healthy`|Boolean|true or false indicator of overall supervisor health| |`suspended`|Boolean|true or false indicator of whether the supervisor is in suspended state| @@ -678,7 +678,7 @@ Shutdown a supervisor. #### Dynamic configuration -See [Overlord Dynamic Configuration](../configuration/index.html#overlord-dynamic-configuration) for details. +See [Overlord Dynamic Configuration](../configuration/index.md#overlord-dynamic-configuration) for details. Note that all _interval_ URL parameters are ISO 8601 strings delimited by a `_` instead of a `/` (e.g., 2016-06-27_2016-06-28). @@ -707,7 +707,7 @@ Update overlord dynamic worker configuration. ## Data Server This section documents the API endpoints for the processes that reside on Data servers (MiddleManagers/Peons and Historicals) -in the suggested [three-server configuration](../design/processes.html#server-types). +in the suggested [three-server configuration](../design/processes.md#server-types). ### MiddleManager @@ -798,7 +798,7 @@ in the local cache have been loaded, and 503 SERVICE UNAVAILABLE, if they haven' ## Query Server -This section documents the API endpoints for the processes that reside on Query servers (Brokers) in the suggested [three-server configuration](../design/processes.html#server-types). +This section documents the API endpoints for the processes that reside on Query servers (Brokers) in the suggested [three-server configuration](../design/processes.md#server-types). 
### Broker diff --git a/docs/operations/basic-cluster-tuning.md b/docs/operations/basic-cluster-tuning.md index 5bfc00242cc3..c3413eca1b99 100644 --- a/docs/operations/basic-cluster-tuning.md +++ b/docs/operations/basic-cluster-tuning.md @@ -423,7 +423,7 @@ Additionally, for large JVM heaps, here are a few Garbage Collection efficiency ### Use UTC timezone -We recommend using UTC timezone for all your events and across your hosts, not just for Druid, but for all data infrastructure. This can greatly mitigate potential query problems with inconsistent timezones. To query in a non-UTC timezone see [query granularities](../querying/granularities.html#period-granularities) +We recommend using UTC timezone for all your events and across your hosts, not just for Druid, but for all data infrastructure. This can greatly mitigate potential query problems with inconsistent timezones. To query in a non-UTC timezone see [query granularities](../querying/granularities.md#period-granularities) ### System configuration diff --git a/docs/operations/deep-storage-migration.md b/docs/operations/deep-storage-migration.md index cd21e99f7b8b..733db8bd924b 100644 --- a/docs/operations/deep-storage-migration.md +++ b/docs/operations/deep-storage-migration.md @@ -44,14 +44,14 @@ When migrating from Derby, the coordinator processes will still need to be up in Before migrating, you will need to copy your old segments to the new deep storage. -For information on what path structure to use in the new deep storage, please see [deep storage migration options](../operations/export-metadata.html#deep-storage-migration). +For information on what path structure to use in the new deep storage, please see [deep storage migration options](../operations/export-metadata.md#deep-storage-migration). ## Export segments with rewritten load specs Druid provides an [Export Metadata Tool](../operations/export-metadata.md) for exporting metadata from Derby into CSV files which can then be reimported. 
-By setting [deep storage migration options](../operations/export-metadata.html#deep-storage-migration), the `export-metadata` tool will export CSV files where the segment load specs have been rewritten to load from your new deep storage location. +By setting [deep storage migration options](../operations/export-metadata.md#deep-storage-migration), the `export-metadata` tool will export CSV files where the segment load specs have been rewritten to load from your new deep storage location. Run the `export-metadata` tool on your existing cluster, using the migration options appropriate for your new deep storage location, and save the CSV files it generates. After a successful export, you can shut down the coordinator. @@ -59,7 +59,7 @@ Run the `export-metadata` tool on your existing cluster, using the migration opt After generating the CSV exports with the modified segment data, you can reimport the contents of the Druid segments table from the generated CSVs. -Please refer to [import commands](../operations/export-metadata.html#importing-metadata) for examples. Only the `druid_segments` table needs to be imported. +Please refer to [import commands](../operations/export-metadata.md#importing-metadata) for examples. Only the `druid_segments` table needs to be imported. ### Restart cluster diff --git a/docs/operations/druid-console.md b/docs/operations/druid-console.md index 2bda0e92250a..8d965c74be2a 100644 --- a/docs/operations/druid-console.md +++ b/docs/operations/druid-console.md @@ -28,7 +28,7 @@ The Druid Console is hosted by the [Router](../design/router.md) process. The following cluster settings must be enabled, as they are by default: -- the Router's [management proxy](../design/router.html#enabling-the-management-proxy) must be enabled. +- the Router's [management proxy](../design/router.md#enabling-the-management-proxy) must be enabled. - the Broker processes in the cluster must have [Druid SQL](../querying/sql.md) enabled. 
The Druid console can be accessed at:

@@ -47,9 +47,9 @@ Below is a description of the high-level features and functionality of the Druid

The home view provides a high level overview of the cluster. Each card is clickable and links to the appropriate view.

-The legacy menu allows you to go to the [legacy coordinator and overlord consoles](./management-uis.html#legacy-consoles) should you need them.
+The legacy menu allows you to go to the [legacy coordinator and overlord consoles](./management-uis.md#legacy-consoles) should you need them.

-![home-view](../assets/web-console-01-home-view.png)
+![home-view](../assets/web-console-01-home-view.png "home view")

## Data loader

diff --git a/docs/operations/high-availability.md b/docs/operations/high-availability.md
index 801d50c76014..240a44ca3639 100644
--- a/docs/operations/high-availability.md
+++ b/docs/operations/high-availability.md
@@ -27,10 +27,10 @@ Apache ZooKeeper, metadata store, the coordinator, the overlord, and brokers are

- For highly-available ZooKeeper, you will need a cluster of 3 or 5 ZooKeeper nodes. We recommend either installing ZooKeeper on its own hardware, or running 3 or 5 Master servers (where overlords or coordinators are running)
and configuring ZooKeeper on them appropriately. See the [ZooKeeper admin guide](https://zookeeper.apache.org/doc/current/zookeeperAdmin.html) for more details.
- For highly-available metadata storage, we recommend MySQL or PostgreSQL with replication and failover enabled. See [MySQL HA/Scalability Guide](https://dev.mysql.com/doc/mysql-ha-scalability/en/)
-and [PostgreSQL's High Availability, Load Balancing, and Replication](https://www.postgresql.org/docs/9.5/high-availability.html) for MySQL and PostgreSQL, respectively.
+and [PostgreSQL's High Availability, Load Balancing, and Replication](https://www.postgresql.org/docs/9.5/high-availability.html) for MySQL and PostgreSQL, respectively.

- For highly-available Apache Druid Coordinators and Overlords, we recommend to run multiple servers. If they are all configured to use the same ZooKeeper cluster
and metadata storage, then they will automatically failover between each other as necessary.

diff --git a/docs/operations/metadata-migration.md b/docs/operations/metadata-migration.md
index ef71641c49d7..8e037a649294 100644
--- a/docs/operations/metadata-migration.md
+++ b/docs/operations/metadata-migration.md
@@ -84,7 +84,7 @@ java -classpath "lib/*" -Dlog4j.configurationFile=conf/druid/cluster/_common/log

### Import metadata

-After initializing the tables, please refer to the [import commands](../operations/export-metadata.html#importing-metadata) for your target database.
+After initializing the tables, please refer to the [import commands](../operations/export-metadata.md#importing-metadata) for your target database.
### Restart cluster diff --git a/docs/operations/metrics.md b/docs/operations/metrics.md index 933f131a4900..9cf687763fa1 100644 --- a/docs/operations/metrics.md +++ b/docs/operations/metrics.md @@ -252,11 +252,11 @@ These metrics are for the Druid Coordinator and are reset each time the Coordina |`segment/skipCompact/count`|Total number of segments of this datasource that are skipped (not eligible for auto compaction) by the auto compaction.|datasource.|Varies.| |`interval/skipCompact/count`|Total number of intervals of this datasource that are skipped (not eligible for auto compaction) by the auto compaction.|datasource.|Varies.| -If `emitBalancingStats` is set to `true` in the Coordinator [dynamic configuration]( -../configuration/index.html#dynamic-configuration), then [log entries](../configuration/logging.md) for class +If `emitBalancingStats` is set to `true` in the Coordinator [dynamic configuration](../configuration/index.md#dynamic-configuration), then [log entries](../configuration/logging.md) for class `org.apache.druid.server.coordinator.duty.EmitClusterStatsAndMetrics` will have extra information on balancing decisions. + ## General Health ### Historical @@ -287,7 +287,7 @@ These metrics are only available if the JVMMonitor module is included. |`jvm/mem/used`|Used memory.|memKind.|< max memory| |`jvm/mem/committed`|Committed memory.|memKind.|close to max memory| |`jvm/gc/count`|Garbage collection count.|gcName (cms/g1/parallel/etc.), gcGen (old/young)|Varies.| -|`jvm/gc/cpu`|Count of CPU time in Nanoseconds spent on garbage collection. Note: `jvm/gc/cpu` represents the total time over multiple GC cycles; divide by `jvm/gc/count` to get the mean GC time per cycle|gcName, gcGen|Sum of `jvm/gc/cpu` should be within 10-30% of sum of `jvm/cpu/total`, depending on the GC algorithm used (reported by [`JvmCpuMonitor`](../configuration/index.html#enabling-metrics)) | +|`jvm/gc/cpu`|Count of CPU time in Nanoseconds spent on garbage collection. 
Note: `jvm/gc/cpu` represents the total time over multiple GC cycles; divide by `jvm/gc/count` to get the mean GC time per cycle|gcName, gcGen|Sum of `jvm/gc/cpu` should be within 10-30% of sum of `jvm/cpu/total`, depending on the GC algorithm used (reported by [`JvmCpuMonitor`](../configuration/index.md#enabling-metrics)) | ### EventReceiverFirehose diff --git a/docs/operations/other-hadoop.md b/docs/operations/other-hadoop.md index ce6f0ddd2209..f40b118f96a7 100644 --- a/docs/operations/other-hadoop.md +++ b/docs/operations/other-hadoop.md @@ -87,7 +87,7 @@ classloader. 1. HDFS deep storage uses jars from `extensions/druid-hdfs-storage/` to read and write Druid data on HDFS. 2. Batch ingestion uses jars from `hadoop-dependencies/` to submit Map/Reduce jobs (location customizable via the -`druid.extensions.hadoopDependenciesDir` runtime property; see [Configuration](../configuration/index.html#extensions)). +`druid.extensions.hadoopDependenciesDir` runtime property; see [Configuration](../configuration/index.md#extensions)). `hadoop-client:2.8.5` is the default version of the Hadoop client bundled with Druid for both purposes. This works with many Hadoop distributions (the version does not necessarily need to match), but if you run into issues, you can instead diff --git a/docs/operations/segment-optimization.md b/docs/operations/segment-optimization.md index 9c3b903b74b1..e0e909efb240 100644 --- a/docs/operations/segment-optimization.md +++ b/docs/operations/segment-optimization.md @@ -59,7 +59,7 @@ You may need to consider the followings to optimize your segments. > you may need to find the optimal settings for your workload. There might be several ways to check if the compaction is necessary. One way -is using the [System Schema](../querying/sql.html#system-schema). The +is using the [System Schema](../querying/sql.md#system-schema). The system schema provides several tables about the current system status including the `segments` table. 
By running the below query, you can get the average number of rows and average size for published segments. @@ -87,11 +87,11 @@ In this case, you may want to see only rows of the max version per interval (pai Once you find your segments need compaction, you can consider the below two options: - - Turning on the [automatic compaction of Coordinators](../design/coordinator.html#compacting-segments). + - Turning on the [automatic compaction of Coordinators](../design/coordinator.md#compacting-segments). The Coordinator periodically submits [compaction tasks](../ingestion/tasks.md#compact) to re-index small segments. To enable the automatic compaction, you need to configure it for each dataSource via Coordinator's dynamic configuration. - See [Compaction Configuration API](../operations/api-reference.html#compaction-configuration) - and [Compaction Configuration](../configuration/index.html#compaction-dynamic-configuration) for details. + See [Compaction Configuration API](../operations/api-reference.md#compaction-configuration) + and [Compaction Configuration](../configuration/index.md#compaction-dynamic-configuration) for details. - Running periodic Hadoop batch ingestion jobs and using a `dataSource` inputSpec to read from the segments generated by the Kafka indexing tasks. This might be helpful if you want to compact a lot of segments in parallel. Details on how to do this can be found on the [Updating existing data](../ingestion/data-management.md#update) section diff --git a/docs/operations/single-server.md b/docs/operations/single-server.md index 9ec769cf7ffc..6ba142228037 100644 --- a/docs/operations/single-server.md +++ b/docs/operations/single-server.md @@ -40,7 +40,7 @@ The other configurations are intended for general use single-machine deployments The startup scripts for these example configurations run a single ZK instance along with the Druid services. You can choose to deploy ZK separately as well. 
-The example configurations run the Druid Coordinator and Overlord together in a single process using the optional configuration `druid.coordinator.asOverlord.enabled=true`, described in the [Coordinator configuration documentation](../configuration/index.html#coordinator-operation).
+The example configurations run the Druid Coordinator and Overlord together in a single process using the optional configuration `druid.coordinator.asOverlord.enabled=true`, described in the [Coordinator configuration documentation](../configuration/index.md#coordinator-operation).

While example configurations are provided for very large single machines, at higher scales we recommend running Druid in a [clustered deployment](../tutorials/cluster.md), for fault-tolerance and reduced resource contention.

diff --git a/docs/operations/tls-support.md b/docs/operations/tls-support.md
index 4eb07d13eb0c..e03e873dae44 100644
--- a/docs/operations/tls-support.md
+++ b/docs/operations/tls-support.md
@@ -35,9 +35,9 @@ and `druid.tlsPort` properties on each process. Please see `Configuration` secti

## Jetty server configuration

Apache Druid uses Jetty as an embedded web server. To get familiar with TLS/SSL in general and related concepts like Certificates etc.
-reading this [Jetty documentation](http://www.eclipse.org/jetty/documentation/9.4.x/configuring-ssl.html) might be helpful.
+reading this [Jetty documentation](http://www.eclipse.org/jetty/documentation/9.4.32.v20200930/configuring-ssl.html) might be helpful.
To get more in depth knowledge of TLS/SSL support in Java in general, please refer to this
[guide](http://docs.oracle.com/javase/8/docs/technotes/guides/security/jsse/JSSERefGuide.html).
-The documentation [here](http://www.eclipse.org/jetty/documentation/9.4.x/configuring-ssl.html#configuring-sslcontextfactory)
+The documentation [here](http://www.eclipse.org/jetty/documentation/9.4.32.v20200930/configuring-ssl.html#configuring-sslcontextfactory)
can help in understanding TLS/SSL configurations listed below. This [document](http://docs.oracle.com/javase/8/docs/technotes/guides/security/StandardNames.html) lists all the possible
values for the below mentioned configs among others provided by Java implementation.

@@ -75,7 +75,7 @@ The following table contains non-mandatory advanced configuration options, use c

## Internal communication over TLS

Whenever possible Druid processes will use HTTPS to talk to each other. To enable this communication Druid's HttpClient needs to
be configured with a proper [SSLContext](http://docs.oracle.com/javase/8/docs/api/javax/net/ssl/SSLContext.html) that is able
to validate the Server Certificates, otherwise communication will fail.

Since, there are various ways to configure SSLContext, by default, Druid looks for an instance of SSLContext Guice binding

diff --git a/docs/querying/aggregations.md b/docs/querying/aggregations.md
index 7da5b5c56bd5..5a6a69b8eedb 100644
--- a/docs/querying/aggregations.md
+++ b/docs/querying/aggregations.md
@@ -379,7 +379,7 @@ As a general guideline for experimentation, the [Moments Sketch paper](https://a

#### Fixed Buckets Histogram

-Druid also provides a [simple histogram implementation](../development/extensions-core/approximate-histograms.html#fixed-buckets-histogram) that uses a fixed range and fixed number of buckets with support for quantile estimation, backed by an array of bucket count values.
+Druid also provides a [simple histogram implementation](../development/extensions-core/approximate-histograms.md#fixed-buckets-histogram) that uses a fixed range and fixed number of buckets with support for quantile estimation, backed by an array of bucket count values. The fixed buckets histogram can perform well when the distribution of the input data allows a small number of buckets to be used. diff --git a/docs/querying/caching.md b/docs/querying/caching.md index 74120c9b2777..b5a0d00ad1c2 100644 --- a/docs/querying/caching.md +++ b/docs/querying/caching.md @@ -71,7 +71,7 @@ enables the Historicals to do their own local result merging and puts less strai Task executor processes such as the Peon or the experimental Indexer only support segment-level caching. Segment-level caching is controlled by the query context parameters `useCache` and `populateCache` -and [runtime properties](../configuration/index.html) `druid.realtime.cache.*`. +and [runtime properties](../configuration/index.md) `druid.realtime.cache.*`. Larger production clusters should enable segment-level cache population on task execution processes only (not on Brokers) to avoid having to use Brokers to merge all query results. Enabling cache population on the @@ -82,17 +82,3 @@ Note that the task executor processes only support caches that keep their data l This restriction exists because the cache stores results at the level of intermediate partial segments generated by the ingestion tasks. These intermediate partial segments will not necessarily be identical across task replicas, so remote cache types such as `memcached` will be ignored by task executor processes. - -## Unsupported queries - -Query caching is not available for following: -- Queries, that involve a `union` datasource, do not support result-level caching. Refer to the -[related issue](https://github.com/apache/druid/issues/8713) for details. Please note that not all union SQL queries are executed using a union datasource. 
You can use the `explain` operation to see how the union query in sql will be executed. -- Queries, that involve an `Inline` datasource or a `Lookup` datasource, do not support any caching. -- Queries, with a sub-query in them, do not support any caching though the output of sub-queries itself may be cached. -Refer to the [Query execution](query-execution.md#query) page for more details on how sub-queries are executed. -- Join queries do not support any caching on the broker [More details](https://github.com/apache/druid/issues/10444). -- GroupBy v2 queries do not support any caching on broker [More details](https://github.com/apache/druid/issues/3820). -- Data Source Metadata queries are not cached anywhere. -- Queries, that have `bySegment` set in the query context, are not cached on the broker. They are currently cached on -historical but this behavior will potentially be removed in the future. diff --git a/docs/querying/datasource.md b/docs/querying/datasource.md index b7625920835b..cfd355b22926 100644 --- a/docs/querying/datasource.md +++ b/docs/querying/datasource.md @@ -24,7 +24,7 @@ title: "Datasources" Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource, and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true -[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table +[during data ingestion](../ingestion/index.md), where ingestion is always creating or writing into a table datasource. But at query time, there are many other types of datasources available. The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and @@ -51,10 +51,10 @@ SELECT column1, column2 FROM "druid"."dataSourceName" The table datasource is the most common type. This is the kind of datasource you get when you perform -[data ingestion](../ingestion/index.html). 
They are split up into segments, distributed around the cluster, +[data ingestion](../ingestion/index.md). They are split up into segments, distributed around the cluster, and queried in parallel. -In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table +In [Druid SQL](sql.md#from), table datasources reside in the the `druid` schema. This is the default schema, so table datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`. In native queries, table datasources can be referenced using their names as strings (as in the example above), or by @@ -91,7 +91,7 @@ SELECT k, v FROM lookup.countries ``` -Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from), +Lookup datasources correspond to Druid's key-value [lookup](lookups.md) objects. In [Druid SQL](sql.md#from), they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly. They can be joined onto regular tables using the [join operator](#join). @@ -102,7 +102,7 @@ To see a list of all lookup datasources, use the SQL query `SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`. > Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the -> SQL [`LOOKUP` function](sql.html#string-functions). +> SQL [`LOOKUP` function](sql.md#string-functions). > However, the join operator must evaluate the condition on each row, whereas the > `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is > usually faster than joining to a lookup datasource. @@ -113,16 +113,6 @@ use table datasources. 
### `union` - -```sql -SELECT col1, COUNT(*) -FROM ( - SELECT col1, col2, col3 FROM tbl1 - UNION ALL - SELECT col1, col2, col3 FROM tbl2 -) -GROUP BY col1 -``` ```json { @@ -144,6 +134,8 @@ another will be treated as if they contained all null values in the tables where The list of "dataSources" must be nonempty. If you want to query an empty dataset, use an [`inline` datasource](#inline) instead. +Union datasources are not available in Druid SQL. + Refer to the [Query execution](query-execution.md#union) page for more details on how queries are executed when you use union datasources. @@ -332,7 +324,7 @@ Native join datasources have the following properties. All are required. Joins are a feature that can significantly affect performance of your queries. Some performance tips and notes: 1. Joins are especially useful with [lookup datasources](#lookup), but in most cases, the -[`LOOKUP` function](sql.html#string-functions) performs better than a join. Consider using the `LOOKUP` function if +[`LOOKUP` function](sql.md#string-functions) performs better than a join. Consider using the `LOOKUP` function if it is appropriate for your use case. 2. When using joins in Druid SQL, keep in mind that it can generate subqueries that you did not explicitly include in your queries. Refer to the [Druid SQL](sql.md#query-translation) documentation for more details about when this happens diff --git a/docs/querying/dimensionspecs.md b/docs/querying/dimensionspecs.md index fda3fcd55ecf..b2ad5f5599bc 100644 --- a/docs/querying/dimensionspecs.md +++ b/docs/querying/dimensionspecs.md @@ -71,7 +71,7 @@ Please refer to the [Output Types](#output-types) section for more details. ### Filtered DimensionSpecs -These are only useful for multi-value dimensions. If you have a row in Apache Druid that has a multi-value dimension with values ["v1", "v2", "v3"] and you send a groupBy/topN query grouping by that dimension with [query filter](filters.html) for value "v1". 
In the response you will get 3 rows containing "v1", "v2" and "v3". This behavior might be unintuitive for some use cases.
+These are only useful for multi-value dimensions. If you have a row in Apache Druid that has a multi-value dimension with values ["v1", "v2", "v3"] and you send a groupBy/topN query grouping by that dimension with a [query filter](filters.md) for value "v1", then in the response you will get 3 rows containing "v1", "v2" and "v3". This behavior might be unintuitive for some use cases.

It happens because "query filter" is internally used on the bitmaps and only used to match the row to be included in the query result processing. With multi-value dimensions, "query filter" behaves like a contains check, which will match the row with dimension value ["v1", "v2", "v3"]. Please see the section on "Multi-value columns" in [segment](../design/segments.md) for more details.

Then groupBy/topN processing pipeline "explodes" all multi-value dimensions resulting 3 rows for "v1", "v2" and "v3" each.

@@ -96,7 +96,7 @@ Following filtered dimension spec retains only the values starting with the same

{ "type" : "prefixFiltered", "delegate" : , "prefix":  }
```

-For more details and examples, see [multi-value dimensions](multi-value-dimensions.html).
+For more details and examples, see [multi-value dimensions](multi-value-dimensions.md).

### Lookup DimensionSpecs

@@ -201,7 +201,7 @@ Returns the dimension value unchanged if the regular expression matches, otherwi

### Search query extraction function

-Returns the dimension value unchanged if the given [`SearchQuerySpec`](../querying/searchquery.html#searchqueryspec)
+Returns the dimension value unchanged if the given [`SearchQuerySpec`](../querying/searchquery.md#searchqueryspec)
```json @@ -254,7 +254,7 @@ For a regular dimension, it assumes the string is formatted in * `format` : date time format for the resulting dimension value, in [Joda Time DateTimeFormat](http://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html), or null to use the default ISO8601 format. * `locale` : locale (language and country) to use, given as a [IETF BCP 47 language tag](http://www.oracle.com/technetwork/java/javase/java8locales-2095355.html#util-text), e.g. `en-US`, `en-GB`, `fr-FR`, `fr-CA`, etc. * `timeZone` : time zone to use in [IANA tz database format](http://en.wikipedia.org/wiki/List_of_tz_database_time_zones), e.g. `Europe/Berlin` (this can possibly be different than the aggregation time-zone) -* `granularity` : [granularity](granularities.html) to apply before formatting, or omit to not apply any granularity. +* `granularity` : [granularity](granularities.md) to apply before formatting, or omit to not apply any granularity. * `asMillis` : boolean value, set to true to treat input strings as millis rather than ISO8601 strings. Additionally, if `format` is null or not specified, output will be in millis rather than ISO8601. ```json @@ -371,7 +371,7 @@ be treated as missing. It is illegal to set `retainMissingValue = true` and also specify a `replaceMissingValueWith`. A property of `injective` can override the lookup's own sense of whether or not it is -[injective](lookups.html#query-execution). If left unspecified, Druid will use the registered cluster-wide lookup +[injective](lookups.md#query-execution). If left unspecified, Druid will use the registered cluster-wide lookup configuration. A property `optimize` can be supplied to allow optimization of lookup based extraction filter (by default `optimize = true`). 
diff --git a/docs/querying/filters.md b/docs/querying/filters.md index c3b377cba89c..4e8c17b6fd98 100644 --- a/docs/querying/filters.md +++ b/docs/querying/filters.md @@ -137,7 +137,7 @@ The JavaScript filter supports the use of extraction functions, see [Filtering w > The extraction filter is now deprecated. The selector filter with an extraction function specified > provides identical functionality and should be used instead. -Extraction filter matches a dimension using some specific [Extraction function](./dimensionspecs.html#extraction-functions). +Extraction filter matches a dimension using some specific [Extraction function](./dimensionspecs.md#extraction-functions). The following filter matches the values for which the extraction function has transformation entry `input_key=output_value` where `output_value` is equal to the filter `value` and `input_key` is present as dimension. @@ -409,7 +409,7 @@ The filter above is equivalent to the following OR of Bound filters: ### Filtering with Extraction Functions All filters except the "spatial" filter support extraction functions. An extraction function is defined by setting the "extractionFn" field on a filter. -See [Extraction function](./dimensionspecs.html#extraction-functions) for more details on extraction functions. +See [Extraction function](./dimensionspecs.md#extraction-functions) for more details on extraction functions. If specified, the extraction function will be used to transform input values before the filter is applied. The example below shows a selector filter combined with an extraction function. This filter will transform input values @@ -483,7 +483,7 @@ Query filters can also be applied to the timestamp column. The timestamp column to the timestamp column, use the string `__time` as the dimension name. Like numeric dimensions, timestamp filters should be specified as if the timestamp values were strings. 
-If the user wishes to interpret the timestamp with a specific format, timezone, or locale, the [Time Format Extraction Function](./dimensionspecs.html#time-format-extraction-function) is useful. +If the user wishes to interpret the timestamp with a specific format, timezone, or locale, the [Time Format Extraction Function](./dimensionspecs.md#time-format-extraction-function) is useful. For example, filtering on a long timestamp value: diff --git a/docs/querying/groupbyquery.md b/docs/querying/groupbyquery.md index 652953490321..3852f58106ac 100644 --- a/docs/querying/groupbyquery.md +++ b/docs/querying/groupbyquery.md @@ -93,7 +93,7 @@ Following are main parts to a groupBy query: |aggregations|See [Aggregations](../querying/aggregations.md)|no| |postAggregations|See [Post Aggregations](../querying/post-aggregations.md)|no| |intervals|A JSON Object representing ISO-8601 Intervals. This defines the time ranges to run the query over.|yes| -|subtotalsSpec| A JSON array of arrays to return additional result sets for groupings of subsets of top level `dimensions`. It is [described later](groupbyquery.html#more-on-subtotalsspec) in more detail.|no| +|subtotalsSpec| A JSON array of arrays to return additional result sets for groupings of subsets of top level `dimensions`. It is [described later](groupbyquery.md#more-on-subtotalsspec) in more detail.|no| |context|An additional JSON Object which can be used to specify certain flags.|no| To pull it all together, the above query would return *n\*m* data points, up to a maximum of 5000 points, where n is the cardinality of the `country` dimension, m is the cardinality of the `device` dimension, each day between 2012-01-01 and 2012-01-03, from the `sample_datasource` table. 
Each data point contains the (long) sum of `total_usage` if the value of the data point is greater than 100, the (double) sum of `data_transfer` and the (double) result of `total_usage` divided by `data_transfer` for the filter set for a particular grouping of `country` and `device`. The output looks like this: @@ -132,10 +132,10 @@ groupBy queries can group on multi-value dimensions. When grouping on a multi-va from matching rows will be used to generate one group per value. It's possible for a query to return more groups than there are rows. For example, a groupBy on the dimension `tags` with filter `"t1" AND "t3"` would match only row1, and generate a result with three groups: `t1`, `t2`, and `t3`. If you only need to include values that match -your filter, you can use a [filtered dimensionSpec](dimensionspecs.html#filtered-dimensionspecs). This can also +your filter, you can use a [filtered dimensionSpec](dimensionspecs.md#filtered-dimensionspecs). This can also improve performance. -See [Multi-value dimensions](multi-value-dimensions.html) for more details. +See [Multi-value dimensions](multi-value-dimensions.md) for more details. ## More on subtotalsSpec @@ -297,9 +297,9 @@ will not exceed available memory for the maximum possible concurrent query load `druid.processing.numMergeBuffers`). See the [basic cluster tuning guide](../operations/basic-cluster-tuning.md) for more details about direct memory usage, organized by Druid process type. -Brokers do not need merge buffers for basic groupBy queries. Queries with subqueries (using a `query` dataSource) require one merge buffer if there is a single subquery, or two merge buffers if there is more than one layer of nested subqueries. Queries with [subtotals](groupbyquery.html#more-on-subtotalsspec) need one merge buffer. These can stack on top of each other: a groupBy query with multiple layers of nested subqueries, and that also uses subtotals, will need three merge buffers. 
+Brokers do not need merge buffers for basic groupBy queries. Queries with subqueries (using a `query` dataSource) require one merge buffer if there is a single subquery, or two merge buffers if there is more than one layer of nested subqueries. Queries with [subtotals](groupbyquery.md#more-on-subtotalsspec) need one merge buffer. These can stack on top of each other: a groupBy query with multiple layers of nested subqueries, and that also uses subtotals, will need three merge buffers. -Historicals and ingestion tasks need one merge buffer for each groupBy query, unless [parallel combination](groupbyquery.html#parallel-combine) is enabled, in which case they need two merge buffers per query. +Historicals and ingestion tasks need one merge buffer for each groupBy query, unless [parallel combination](groupbyquery.md#parallel-combine) is enabled, in which case they need two merge buffers per query. When using groupBy v1, all aggregation is done on-heap, and resource limits are done through the parameter `druid.query.groupBy.maxResults`. This is a cap on the maximum number of results in a result set. Queries that exceed @@ -353,11 +353,11 @@ computing intermediate aggregates from each segment and another for combining in There are some situations where other query types may be a better choice than groupBy. -- For queries with no "dimensions" (i.e. grouping by time only) the [Timeseries query](timeseriesquery.html) will +- For queries with no "dimensions" (i.e. grouping by time only) the [Timeseries query](timeseriesquery.md) will generally be faster than groupBy. The major differences are that it is implemented in a fully streaming manner (taking advantage of the fact that segments are already sorted on time) and does not need to use a hash table for merging. -- For queries with a single "dimensions" element (i.e. grouping by one string dimension), the [TopN query](topnquery.html) +- For queries with a single "dimensions" element (i.e. 
grouping by one string dimension), the [TopN query](topnquery.md) will sometimes be faster than groupBy. This is especially true if you are ordering by a metric and find approximate results acceptable. @@ -371,7 +371,7 @@ strategy perform the outer query on the Broker in a single-threaded fashion. ### Configurations -This section describes the configurations for groupBy queries. You can set the runtime properties in the `runtime.properties` file on Broker, Historical, and MiddleManager processes. You can set the query context parameters through the [query context](query-context.html). +This section describes the configurations for groupBy queries. You can set the runtime properties in the `runtime.properties` file on Broker, Historical, and MiddleManager processes. You can set the query context parameters through the [query context](query-context.md). #### Configurations for groupBy v2 diff --git a/docs/querying/having.md b/docs/querying/having.md index 9dbb32fa7fd4..537c5ae9f072 100644 --- a/docs/querying/having.md +++ b/docs/querying/having.md @@ -35,7 +35,7 @@ Apache Druid supports the following types of having clauses. ### Query filters -Query filter HavingSpecs allow all [Druid query filters](filters.html) to be used in the Having part of the query. +Query filter HavingSpecs allow all [Druid query filters](filters.md) to be used in the Having part of the query. 
The grammar for a query filter HavingSpec is: diff --git a/docs/querying/lookups.md b/docs/querying/lookups.md index b620ef17835f..c698bdefef67 100644 --- a/docs/querying/lookups.md +++ b/docs/querying/lookups.md @@ -56,7 +56,7 @@ Other lookup types are available as extensions, including: Query Syntax ------------ -In [Druid SQL](sql.html), lookups can be queried using the [`LOOKUP` function](sql.md#string-functions), for example: +In [Druid SQL](sql.md), lookups can be queried using the [`LOOKUP` function](sql.md#string-functions), for example: ```sql SELECT @@ -78,7 +78,7 @@ FROM GROUP BY 1 ``` -In native queries, lookups can be queried with [dimension specs or extraction functions](dimensionspecs.html). +In native queries, lookups can be queried with [dimension specs or extraction functions](dimensionspecs.md). Query Execution --------------- diff --git a/docs/querying/multi-value-dimensions.md b/docs/querying/multi-value-dimensions.md index 2c4784298265..2926091dab3b 100644 --- a/docs/querying/multi-value-dimensions.md +++ b/docs/querying/multi-value-dimensions.md @@ -29,8 +29,8 @@ characters). This document describes the behavior of groupBy (topN has similar behavior) queries on multi-value dimensions when they are used as a dimension being grouped by. See the section on multi-value columns in -[segments](../design/segments.html#multi-value-columns) for internal representation details. Examples in this document -are in the form of [native Druid queries](querying.html). Refer to the [Druid SQL documentation](sql.html) for details +[segments](../design/segments.md#multi-value-columns) for internal representation details. Examples in this document +are in the form of [native Druid queries](querying.md). Refer to the [Druid SQL documentation](sql.md) for details about using multi-value string dimensions in SQL. ## Querying multi-value dimensions @@ -47,7 +47,7 @@ called `tags`. 
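As context for the multi-value-dimensions page touched by this hunk, which describes a dataSource with a multi-value dimension called `tags`: such a row could be ingested from JSON like the sketch below (the timestamp and other field names are illustrative; the `t1`–`t3` tag values match the row1 example discussed in the patched page):

```json
{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1", "t2", "t3"]}
```

A groupBy on `tags` would then generate one group per value, so this single row contributes to three groups.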
### Filtering -All query types, as well as [filtered aggregators](aggregations.html#filtered-aggregator), can filter on multi-value +All query types, as well as [filtered aggregators](aggregations.md#filtered-aggregator), can filter on multi-value dimensions. Filters follow these rules on multi-value dimensions: - Value filters (like "selector", "bound", and "in") match a row if any of the values of a multi-value dimension match @@ -115,12 +115,12 @@ from matching rows will be used to generate one group per value. This can be tho `UNNEST` operator used on an `ARRAY` type that many SQL dialects support. This means it's possible for a query to return more groups than there are rows. For example, a topN on the dimension `tags` with filter `"t1" AND "t3"` would match only row1, and generate a result with three groups: `t1`, `t2`, and `t3`. If you only need to include values that match -your filter, you can use a [filtered dimensionSpec](dimensionspecs.html#filtered-dimensionspecs). This can also +your filter, you can use a [filtered dimensionSpec](dimensionspecs.md#filtered-dimensionspecs). This can also improve performance. ### Example: GroupBy query with no filtering -See [GroupBy querying](groupbyquery.html) for details. +See [GroupBy querying](groupbyquery.md) for details. ```json { @@ -208,7 +208,7 @@ notice how original rows are "exploded" into multiple rows and merged. ### Example: GroupBy query with a selector query filter -See [query filters](filters.html) for details of selector query filter. +See [query filters](filters.md) for details of selector query filter. ```json { @@ -293,7 +293,7 @@ the multiple values matches the query filter. To solve the problem above and to get only rows for "t3" returned, you would have to use a "filtered dimension spec" as in the query below. -See section on filtered dimensionSpecs in [dimensionSpecs](dimensionspecs.html#filtered-dimensionspecs) for details. 
+See section on filtered dimensionSpecs in [dimensionSpecs](dimensionspecs.md#filtered-dimensionspecs) for details. ```json { @@ -344,6 +344,6 @@ returns the following result. ] ``` -Note that, for groupBy queries, you could get similar result with a [having spec](having.html) but using a filtered +Note that, for groupBy queries, you could get similar result with a [having spec](having.md) but using a filtered dimensionSpec is much more efficient because that gets applied at the lowest level in the query processing pipeline. Having specs are applied at the outermost level of groupBy query processing. diff --git a/docs/querying/multitenancy.md b/docs/querying/multitenancy.md index 4522a47b7be0..ed07770dddf4 100644 --- a/docs/querying/multitenancy.md +++ b/docs/querying/multitenancy.md @@ -57,7 +57,7 @@ If your multitenant cluster uses shared datasources, most of your queries will l dimension. These sorts of queries perform best when data is well-partitioned by tenant. There are a few ways to accomplish this. -With batch indexing, you can use [single-dimension partitioning](../ingestion/hadoop.html#single-dimension-range-partitioning) +With batch indexing, you can use [single-dimension partitioning](../ingestion/hadoop.md#single-dimension-range-partitioning) to partition your data by tenant_id. Druid always partitions by time first, but the secondary partition within each time bucket will be on tenant_id. diff --git a/docs/querying/query-context.md b/docs/querying/query-context.md index 7bb912c091d4..cab876e91dab 100644 --- a/docs/querying/query-context.md +++ b/docs/querying/query-context.md @@ -39,9 +39,9 @@ These parameters apply to all query types. |property |default | description | |-----------------|----------------------------------------|----------------------| -|timeout | `druid.server.http.defaultQueryTimeout`| Query timeout in millis, beyond which unfinished queries will be cancelled. 0 timeout means `no timeout`. 
To set the default timeout, see [Broker configuration](../configuration/index.html#broker) | +|timeout | `druid.server.http.defaultQueryTimeout`| Query timeout in millis, beyond which unfinished queries will be cancelled. 0 timeout means `no timeout`. To set the default timeout, see [Broker configuration](../configuration/index.md#broker) | |priority | `0` | Query Priority. Queries with higher priority get precedence for computational resources.| -|lane | `null` | Query lane, used to control usage limits on classes of queries. See [Broker configuration](../configuration/index.html#broker) for more details.| +|lane | `null` | Query lane, used to control usage limits on classes of queries. See [Broker configuration](../configuration/index.md#broker) for more details.| |queryId | auto-generated | Unique identifier given to this query. If a query ID is set or known, this can be used to cancel the query | |useCache | `true` | Flag indicating whether to leverage the query cache for this query. When set to false, it disables reading from the query cache for this query. When set to true, Apache Druid uses `druid.broker.cache.useCache` or `druid.historical.cache.useCache` to determine whether or not to read from the query cache | |populateCache | `true` | Flag indicating whether to save the results of the query to the query cache. Primarily used for debugging. When set to false, it disables saving the results of this query to the query cache. When set to true, Druid uses `druid.broker.cache.populateCache` or `druid.historical.cache.populateCache` to determine whether or not to save the results of this query to the query cache | @@ -49,14 +49,14 @@ These parameters apply to all query types. |populateResultLevelCache | `true` | Flag indicating whether to save the results of the query to the result level cache. Primarily used for debugging. When set to false, it disables saving the results of this query to the query cache. 
When set to true, Druid uses `druid.broker.cache.populateResultLevelCache` to determine whether or not to save the results of this query to the result-level query cache | |bySegment | `false` | Return "by segment" results. Primarily used for debugging, setting it to `true` returns results associated with the data segment they came from | |finalize | `true` | Flag indicating whether to "finalize" aggregation results. Primarily used for debugging. For instance, the `hyperUnique` aggregator will return the full HyperLogLog sketch instead of the estimated cardinality when this flag is set to `false` | -|maxScatterGatherBytes| `druid.server.http.maxScatterGatherBytes` | Maximum number of bytes gathered from data processes such as Historicals and realtime processes to execute a query. This parameter can be used to further reduce `maxScatterGatherBytes` limit at query time. See [Broker configuration](../configuration/index.html#broker) for more details.| +|maxScatterGatherBytes| `druid.server.http.maxScatterGatherBytes` | Maximum number of bytes gathered from data processes such as Historicals and realtime processes to execute a query. This parameter can be used to further reduce `maxScatterGatherBytes` limit at query time. See [Broker configuration](../configuration/index.md#broker) for more details.| |maxQueuedBytes | `druid.broker.http.maxQueuedBytes` | Maximum number of bytes queued per query before exerting backpressure on the channel to the data server. Similar to `maxScatterGatherBytes`, except unlike that configuration, this one will trigger backpressure rather than query failure. 
Zero means disabled.| |serializeDateTimeAsLong| `false` | If true, DateTime is serialized as long in the result returned by Broker and the data transportation between Broker and compute process| |serializeDateTimeAsLongInner| `false` | If true, DateTime is serialized as long in the data transportation between Broker and compute process| -|enableParallelMerge|`true`|Enable parallel result merging on the Broker. Note that `druid.processing.merge.useParallelMergePool` must be enabled for this setting to be set to `true`. See [Broker configuration](../configuration/index.html#broker) for more details.| -|parallelMergeParallelism|`druid.processing.merge.pool.parallelism`|Maximum number of parallel threads to use for parallel result merging on the Broker. See [Broker configuration](../configuration/index.html#broker) for more details.| -|parallelMergeInitialYieldRows|`druid.processing.merge.task.initialYieldNumRows`|Number of rows to yield per ForkJoinPool merge task for parallel result merging on the Broker, before forking off a new task to continue merging sequences. See [Broker configuration](../configuration/index.html#broker) for more details.| -|parallelMergeSmallBatchRows|`druid.processing.merge.task.smallBatchNumRows`|Size of result batches to operate on in ForkJoinPool merge tasks for parallel result merging on the Broker. See [Broker configuration](../configuration/index.html#broker) for more details.| +|enableParallelMerge|`true`|Enable parallel result merging on the Broker. Note that `druid.processing.merge.useParallelMergePool` must be enabled for this setting to be set to `true`. See [Broker configuration](../configuration/index.md#broker) for more details.| +|parallelMergeParallelism|`druid.processing.merge.pool.parallelism`|Maximum number of parallel threads to use for parallel result merging on the Broker. 
See [Broker configuration](../configuration/index.md#broker) for more details.| +|parallelMergeInitialYieldRows|`druid.processing.merge.task.initialYieldNumRows`|Number of rows to yield per ForkJoinPool merge task for parallel result merging on the Broker, before forking off a new task to continue merging sequences. See [Broker configuration](../configuration/index.md#broker) for more details.| +|parallelMergeSmallBatchRows|`druid.processing.merge.task.smallBatchNumRows`|Size of result batches to operate on in ForkJoinPool merge tasks for parallel result merging on the Broker. See [Broker configuration](../configuration/index.md#broker) for more details.| |useFilterCNF|`false`| If true, Druid will attempt to convert the query filter to Conjunctive Normal Form (CNF). During query processing, columns can be pre-filtered by intersecting the bitmap indexes of all values that match the eligible filters, often greatly reducing the raw number of rows which need to be scanned. But this effect only happens for the top level filter, or individual clauses of a top level 'and' filter. As such, filters in CNF potentially have a higher chance to utilize a large amount of bitmap indexes on string columns during pre-filtering. However, this setting should be used with great caution, as it can sometimes have a negative effect on performance, and in some cases, the act of computing CNF of a filter can be expensive. We recommend hand tuning your filters to produce an optimal form if possible, or at least verifying through experimentation that using this parameter actually improves your query performance with no ill-effects.| |secondaryPartitionPruning|`true`|Enable secondary partition pruning on the Broker. 
The Broker will always prune unnecessary segments from the input scan based on a filter on time intervals, but if the data is further partitioned with hash or range partitioning, this option will enable additional pruning based on a filter on secondary partition dimensions.| @@ -98,7 +98,7 @@ include "selector", "bound", "in", "like", "regex", "search", "and", "or", and " - For GroupBy: No multi-value dimensions. - For Timeseries: No "descending" order. - Only immutable segments (not real-time). -- Only [table datasources](datasource.html#table) (not joins, subqueries, lookups, or inline datasources). +- Only [table datasources](datasource.md#table) (not joins, subqueries, lookups, or inline datasources). Other query types (like TopN, Scan, Select, and Search) ignore the "vectorize" parameter, and will execute without vectorization. These query types will ignore the "vectorize" parameter even if it is set to `"force"`. diff --git a/docs/querying/querying.md b/docs/querying/querying.md index 53e202b7d660..5ba624844005 100644 --- a/docs/querying/querying.md +++ b/docs/querying/querying.md @@ -41,9 +41,8 @@ You can also enter them directly in the Druid console's Query view. Simply pasti ![Native query](../assets/native-queries-01.png "Native query") - Druid's native query language is JSON over HTTP, although many members of the community have contributed different -[client libraries](/libraries.html) in other languages to query Druid. +[client libraries](https://druid.apache.org/libraries.html) in other languages to query Druid. The Content-Type/Accept Headers can also take 'application/x-jackson-smile'. diff --git a/docs/querying/sorting-orders.md b/docs/querying/sorting-orders.md index 3e16eebf7c05..34f80572bf82 100644 --- a/docs/querying/sorting-orders.md +++ b/docs/querying/sorting-orders.md @@ -27,7 +27,7 @@ title: "String comparators" > language. For information about functions available in SQL, refer to the > [SQL documentation](sql.md#scalar-functions). 
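For reference, the context parameters documented in the query-context hunks above (such as `timeout`, `priority`, and `vectorize`) are passed in the `context` object of a native query. A minimal illustrative fragment — the datasource name, interval, and parameter values shown are arbitrary:

```json
{
  "queryType": "timeseries",
  "dataSource": "sample_datasource",
  "intervals": ["2020-01-01/2020-02-01"],
  "granularity": "day",
  "aggregations": [{"type": "count", "name": "rows"}],
  "context": {
    "timeout": 60000,
    "priority": 10,
    "vectorize": "force"
  }
}
```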
-These sorting orders are used by the [TopNMetricSpec](./topnmetricspec.md), [SearchQuery](./searchquery.md), GroupByQuery's [LimitSpec](./limitspec.md), and [BoundFilter](./filters.html#bound-filter). +These sorting orders are used by the [TopNMetricSpec](./topnmetricspec.md), [SearchQuery](./searchquery.md), GroupByQuery's [LimitSpec](./limitspec.md), and [BoundFilter](./filters.md#bound-filter). ## Lexicographic Sorts values by converting Strings to their UTF-8 byte array representations and comparing lexicographically, byte-by-byte. diff --git a/docs/querying/sql.md b/docs/querying/sql.md index 861397265bf8..170389d47d6f 100644 --- a/docs/querying/sql.md +++ b/docs/querying/sql.md @@ -62,23 +62,23 @@ FROM { | () | [ INNER | LEFT ] JOIN ON condition } The FROM clause can refer to any of the following: -- [Table datasources](datasource.html#table) from the `druid` schema. This is the default schema, so Druid table +- [Table datasources](datasource.md#table) from the `druid` schema. This is the default schema, so Druid table datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`. -- [Lookups](datasource.html#lookup) from the `lookup` schema, for example `lookup.countries`. Note that lookups can +- [Lookups](datasource.md#lookup) from the `lookup` schema, for example `lookup.countries`. Note that lookups can also be queried using the [`LOOKUP` function](#string-functions). -- [Subqueries](datasource.html#query). -- [Joins](datasource.html#join) between anything in this list, except between native datasources (table, lookup, +- [Subqueries](datasource.md#query). +- [Joins](datasource.md#join) between anything in this list, except between native datasources (table, lookup, query) and system tables. The join condition must be an equality between expressions from the left- and right-hand side of the join. - [Metadata tables](#metadata-tables) from the `INFORMATION_SCHEMA` or `sys` schemas. 
Unlike the other options for the FROM clause, metadata tables are not considered datasources. They exist only in the SQL layer. -For more information about table, lookup, query, and join datasources, refer to the [Datasources](datasource.html) +For more information about table, lookup, query, and join datasources, refer to the [Datasources](datasource.md) documentation. ### WHERE -The WHERE clause refers to columns in the FROM table, and will be translated to [native filters](filters.html). The +The WHERE clause refers to columns in the FROM table, and will be translated to [native filters](filters.md). The WHERE clause can also reference a subquery, like `WHERE col1 IN (SELECT foo FROM ...)`. Queries like this are executed as a join on the subquery, described below in the [Query translation](#subqueries) section. @@ -257,14 +257,14 @@ converted to zeroes). ### Multi-value strings Druid's native type system allows strings to potentially have multiple values. These -[multi-value string dimensions](multi-value-dimensions.html) will be reported in SQL as `VARCHAR` typed, and can be +[multi-value string dimensions](multi-value-dimensions.md) will be reported in SQL as `VARCHAR` typed, and can be syntactically used like any other VARCHAR. Regular string functions that refer to multi-value string dimensions will be applied to all values for each row individually. Multi-value string dimensions can also be treated as arrays via special [multi-value string functions](#multi-value-string-functions), which can perform powerful array-aware operations. Grouping by a multi-value expression will observe the native Druid multi-value aggregation behavior, which is similar to the `UNNEST` functionality available in some other SQL dialects. Refer to the documentation on -[multi-value string dimensions](multi-value-dimensions.html) for additional details. +[multi-value string dimensions](multi-value-dimensions.md) for additional details. 
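The FROM-clause section patched above mentions joining against lookups such as `lookup.countries`. A sketch of such a join — the `wikipedia` datasource and `countryIsoCode` column are illustrative, and tables in the `lookup` schema expose key/value columns named `k` and `v`:

```sql
SELECT w.countryIsoCode, c.v AS countryName, COUNT(*) AS edits
FROM druid.wikipedia AS w
LEFT JOIN lookup.countries AS c
  ON w.countryIsoCode = c.k
GROUP BY 1, 2
```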
> Because multi-value dimensions are treated by the SQL planner as `VARCHAR`, there are some inconsistencies between how > they are handled in Druid SQL and in native queries. For example, expressions involving multi-value dimensions may be @@ -275,7 +275,7 @@ the `UNNEST` functionality available in some other SQL dialects. Refer to the do ### NULL values -The `druid.generic.useDefaultValueForNull` [runtime property](../configuration/index.html#sql-compatible-null-handling) +The `druid.generic.useDefaultValueForNull` [runtime property](../configuration/index.md#sql-compatible-null-handling) controls Druid's NULL handling mode. In the default mode (`true`), Druid treats NULLs and empty strings interchangeably, rather than according to the SQL @@ -314,23 +314,23 @@ Only the COUNT aggregation can accept DISTINCT. |`MAX(expr)`|Takes the maximum of numbers.| |`AVG(expr)`|Averages numbers.| |`APPROX_COUNT_DISTINCT(expr)`|Counts distinct values of expr, which can be a regular column or a hyperUnique column. This is always approximate, regardless of the value of "useApproximateCountDistinct". This uses Druid's built-in "cardinality" or "hyperUnique" aggregators. See also `COUNT(DISTINCT expr)`.| -|`APPROX_COUNT_DISTINCT_DS_HLL(expr, [lgK, tgtHllType])`|Counts distinct values of expr, which can be a regular column or an [HLL sketch](../development/extensions-core/datasketches-hll.html) column. The `lgK` and `tgtHllType` parameters are described in the HLL sketch documentation. This is always approximate, regardless of the value of "useApproximateCountDistinct". See also `COUNT(DISTINCT expr)`. The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use this function.| -|`APPROX_COUNT_DISTINCT_DS_THETA(expr, [size])`|Counts distinct values of expr, which can be a regular column or a [Theta sketch](../development/extensions-core/datasketches-theta.html) column. 
The `size` parameter is described in the Theta sketch documentation. This is always approximate, regardless of the value of "useApproximateCountDistinct". See also `COUNT(DISTINCT expr)`. The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use this function.| -|`DS_HLL(expr, [lgK, tgtHllType])`|Creates an [HLL sketch](../development/extensions-core/datasketches-hll.html) on the values of expr, which can be a regular column or a column containing HLL sketches. The `lgK` and `tgtHllType` parameters are described in the HLL sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use this function.| -|`DS_THETA(expr, [size])`|Creates a [Theta sketch](../development/extensions-core/datasketches-theta.html) on the values of expr, which can be a regular column or a column containing Theta sketches. The `size` parameter is described in the Theta sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use this function.| -|`APPROX_QUANTILE(expr, probability, [resolution])`|Computes approximate quantiles on numeric or [approxHistogram](../development/extensions-core/approximate-histograms.html#approximate-histogram-aggregator) exprs. The "probability" should be between 0 and 1 (exclusive). The "resolution" is the number of centroids to use for the computation. Higher resolutions will give more precise results but also have higher overhead. If not provided, the default resolution is 50. The [approximate histogram extension](../development/extensions-core/approximate-histograms.html) must be loaded to use this function.| -|`APPROX_QUANTILE_DS(expr, probability, [k])`|Computes approximate quantiles on numeric or [Quantiles sketch](../development/extensions-core/datasketches-quantiles.html) exprs. The "probability" should be between 0 and 1 (exclusive). 
The `k` parameter is described in the Quantiles sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use this function.| -|`APPROX_QUANTILE_FIXED_BUCKETS(expr, probability, numBuckets, lowerLimit, upperLimit, [outlierHandlingMode])`|Computes approximate quantiles on numeric or [fixed buckets histogram](../development/extensions-core/approximate-histograms.html#fixed-buckets-histogram) exprs. The "probability" should be between 0 and 1 (exclusive). The `numBuckets`, `lowerLimit`, `upperLimit`, and `outlierHandlingMode` parameters are described in the fixed buckets histogram documentation. The [approximate histogram extension](../development/extensions-core/approximate-histograms.html) must be loaded to use this function.| -|`DS_QUANTILES_SKETCH(expr, [k])`|Creates a [Quantiles sketch](../development/extensions-core/datasketches-quantiles.html) on the values of expr, which can be a regular column or a column containing quantiles sketches. The `k` parameter is described in the Quantiles sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use this function.| -|`BLOOM_FILTER(expr, numEntries)`|Computes a bloom filter from values produced by `expr`, with `numEntries` maximum number of distinct values before false positive rate increases. See [bloom filter extension](../development/extensions-core/bloom-filter.html) documentation for additional details.| -|`TDIGEST_QUANTILE(expr, quantileFraction, [compression])`|Builds a T-Digest sketch on values produced by `expr` and returns the value for the quantile. Compression parameter (default value 100) determines the accuracy and size of the sketch. Higher compression means higher accuracy but more space to store sketches. 
See [t-digest extension](../development/extensions-contrib/tdigestsketch-quantiles.html) documentation for additional details.| -|`TDIGEST_GENERATE_SKETCH(expr, [compression])`|Builds a T-Digest sketch on values produced by `expr`. Compression parameter (default value 100) determines the accuracy and size of the sketch Higher compression means higher accuracy but more space to store sketches. See [t-digest extension](../development/extensions-contrib/tdigestsketch-quantiles.html) documentation for additional details.| -|`VAR_POP(expr)`|Computes variance population of `expr`. See [stats extension](../development/extensions-core/stats.html) documentation for additional details.| -|`VAR_SAMP(expr)`|Computes variance sample of `expr`. See [stats extension](../development/extensions-core/stats.html) documentation for additional details.| -|`VARIANCE(expr)`|Computes variance sample of `expr`. See [stats extension](../development/extensions-core/stats.html) documentation for additional details.| -|`STDDEV_POP(expr)`|Computes standard deviation population of `expr`. See [stats extension](../development/extensions-core/stats.html) documentation for additional details.| -|`STDDEV_SAMP(expr)`|Computes standard deviation sample of `expr`. See [stats extension](../development/extensions-core/stats.html) documentation for additional details.| -|`STDDEV(expr)`|Computes standard deviation sample of `expr`. See [stats extension](../development/extensions-core/stats.html) documentation for additional details.| +|`APPROX_COUNT_DISTINCT_DS_HLL(expr, [lgK, tgtHllType])`|Counts distinct values of expr, which can be a regular column or an [HLL sketch](../development/extensions-core/datasketches-hll.md) column. The `lgK` and `tgtHllType` parameters are described in the HLL sketch documentation. This is always approximate, regardless of the value of "useApproximateCountDistinct". See also `COUNT(DISTINCT expr)`. 
The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.| +|`APPROX_COUNT_DISTINCT_DS_THETA(expr, [size])`|Counts distinct values of expr, which can be a regular column or a [Theta sketch](../development/extensions-core/datasketches-theta.md) column. The `size` parameter is described in the Theta sketch documentation. This is always approximate, regardless of the value of "useApproximateCountDistinct". See also `COUNT(DISTINCT expr)`. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.| +|`DS_HLL(expr, [lgK, tgtHllType])`|Creates an [HLL sketch](../development/extensions-core/datasketches-hll.md) on the values of expr, which can be a regular column or a column containing HLL sketches. The `lgK` and `tgtHllType` parameters are described in the HLL sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.| +|`DS_THETA(expr, [size])`|Creates a [Theta sketch](../development/extensions-core/datasketches-theta.md) on the values of expr, which can be a regular column or a column containing Theta sketches. The `size` parameter is described in the Theta sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.| +|`APPROX_QUANTILE(expr, probability, [resolution])`|Computes approximate quantiles on numeric or [approxHistogram](../development/extensions-core/approximate-histograms.md#approximate-histogram-aggregator) exprs. The "probability" should be between 0 and 1 (exclusive). The "resolution" is the number of centroids to use for the computation. Higher resolutions will give more precise results but also have higher overhead. If not provided, the default resolution is 50. 
The [approximate histogram extension](../development/extensions-core/approximate-histograms.md) must be loaded to use this function.| +|`APPROX_QUANTILE_DS(expr, probability, [k])`|Computes approximate quantiles on numeric or [Quantiles sketch](../development/extensions-core/datasketches-quantiles.md) exprs. The "probability" should be between 0 and 1 (exclusive). The `k` parameter is described in the Quantiles sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.| +|`APPROX_QUANTILE_FIXED_BUCKETS(expr, probability, numBuckets, lowerLimit, upperLimit, [outlierHandlingMode])`|Computes approximate quantiles on numeric or [fixed buckets histogram](../development/extensions-core/approximate-histograms.md#fixed-buckets-histogram) exprs. The "probability" should be between 0 and 1 (exclusive). The `numBuckets`, `lowerLimit`, `upperLimit`, and `outlierHandlingMode` parameters are described in the fixed buckets histogram documentation. The [approximate histogram extension](../development/extensions-core/approximate-histograms.md) must be loaded to use this function.| +|`DS_QUANTILES_SKETCH(expr, [k])`|Creates a [Quantiles sketch](../development/extensions-core/datasketches-quantiles.md) on the values of expr, which can be a regular column or a column containing quantiles sketches. The `k` parameter is described in the Quantiles sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.| +|`BLOOM_FILTER(expr, numEntries)`|Computes a bloom filter from values produced by `expr`, with `numEntries` maximum number of distinct values before false positive rate increases. 
See [bloom filter extension](../development/extensions-core/bloom-filter.md) documentation for additional details.|
+|`TDIGEST_QUANTILE(expr, quantileFraction, [compression])`|Builds a T-Digest sketch on values produced by `expr` and returns the value for the quantile. Compression parameter (default value 100) determines the accuracy and size of the sketch. Higher compression means higher accuracy but more space to store sketches. See [t-digest extension](../development/extensions-contrib/tdigestsketch-quantiles.md) documentation for additional details.|
+|`TDIGEST_GENERATE_SKETCH(expr, [compression])`|Builds a T-Digest sketch on values produced by `expr`. Compression parameter (default value 100) determines the accuracy and size of the sketch. Higher compression means higher accuracy but more space to store sketches. See [t-digest extension](../development/extensions-contrib/tdigestsketch-quantiles.md) documentation for additional details.|
+|`VAR_POP(expr)`|Computes variance population of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|
+|`VAR_SAMP(expr)`|Computes variance sample of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|
+|`VARIANCE(expr)`|Computes variance sample of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|
+|`STDDEV_POP(expr)`|Computes standard deviation population of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|
+|`STDDEV_SAMP(expr)`|Computes standard deviation sample of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|
+|`STDDEV(expr)`|Computes standard deviation sample of `expr`.
See [stats extension](../development/extensions-core/stats.md) documentation for additional details.| |`EARLIEST(expr)`|Returns the earliest value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like a Druid datasource) then "earliest" is the value first encountered with the minimum overall timestamp of all values being aggregated. If `expr` does not come from a relation with a timestamp, then it is simply the first value encountered.| |`EARLIEST(expr, maxBytesPerString)`|Like `EARLIEST(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.| |`LATEST(expr)`|Returns the latest value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like a Druid datasource) then "latest" is the value last encountered with the maximum overall timestamp of all values being aggregated. If `expr` does not come from a relation with a timestamp, then it is simply the last value encountered.| @@ -338,7 +338,7 @@ Only the COUNT aggregation can accept DISTINCT. |`ANY_VALUE(expr)`|Returns any value of `expr` including null. `expr` must be numeric. This aggregator can simplify and optimize the performance by returning the first encountered value (including null)| |`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings. The `maxBytesPerString` parameter determines how much aggregation space to allocate per string. Strings longer than this limit will be truncated. This parameter should be set as low as possible, since high values will lead to wasted memory.| -For advice on choosing approximate aggregation functions, check out our [approximate aggregations documentation](aggregations.html#approx). 
+For advice on choosing approximate aggregation functions, check out our [approximate aggregations documentation](aggregations.md#approx). ## Scalar functions @@ -391,7 +391,7 @@ String functions accept strings, and return a type appropriate to the function. |`CHAR_LENGTH(expr)`|Synonym for `LENGTH`.| |`CHARACTER_LENGTH(expr)`|Synonym for `LENGTH`.| |`STRLEN(expr)`|Synonym for `LENGTH`.| -|`LOOKUP(expr, lookupName)`|Look up expr in a registered [query-time lookup table](lookups.html). Note that lookups can also be queried directly using the [`lookup` schema](#from).| +|`LOOKUP(expr, lookupName)`|Look up expr in a registered [query-time lookup table](lookups.md). Note that lookups can also be queried directly using the [`lookup` schema](#from).| |`LOWER(expr)`|Returns expr in all lowercase.| |`PARSE_LONG(string[, radix])`|Parses a string into a long (BIGINT) with the given radix, or 10 (decimal) if a radix is not provided.| |`POSITION(needle IN haystack [FROM fromIndex])`|Returns the index of needle within haystack, with indexes starting from 1. The search will begin at fromIndex, or 1 if fromIndex is not specified. If the needle is not found, returns 0.| @@ -515,8 +515,8 @@ These functions operate on expressions or columns that return sketch objects. #### HLL sketch functions -The following functions operate on [DataSketches HLL sketches](../development/extensions-core/datasketches-hll.html). -The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use the following functions. +The following functions operate on [DataSketches HLL sketches](../development/extensions-core/datasketches-hll.md). +The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use the following functions. 
|Function|Notes| |--------|-----| @@ -527,8 +527,8 @@ The [DataSketches extension](../development/extensions-core/datasketches-extensi #### Theta sketch functions -The following functions operate on [theta sketches](../development/extensions-core/datasketches-theta.html). -The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use the following functions. +The following functions operate on [theta sketches](../development/extensions-core/datasketches-theta.md). +The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use the following functions. |Function|Notes| |--------|-----| @@ -540,8 +540,8 @@ The [DataSketches extension](../development/extensions-core/datasketches-extensi #### Quantiles sketch functions -The following functions operate on [quantiles sketches](../development/extensions-core/datasketches-quantiles.html). -The [DataSketches extension](../development/extensions-core/datasketches-extension.html) must be loaded to use the following functions. +The following functions operate on [quantiles sketches](../development/extensions-core/datasketches-quantiles.md). +The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use the following functions. |Function|Notes| |--------|-----| @@ -562,7 +562,7 @@ The [DataSketches extension](../development/extensions-core/datasketches-extensi |`NULLIF(value1, value2)`|Returns NULL if value1 and value2 match, else returns value1.| |`COALESCE(value1, value2, ...)`|Returns the first value that is neither NULL nor empty string.| |`NVL(expr,expr-for-null)`|Returns 'expr-for-null' if 'expr' is null (or empty string for string type).| -|`BLOOM_FILTER_TEST(<expr>, <serialized-filter>)`|Returns true if the value is contained in a Base64-serialized bloom filter.
See the [Bloom filter extension](../development/extensions-core/bloom-filter.html) documentation for additional details.| +|`BLOOM_FILTER_TEST(<expr>, <serialized-filter>)`|Returns true if the value is contained in a Base64-serialized bloom filter. See the [Bloom filter extension](../development/extensions-core/bloom-filter.md) documentation for additional details.| ## Multi-value string functions @@ -698,20 +698,20 @@ enabling logging and running this query, we can see that it actually runs as the Druid SQL uses four different native query types. -- [Scan](scan-query.html) is used for queries that do not aggregate (no GROUP BY, no DISTINCT). +- [Scan](scan-query.md) is used for queries that do not aggregate (no GROUP BY, no DISTINCT). -- [Timeseries](timeseriesquery.html) is used for queries that GROUP BY `FLOOR(__time TO <unit>)` or `TIME_FLOOR(__time, +- [Timeseries](timeseriesquery.md) is used for queries that GROUP BY `FLOOR(__time TO <unit>)` or `TIME_FLOOR(__time, period)`, have no other grouping expressions, no HAVING or LIMIT clauses, no nesting, and either no ORDER BY, or an ORDER BY that orders by same expression as present in GROUP BY. It also uses Timeseries for "grand total" queries that have aggregation functions but no GROUP BY. This query type takes advantage of the fact that Druid segments are sorted by time. -- [TopN](topnquery.html) is used by default for queries that group by a single expression, do have ORDER BY and LIMIT +- [TopN](topnquery.md) is used by default for queries that group by a single expression, do have ORDER BY and LIMIT clauses, do not have HAVING clauses, and are not nested. However, the TopN query type will deliver approximate ranking and results in some cases; if you want to avoid this, set "useApproximateTopN" to "false". TopN results are always computed in memory. See the TopN documentation for more details. -- [GroupBy](groupbyquery.html) is used for all other aggregations, including any nested aggregation queries.
Druid's +- [GroupBy](groupbyquery.md) is used for all other aggregations, including any nested aggregation queries. Druid's GroupBy is a traditional aggregation engine: it delivers exact results and rankings and supports a wide variety of features. GroupBy aggregates in memory if it can, but it may spill to disk if it doesn't have enough memory to complete your query. Results are streamed back from data processes through the Broker if you ORDER BY the same expressions in your @@ -796,9 +796,9 @@ Druid does not support all SQL features. In particular, the following features a Additionally, some Druid native query features are not supported by the SQL language. Some unsupported Druid features include: -- [Inline datasources](datasource.html#inline). -- [Spatial filters](../development/geo.html). -- [Query cancellation](querying.html#query-cancellation). +- [Inline datasources](datasource.md#inline). +- [Spatial filters](../development/geo.md). +- [Query cancellation](querying.md#query-cancellation). - [Multi-value dimensions](#multi-value-strings) are only partially implemented in Druid SQL. There are known inconsistencies between their behavior in SQL queries and in native queries due to how they are currently treated by the SQL planner. @@ -895,7 +895,7 @@ will be a list of column names. For the `object` and `objectLines` formats, the keys are column names, and the values are null. Errors that occur before the response body is sent will be reported in JSON, with an HTTP 500 status code, in the -same format as [native Druid query errors](../querying/querying.html#query-errors). If an error occurs while the response body is +same format as [native Druid query errors](../querying/querying.md#query-errors). If an error occurs while the response body is being sent, at that point it is too late to change the HTTP status code or report a JSON error, so the response will simply end midstream and an error will be logged by the Druid server that was handling your request. 
@@ -959,7 +959,7 @@ final ResultSet resultSet = statement.executeQuery(); Druid SQL supports setting connection parameters on the client. The parameters in the table below affect SQL planning. All other context parameters you provide will be attached to Druid queries and can affect how they run. See -[Query context](query-context.html) for details on the possible options. +[Query context](query-context.md) for details on the possible options. ```java String url = "jdbc:avatica:remote:url=http://localhost:8082/druid/v2/sql/avatica/"; @@ -984,13 +984,13 @@ Connection context can be specified as JDBC connection properties or as a "conte |`sqlQueryId`|Unique identifier given to this SQL query. For HTTP client, it will be returned in `X-Druid-SQL-Query-Id` header.|auto-generated| |`sqlTimeZone`|Sets the time zone for this connection, which will affect how time functions and timestamp literals behave. Should be a time zone name like "America/Los_Angeles" or offset like "-08:00".|druid.sql.planner.sqlTimeZone on the Broker (default: UTC)| |`useApproximateCountDistinct`|Whether to use an approximate cardinality algorithm for `COUNT(DISTINCT foo)`.|druid.sql.planner.useApproximateCountDistinct on the Broker (default: true)| -|`useApproximateTopN`|Whether to use approximate [TopN queries](topnquery.html) when a SQL query could be expressed as such. If false, exact [GroupBy queries](groupbyquery.html) will be used instead.|druid.sql.planner.useApproximateTopN on the Broker (default: true)| +|`useApproximateTopN`|Whether to use approximate [TopN queries](topnquery.md) when a SQL query could be expressed as such. If false, exact [GroupBy queries](groupbyquery.md) will be used instead.|druid.sql.planner.useApproximateTopN on the Broker (default: true)| ## Metadata tables Druid Brokers infer table and column metadata for each datasource from segments loaded in the cluster, and use this to plan SQL queries. 
This metadata is cached on Broker startup and also updated periodically in the background through -[SegmentMetadata queries](segmentmetadataquery.html). Background metadata refreshing is triggered by +[SegmentMetadata queries](segmentmetadataquery.md). Background metadata refreshing is triggered by segments entering and exiting the cluster, and can also be throttled through configuration. Druid exposes system information through special system tables. There are two such schemas available: Information Schema and Sys Schema. @@ -1133,9 +1133,9 @@ Servers table lists all discovered servers in the cluster. |plaintext_port|LONG|Unsecured port of the server, or -1 if plaintext traffic is disabled| |tls_port|LONG|TLS port of the server, or -1 if TLS is disabled| |server_type|STRING|Type of Druid service. Possible values include: COORDINATOR, OVERLORD, BROKER, ROUTER, HISTORICAL, MIDDLE_MANAGER or PEON.| -|tier|STRING|Distribution tier see [druid.server.tier](../configuration/index.html#historical-general-configuration). Only valid for HISTORICAL type, for other types it's null| +|tier|STRING|Distribution tier see [druid.server.tier](../configuration/index.md#historical-general-configuration). Only valid for HISTORICAL type, for other types it's null| |current_size|LONG|Current size of segments in bytes on this server. Only valid for HISTORICAL type, for other types it's 0| -|max_size|LONG|Max size in bytes this server recommends to assign to segments see [druid.server.maxSize](../configuration/index.html#historical-general-configuration). Only valid for HISTORICAL type, for other types it's 0| +|max_size|LONG|Max size in bytes this server recommends to assign to segments see [druid.server.maxSize](../configuration/index.md#historical-general-configuration). 
Only valid for HISTORICAL type, for other types it's 0| To retrieve information about all servers, use the query: @@ -1168,13 +1168,13 @@ GROUP BY servers.server; #### TASKS table The tasks table provides information about active and recently-completed indexing tasks. For more information -check out the documentation for [ingestion tasks](../ingestion/tasks.html). +check out the documentation for [ingestion tasks](../ingestion/tasks.md). |Column|Type|Notes| |------|-----|-----| |task_id|STRING|Unique task identifier| |group_id|STRING|Task group ID for this task, the value depends on the task `type`. For example, for native index tasks, it's same as `task_id`, for sub tasks, this value is the parent task's ID| -|type|STRING|Task type, for example this value is "index" for indexing tasks. See [tasks-overview](../ingestion/tasks.html)| +|type|STRING|Task type, for example this value is "index" for indexing tasks. See [tasks-overview](../ingestion/tasks.md)| |datasource|STRING|Datasource name being indexed| |created_time|STRING|Timestamp in ISO8601 format corresponding to when the ingestion task was created. Note that this value is populated for completed and waiting tasks. For running and pending tasks this value is set to 1970-01-01T00:00:00Z| |queue_insertion_time|STRING|Timestamp in ISO8601 format corresponding to when this task was added to the queue on the Overlord| @@ -1200,8 +1200,8 @@ The supervisors table provides information about supervisors. |Column|Type|Notes| |------|-----|-----| |supervisor_id|STRING|Supervisor task identifier| -|state|STRING|Basic state of the supervisor. Available states: `UNHEALTHY_SUPERVISOR`, `UNHEALTHY_TASKS`, `PENDING`, `RUNNING`, `SUSPENDED`, `STOPPING`. Check [Kafka Docs](../development/extensions-core/kafka-ingestion.html#operations) for details.| -|detailed_state|STRING|Supervisor specific state. (See documentation of the specific supervisor for details, e.g. 
[Kafka](../development/extensions-core/kafka-ingestion.html) or [Kinesis](../development/extensions-core/kinesis-ingestion.html))| +|state|STRING|Basic state of the supervisor. Available states: `UNHEALTHY_SUPERVISOR`, `UNHEALTHY_TASKS`, `PENDING`, `RUNNING`, `SUSPENDED`, `STOPPING`. Check [Kafka Docs](../development/extensions-core/kafka-ingestion.md#operations) for details.| +|detailed_state|STRING|Supervisor specific state. (See documentation of the specific supervisor for details, e.g. [Kafka](../development/extensions-core/kafka-ingestion.md) or [Kinesis](../development/extensions-core/kinesis-ingestion.md))| |healthy|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 indicates a healthy supervisor| |type|STRING|Type of supervisor, e.g. `kafka`, `kinesis` or `materialized_view`| |source|STRING|Source of the supervisor, e.g. Kafka topic or Kinesis stream| @@ -1217,9 +1217,9 @@ SELECT * FROM sys.supervisors WHERE healthy=0; ## Server configuration Druid SQL planning occurs on the Broker and is configured by -[Broker runtime properties](../configuration/index.html#sql). +[Broker runtime properties](../configuration/index.md#sql). ## Security -Please see [Defining SQL permissions](../development/extensions-core/druid-basic-security.html#sql-permissions) in the +Please see [Defining SQL permissions](../development/extensions-core/druid-basic-security.md#sql-permissions) in the basic security documentation for information on what permissions are needed for making SQL queries. diff --git a/docs/querying/topnquery.md b/docs/querying/topnquery.md index ba542cf1dc3b..a5003ddb7963 100644 --- a/docs/querying/topnquery.md +++ b/docs/querying/topnquery.md @@ -159,10 +159,10 @@ topN queries can group on multi-value dimensions. When grouping on a multi-value from matching rows will be used to generate one group per value. It's possible for a query to return more groups than there are rows. 
For example, a topN on the dimension `tags` with filter `"t1" AND "t3"` would match only row1, and generate a result with three groups: `t1`, `t2`, and `t3`. If you only need to include values that match -your filter, you can use a [filtered dimensionSpec](dimensionspecs.html#filtered-dimensionspecs). This can also +your filter, you can use a [filtered dimensionSpec](dimensionspecs.md#filtered-dimensionspecs). This can also improve performance. -See [Multi-value dimensions](multi-value-dimensions.html) for more details. +See [Multi-value dimensions](multi-value-dimensions.md) for more details. ## Aliasing diff --git a/docs/tutorials/cluster.md b/docs/tutorials/cluster.md index bc64495ece77..74a7cf769e25 100644 --- a/docs/tutorials/cluster.md +++ b/docs/tutorials/cluster.md @@ -161,12 +161,12 @@ cd apache-druid-{{DRUIDVERSION}} In the package, you should find: * `LICENSE` and `NOTICE` files -* `bin/*` - scripts related to the [single-machine quickstart](index.html) +* `bin/*` - scripts related to the [single-machine quickstart](index.md) * `conf/druid/cluster/*` - template configurations for a clustered setup * `extensions/*` - core Druid extensions * `hadoop-dependencies/*` - Druid Hadoop dependencies * `lib/*` - libraries and dependencies for core Druid -* `quickstart/*` - files related to the [single-machine quickstart](index.html) +* `quickstart/*` - files related to the [single-machine quickstart](index.md) We'll be editing the files in `conf/druid/cluster/` in order to get things running. diff --git a/docs/tutorials/tutorial-batch-hadoop.md b/docs/tutorials/tutorial-batch-hadoop.md index bd02464e173a..47cd2d6bcbe5 100644 --- a/docs/tutorials/tutorial-batch-hadoop.md +++ b/docs/tutorials/tutorial-batch-hadoop.md @@ -27,8 +27,8 @@ sidebar_label: "Load from Apache Hadoop" This tutorial shows you how to load data files into Apache Druid using a remote Hadoop cluster. 
For this tutorial, we'll assume that you've already completed the previous -[batch ingestion tutorial](tutorial-batch.html) using Druid's native batch ingestion system and are using the -`micro-quickstart` single-machine configuration as described in the [quickstart](index.html). +[batch ingestion tutorial](tutorial-batch.md) using Druid's native batch ingestion system and are using the +`micro-quickstart` single-machine configuration as described in the [quickstart](index.md). ## Install Docker diff --git a/docs/tutorials/tutorial-compaction.md b/docs/tutorials/tutorial-compaction.md index 98052170cf71..05d5b724af9d 100644 --- a/docs/tutorials/tutorial-compaction.md +++ b/docs/tutorials/tutorial-compaction.md @@ -30,7 +30,7 @@ Because there is some per-segment memory and processing overhead, it can sometim Please check [Segment size optimization](../operations/segment-optimization.md) for details. For this tutorial, we'll assume you've already downloaded Apache Druid as described in -the [single-machine quickstart](index.html) and have it running on your local machine. +the [single-machine quickstart](index.md) and have it running on your local machine. It will also be helpful to have finished [Tutorial: Loading a file](../tutorials/tutorial-batch.md) and [Tutorial: Querying data](../tutorials/tutorial-query.md). diff --git a/docs/tutorials/tutorial-delete-data.md b/docs/tutorials/tutorial-delete-data.md index 4f08b0ebdaf3..ba2a6f3ee88f 100644 --- a/docs/tutorials/tutorial-delete-data.md +++ b/docs/tutorials/tutorial-delete-data.md @@ -27,7 +27,7 @@ sidebar_label: "Deleting data" This tutorial demonstrates how to delete existing data. For this tutorial, we'll assume you've already downloaded Apache Druid as described in -the [single-machine quickstart](index.html) and have it running on your local machine. +the [single-machine quickstart](index.md) and have it running on your local machine. 
## Load initial data @@ -39,7 +39,7 @@ Let's load this initial data: ```bash bin/post-index-task --file quickstart/tutorial/deletion-index.json --url http://localhost:8081 ``` -When the load finishes, open [http://localhost:8888/unified-console.html#datasources](http://localhost:8888/unified-console.html#datasources) in a browser. +When the load finishes, open [http://localhost:8888/unified-console.html#datasources](http://localhost:8888/unified-console.html#datasources) in a browser. ## How to permanently delete data diff --git a/docs/tutorials/tutorial-ingestion-spec.md b/docs/tutorials/tutorial-ingestion-spec.md index 773b920b34b2..821c376f637f 100644 --- a/docs/tutorials/tutorial-ingestion-spec.md +++ b/docs/tutorials/tutorial-ingestion-spec.md @@ -27,7 +27,7 @@ sidebar_label: "Writing an ingestion spec" This tutorial will guide the reader through the process of defining an ingestion spec, pointing out key considerations and guidelines. For this tutorial, we'll assume you've already downloaded Apache Druid as described in
For this tutorial, we'll assume you've already downloaded Druid as described in -the [quickstart](index.md) using the `micro-quickstart` single-machine configuration and have it running on your local machine. You don't need to have loaded any data yet. ## Download and start Kafka @@ -254,7 +254,7 @@ If the supervisor was successfully created, you will get a response containing t For more details about what's going on here, check out the [Druid Kafka indexing service documentation](../development/extensions-core/kafka-ingestion.md). -You can view the current supervisors and tasks in the Druid Console: [http://localhost:8888/unified-console.html#tasks](http://localhost:8888/unified-console.html#tasks). +You can view the current supervisors and tasks in the Druid Console: [http://localhost:8888/unified-console.html#tasks](http://localhost:8888/unified-console.html#tasks). ## Querying your data diff --git a/docs/tutorials/tutorial-retention.md b/docs/tutorials/tutorial-retention.md index e4ff42895273..d73c37c2b5b4 100644 --- a/docs/tutorials/tutorial-retention.md +++ b/docs/tutorials/tutorial-retention.md @@ -27,7 +27,7 @@ sidebar_label: "Configuring data retention" This tutorial demonstrates how to configure retention rules on a datasource to set the time intervals of data that will be retained or dropped. For this tutorial, we'll assume you've already downloaded Apache Druid as described in
diff --git a/docs/tutorials/tutorial-rollup.md b/docs/tutorials/tutorial-rollup.md index 79276176925c..19f68327db03 100644 --- a/docs/tutorials/tutorial-rollup.md +++ b/docs/tutorials/tutorial-rollup.md @@ -29,7 +29,7 @@ Apache Druid can summarize raw data at ingestion time using a process we refer t This tutorial will demonstrate the effects of roll-up on an example dataset. For this tutorial, we'll assume you've already downloaded Druid as described in -the [single-machine quickstart](index.html) and have it running on your local machine. +the [single-machine quickstart](index.md) and have it running on your local machine. It will also be helpful to have finished [Tutorial: Loading a file](../tutorials/tutorial-batch.md) and [Tutorial: Querying data](../tutorials/tutorial-query.md). diff --git a/docs/tutorials/tutorial-transform-spec.md b/docs/tutorials/tutorial-transform-spec.md index 35695de2497b..356efc597e6f 100644 --- a/docs/tutorials/tutorial-transform-spec.md +++ b/docs/tutorials/tutorial-transform-spec.md @@ -27,7 +27,7 @@ sidebar_label: "Transforming input data" This tutorial will demonstrate how to use transform specs to filter and transform input data during ingestion. For this tutorial, we'll assume you've already downloaded Apache Druid as described in -the [single-machine quickstart](index.html) and have it running on your local machine. +the [single-machine quickstart](index.md) and have it running on your local machine. It will also be helpful to have finished [Tutorial: Loading a file](../tutorials/tutorial-batch.md) and [Tutorial: Querying data](../tutorials/tutorial-query.md). diff --git a/docs/tutorials/tutorial-update-data.md b/docs/tutorials/tutorial-update-data.md index 804385028cfa..0280020ead45 100644 --- a/docs/tutorials/tutorial-update-data.md +++ b/docs/tutorials/tutorial-update-data.md @@ -27,7 +27,7 @@ sidebar_label: "Updating existing data" This tutorial demonstrates how to update existing data, showing both overwrites and appends. 
For this tutorial, we'll assume you've already downloaded Apache Druid as described in -the [single-machine quickstart](index.html) and have it running on your local machine. +the [single-machine quickstart](index.md) and have it running on your local machine. It will also be helpful to have finished [Tutorial: Loading a file](../tutorials/tutorial-batch.md), [Tutorial: Querying data](../tutorials/tutorial-query.md), and [Tutorial: Rollup](../tutorials/tutorial-rollup.md). From 136ed5bddaf416972f15a0fcebc6f8c3fd9fcd95 Mon Sep 17 00:00:00 2001 From: Steve Hetland Date: Thu, 22 Oct 2020 16:55:56 -0700 Subject: [PATCH 2/5] reverting local link --- docs/design/index.md | 2 +- docs/design/indexer.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/design/index.md b/docs/design/index.md index 3072ce397ff6..efa681b99dae 100644 --- a/docs/design/index.md +++ b/docs/design/index.md @@ -77,7 +77,7 @@ summarization partially pre-aggregates your data, and can lead to big costs savi ## When should I use Druid? Druid is used by many companies of various sizes for many different use cases. Check out the -[Powered by Apache Druid](https://druid.apache.org/druid-powered) page +[Powered by Apache Druid](/druid-powered) page Druid is likely a good choice if your use case fits a few of the following descriptors: diff --git a/docs/design/indexer.md b/docs/design/indexer.md index ea6594c7c9f5..9e1e9807888e 100644 --- a/docs/design/indexer.md +++ b/docs/design/indexer.md @@ -22,7 +22,7 @@ title: "Indexer Process" ~ under the License. --> -> The Indexer is an optional and [experimental](../../development/experimental) feature. +> The Indexer is an optional and [experimental](../../development/experimental.md) feature. > Its memory management system is still under development and will be significantly enhanced in later releases. The Apache Druid Indexer process is an alternative to the MiddleManager + Peon task execution system. 
Instead of forking a separate JVM process per-task, the Indexer runs tasks as separate threads within a single JVM process. @@ -91,4 +91,4 @@ Separate task logs are not currently supported when using the Indexer; all task The Indexer currently imposes an identical memory limit on each task. In later releases, the per-task memory limit will be removed and only the global limit will apply. The limit on concurrent merges will also be removed. -In later releases, per-task memory usage will be dynamically managed. Please see https://github.com/apache/druid/issues/7900 for details on future enhancements to the Indexer. \ No newline at end of file +In later releases, per-task memory usage will be dynamically managed. Please see https://github.com/apache/druid/issues/7900 for details on future enhancements to the Indexer. From a23fb5ab9f62e1eeca72d38bb913648d4d0ffa4d Mon Sep 17 00:00:00 2001 From: sthetland Date: Mon, 2 Nov 2020 14:32:29 -0800 Subject: [PATCH 3/5] Update indexer.md --- docs/design/indexer.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/design/indexer.md b/docs/design/indexer.md index 9e1e9807888e..e5930dece5fe 100644 --- a/docs/design/indexer.md +++ b/docs/design/indexer.md @@ -22,7 +22,7 @@ title: "Indexer Process" ~ under the License. --> -> The Indexer is an optional and [experimental](../../development/experimental.md) feature. +> The Indexer is an optional and [experimental](../development/experimental.md) feature. > Its memory management system is still under development and will be significantly enhanced in later releases. The Apache Druid Indexer process is an alternative to the MiddleManager + Peon task execution system. Instead of forking a separate JVM process per-task, the Indexer runs tasks as separate threads within a single JVM process. 
From 7db499163640673da3985eca67c50a3eeaf3983b Mon Sep 17 00:00:00 2001 From: Steve Hetland Date: Tue, 3 Nov 2020 10:49:18 -0800 Subject: [PATCH 4/5] link checking --- docs/configuration/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/configuration/index.md b/docs/configuration/index.md index 5893d3401f51..949131393710 100644 --- a/docs/configuration/index.md +++ b/docs/configuration/index.md @@ -893,7 +893,7 @@ These Overlord static configurations can be defined in the `overlord/runtime.pro |`druid.indexer.queue.restartDelay`|Sleep this long when Overlord queue management throws an exception before trying again.|PT30S| |`druid.indexer.queue.storageSyncRate`|Sync Overlord state this often with an underlying task persistence mechanism.|PT1M| -The following configs only apply if the Overlord is running in remote mode. For a description of local vs. remote mode, please see (../design/overlord.md). +The following configs only apply if the Overlord is running in remote mode. For a description of local vs. remote mode, see [Overlord Process](../design/overlord.md). |Property|Description|Default| |--------|-----------|-------| From 608b7a8202bdc2f1e36d08ec1d8d03a873c523ec Mon Sep 17 00:00:00 2001 From: sthetland Date: Wed, 11 Nov 2020 16:56:50 -0800 Subject: [PATCH 5/5] Fixing one more stale link for PostgreSQL --- docs/operations/high-availability.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/operations/high-availability.md b/docs/operations/high-availability.md index 240a44ca3639..4e8927f582f0 100644 --- a/docs/operations/high-availability.md +++ b/docs/operations/high-availability.md @@ -30,7 +30,7 @@ We recommend either installing ZooKeeper on its own hardware, or running 3 or 5 and configuring ZooKeeper on them appropriately. See the [ZooKeeper admin guide](https://zookeeper.apache.org/doc/current/zookeeperAdmin) for more details. 
- For highly-available metadata storage, we recommend MySQL or PostgreSQL with replication and failover enabled. See [MySQL HA/Scalability Guide](https://dev.mysql.com/doc/mysql-ha-scalability/en/) -and [PostgreSQL's High Availability, Load Balancing, and Replication](https://www.postgresql.org/docs/9.5/high-availability) for MySQL and PostgreSQL, respectively. +and [PostgreSQL's High Availability, Load Balancing, and Replication](https://www.postgresql.org/docs/current/high-availability.html) for MySQL and PostgreSQL, respectively. - For highly-available Apache Druid Coordinators and Overlords, we recommend to run multiple servers. If they are all configured to use the same ZooKeeper cluster and metadata storage, then they will automatically failover between each other as necessary.
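As a rough illustration of the shared configuration that automatic failover relies on, each Coordinator/Overlord host would point at the same ZooKeeper ensemble and metadata store in its `runtime.properties`. The host names and credentials below are placeholders, not recommendations:

```properties
# Same ZooKeeper ensemble on every Coordinator/Overlord host
druid.zk.service.host=zk1.example.com,zk2.example.com,zk3.example.com

# Same metadata storage on every host (requires the matching
# metadata-storage extension, e.g. postgresql-metadata-storage)
druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://db.example.com:5432/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=diurd
```

With identical settings like these, whichever Coordinator or Overlord wins leader election sees the same cluster state, so the others can take over transparently.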