Skip to content

Update doc for allowedHeaders#17045

Merged
abhishekagarwal87 merged 49 commits intoapache:masterfrom
pranavbhole:updateDocForAllowedHeaders
Sep 19, 2024
Merged

Update doc for allowedHeaders#17045
abhishekagarwal87 merged 49 commits intoapache:masterfrom
pranavbhole:updateDocForAllowedHeaders

Conversation

@pranavbhole
Copy link
Copy Markdown
Contributor

Update doc for: #16974

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Comment thread docs/configuration/index.md Outdated
|Property| Possible values | Description |Default|
|--------|-----------------|-------------------------------------------------------------------------------------------------------------------------------------------|-------|
|`druid.ingestion.http.allowedProtocols`| List of protocols | Allowed protocols for the HTTP input source. |`["http", "https"]`|
|`druid.ingestion.http.allowedHeaders`| A list of permitted request headers for the HTTP input source. By default, the list is empty, which means all headers are allowed in the ingestion specification. |`[]`|
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when the list is empty, no header is allowed, correct?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default value for druid.ingestion.http.allowedHeaders is empty list which means all headers are allowed.

Copy link
Copy Markdown
Contributor Author

@pranavbhole pranavbhole Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we want to restrict headers then druid.ingestion.http.allowedHeaders should be non empty.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made changes to modify behavior. Default value for druid.ingestion.http.allowedHeaders is empty which allows no headers to pass.

Comment on lines +104 to +106
if (config.getAllowedHeaders().size() == 0 && requestHeaders.size() > 0) {
message = "You can set the property druid.ingestion.http.allowedHeaders in middle managers or peons to whitelist request headers";
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this block can be removed while keeping the functionality just the same as before.

Comment thread processing/src/main/java/org/apache/druid/data/input/impl/HttpInputSource.java Outdated
pranavbhole and others added 2 commits September 12, 2024 22:46
…InputSource.java

Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
abhishekrb19 and others added 20 commits September 17, 2024 16:14
…k layer (apache#17027)

Tasks control the loading of broadcast datasources via BroadcastDatasourceLoadingSpec getBroadcastDatasourceLoadingSpec(). By default, tasks download all broadcast datasources, unless there's an override as with kill and MSQ controller task.

The CLIPeon command line option --loadBroadcastSegments is deprecated in favor of --loadBroadcastDatasourceMode.

Broadcast datasources can be specified in SQL queries through JOIN and FROM clauses, or obtained from other sources such as lookups.To this effect, we have introduced a BroadcastDatasourceLoadingSpec. Finding the set of broadcast datasources during SQL planning will be done in a follow-up, which will apply only to MSQ tasks, so they load only required broadcast datasources. This PR primarily focuses on the skeletal changes around BroadcastDatasourceLoadingSpec and integrating it from the Task interface via CliPeon to SegmentBootstrapper.

Currently, only kill tasks and MSQ controller tasks skip loading broadcast datasources.
apache#17054)

A command line arg -XX:OnOutOfMemoryError='chmod 644 ${project.parent.basedir}/target/*.hprof' was added to collect heap dumps: apache#17029

This arg is causing problems when running tests from Intellij. Intellij doesn't seem to likechmod 644, but this command works as expected in mvn. So as a workaround, add the chmod 644 ${BASE_DIR/target/*.hprof' command in a shell script that can then be executed when OnOutOfMemoryError happens to make Intellij happy.
…cenarios (apache#17026)

* Add window function drill tests for array_concat_agg for empty over scenarios

* Cleanup sqlNativeIncompatible() as it's not needed now

* Address review comment
This PR apache#16890 introduced a change to skip adding tombstone segments to the cache.
It turns out that as a side effect tombstone segments appear unavailable in the console. This happens because availability of a segment in Broker is determined from the metadata cache.

The fix is to keep the segment in the metadata cache but skip them from refresh.

This doesn't affect any functionality as metadata query for tombstone returns empty causing continuous refresh of those segments.
…gated data (apache#17058)

changes:
* CursorHolder.isPreAggregated method indicates that a cursor has pre-aggregated data for all AggregatorFactory specified in a CursorBuildSpec. If true, engines should rewrite the query to use AggregatorFactory.getCombiningAggreggator, and column selector factories will provide selectors with the aggregator interediate type for the aggregator factory name
* Added groupby, timeseries, and topN support for CursorHolder.isPreAggregated
* Added synthetic test since no CursorHolder implementations support isPreAggregated at this point in time
This removes the need to read it from the query context.
1) ControllerQueryKernel: Update readyToReadResults to acknowledge that sorting stages can
   go directly from READING_INPUT to RESULTS_READY.

2) WorkerStageKernel: Ignore RESULTS_COMPLETE if work is already finished, which can happen
   if the transition to FINISHED comes early due to a downstream LIMIT.
As we move towards multi-threaded MSQ workers, it helps for parallelism
to generate more than one partition per worker. That way, we can fully
utilize all worker threads throughout all stages.

The default value is the number of processing threads. Currently, this
is hard-coded to 1 for peons, but that is expected to change in the future.
…pache#17061)

The prior formatting was inconsistent in terms of punctuation and
capitalization.
* QueryResource: Don't close JSON content on error.

Following similar issues fixed in apache#11685 and apache#15880, this patch fixes
a bug where QueryResource would write a closing array marker if it
encountered an exception after starting to push results. This makes it
difficult for callers to detect errors.

The prior patches didn't catch this problem because QueryResource uses
the ObjectMapper in a unique way, through writeValuesAsArray, which
doesn't respect the global AUTO_CLOSE_JSON_CONTENT setting.

* Fix usage of customized ObjectMappers.
* MSQ: Rework memory management.

This patch reworks memory management to better support multi-threaded
workers running in shared JVMs. There are two main changes.

First, processing buffers and threads are moved from a per-JVM model to
a per-worker model. This enables queries to hold processing buffers
without blocking other concurrently-running queries. Changes:

- Introduce ProcessingBuffersSet and ProcessingBuffers to hold the
  per-worker and per-work-order processing buffers (respectively). On Peons,
  this is the JVM-wide processing pool. On Indexers, this is a per-worker
  pool of on-heap buffers. (This change fixes a bug on Indexers where
  excessive processing buffers could be used if MSQ tasks ran concurrently
  with realtime tasks.)

- Add "bufferPool" argument to GroupingEngine#process so a per-worker pool
  can be passed in.

- Add "druid.msq.task.memory.maxThreads" property, which controls the
  maximum number of processing threads to use per task. This allows usage of
  multiple processing buffers per task if admins desire.

- IndexerWorkerContext acquires processingBuffers when creating the FrameContext
  for a work order, and releases them when closing the FrameContext.

- Add "usesProcessingBuffers()" to FrameProcessorFactory so workers know
  how many sets of processing buffers are needed to run a given query.

Second, adjustments to how WorkerMemoryParameters slices up bundles, to
favor more memory for sorting and segment generation. Changes:

- Instead of using same-sized bundles for processing and for sorting,
  workers now use minimally-sized processing bundles (just enough to read
  inputs plus a little overhead). The rest is devoted to broadcast data
  buffering, sorting, and segment-building.

- Segment-building is now limited to 1 concurrent segment per work order.
  This allows each segment-building action to use more memory. Note that
  segment-building is internally multi-threaded to a degree. (Build and
  persist can run concurrently.)

- Simplify frame size calculations by removing the distinction between
  "standard" and "large" frames. The new default frame size is the same
  as the old "standard" frames, 1 MB. The original goal of of the large
  frames was to reduce the number of temporary files during sorting, but
  I think we can achieve the same thing by simply merging a larger number
  of standard frames at once.

- Remove the small worker adjustment that was added in apache#14117 to account
  for an extra frame involved in writing to durable storage. Instead,
  account for the extra frame whenever we are actually using durable storage.

- Cap super-sorter parallelism using the number of output partitions, rather
  than using a hard coded cap at 4. Note that in practice, so far, this cap
  has not been relevant for tasks because they have only been using a single
  processing thread anyway.

* Remove unused import.

* Fix errorprone annotation.

* Fixes for javadocs and inspections.

* Additional test coverage.

* Fix test.
* MSQ: Improved worker cancellation.

Four changes:

1) FrameProcessorExecutor now requires that cancellationIds be registered
   with "registerCancellationId" prior to being used in "runFully" or "runAllFully".

2) FrameProcessorExecutor gains an "asExecutor" method, which allows that
   executor to be used as an executor for future callbacks in such a way
   that respects cancellationId.

3) RunWorkOrder gains a "stop" method, which cancels the current
   cancellationId and closes the current FrameContext. It blocks until
   both operations are complete.

4) Fixes a bug in RunAllFullyWidget where "processorManager.result()" was
   called outside "runAllFullyLock", which could cause it to be called
   out-of-order with "cleanup()" in case of cancellation or other error.

Together, these changes help ensure cancellation does not have races.
Once "cancel" is called for a given cancellationId, all existing processors
and running callbacks are canceled and exit in an orderly manner. Future
processors and callbacks with the same cancellationId are rejected
before being executed.

* Fix test.

* Use execute, which doesn't return, to avoid errorprone complaints.

* Fix some style stuff.

* Further enhancements.

* Fix style.
…apache#17052)

* BaseWorkerClientImpl: Don't attempt to recover from a closed channel.

This patch introduces an exception type "ChannelClosedForWritesException",
which allows the BaseWorkerClientImpl to avoid retrying when the local
channel has been closed. This can happen in cases of cancellation.

* Add some test coverage.

* wip

* Add test coverage.

* Style.
* add DataSchema.Builder to tidy stuff up a bit

* fixes

* fixes

* more style fixes

* review stuff
…nt "views" of the data based on the cursor build spec (apache#17064)

* abstract `IncrementalIndex` cursor stuff to prepare to allow for possibility of using different "views" of the data based on the cursor build spec
changes:
* introduce `IncrementalIndexRowSelector` interface to capture how `IncrementalIndexCursor` and `IncrementalIndexColumnSelectorFactory` read data
* `IncrementalIndex` implements `IncrementalIndexRowSelector`
* move `FactsHolder` interface to separate file
* other minor refactorings
* Speed up FrameFileTest, SuperSorterTest.

These are two heavily parameterized tests that, together, account for
about 60% of runtime in the test suite.

FrameFileTest changes:

1) Cache frame files in a static, rather than building the frame file
   for each parameterization of the test.

2) Adjust TestArrayCursorFactory to cache the signature, rather than
   re-creating it on each call to getColumnCapabilities.

SuperSorterTest changes:

1) Dramatically reduce the number of tests that run with
   "maxRowsPerFrame" = 1. These are particularly slow due to writing so
   many small files. Some still run, since it's useful to test edge cases,
   but much fewer than before.

2) Reduce the "maxActiveProcessors" axis of the test from [1, 2, 4] to
   [1, 3]. The aim is to reduce the number of cases while still getting
   good coverage of the feature.

3) Reduce the "maxChannelsPerProcessor" axis of the test from [2, 3, 8]
   to [2, 7]. The aim is to reduce the number of cases while still getting
   good coverage of the feature.

4) Use in-memory input channels rather than file channels.

5) Defer formatting of assertion failure messages until they are needed.

6) Cache the cursor factory and its signature in a static.

7) Cache sorted test rows (used for verification) in a static.

* It helps to include the file.

* Style.
* Remove some unnecessary JoinableFactoryWrappers.

* Remove unused import.
gianm and others added 16 commits September 17, 2024 16:14
It didn't do anything and also wasn't called.
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Window queries now acknowledge maxSubqueryBytes.
… group by (apache#16658)

Register a Ser-De for RowsAndColumns so that the window operator query running on leaf operators would be transferred properly on the wire. Would fix the empty response given by window queries without group by on the native engine.
* MSQ: Include worker context maps in WorkOrders.

This provides a mechanism to send contexts to workers in long-lived,
shared JVMs that are not part of the task system.

* Style, coverage.
* Remove workerId parameter from postWorkerError.

It was redundant to MSQErrorReport#getTaskId.

* Fix javadoc.
…7074)

* TableInputSpecSlicer changes to support running on Brokers.

Changes:

1) Rename TableInputSpecSlicer to IndexerTableInputSpecSlicer, in anticipation
   of a new implementation being added for controllers running on Brokers.

2) Allow the context to use the WorkerManager to build the TableInputSpecSlicer,
   in anticipation of Brokers wanting to use this to assign segments to servers
   that are already serving those segments.

3) Remove unused DataSegmentTimelineView interface.

4) Add additional javadoc to DataSegmentProvider.

* Style.
…apache#17069)

* MSQ: Properly report errors that occur when starting up RunWorkOrder.

In apache#17046, an exception thrown by RunWorkOrder#startAsync would be ignored
and replaced with a generic CanceledFault. This patch fixes it by retaining
the original error.
Previously, the processor used "remainingChannels" to track the number of
non-null entries of currentFrame. Now, "remainingChannels" tracks the
number of channels that are unfinished.

The difference is subtle. In the previous code, when an input channel
was blocked upon exiting nextFrame(), the "currentFrames" entry would be
null, and therefore the "remainingChannels" variable would be decremented.
After the next await and call to populateCurrentFramesAndTournamentTree(),
"remainingChannels" would be incremented if the channel had become
unblocked after awaiting.

This means that finished(), which returned true if remainingChannels was
zero, would not be reliable if called between nextFrame() and the
next await + populateCurrentFramesAndTournamentTree().

This patch changes things such that finished() is always reliable. This
fixes a regression introduced in PR apache#16911, which added a call to
finished() that was, at that time, unsafe.
* MSQ: Add QueryKitSpec to encapsulate QueryKit params.

This patch introduces QueryKitSpec, an object that encapsulates the
parameters to makeQueryDefinition that are consistent from call to
call. This simplifies things because we avoid passing around all the
components individually.

This patch also splits "maxWorkerCount" into "maxLeafWorkerCount" and
"maxNonLeafWorkerCount", which apply to leaf stages (no other stages as
inputs) and nonleaf stages respectively.

Finally, this patch also rovides a way for ControllerContext to supply a
QueryKitSpec to its liking. It is expected that this will be used by
controllers of quick interactive queries to set maxNonLeafWorkerCount = 1,
which will generate fanning-in query plans.

* Fix javadoc.
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
@abhishekagarwal87 abhishekagarwal87 merged commit d1bd6a8 into apache:master Sep 19, 2024
@adarshsanjeev adarshsanjeev added this to the 32.0.0 milestone Jan 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.