Skip to content

Support inputFormat and inputSource for sampler#8901

Merged
gianm merged 13 commits intoapache:masterfrom
jihoonson:input-format-sampler
Nov 20, 2019
Merged

Support inputFormat and inputSource for sampler#8901
gianm merged 13 commits intoapache:masterfrom
jihoonson:input-format-sampler

Conversation

@jihoonson
Copy link
Copy Markdown
Contributor

@jihoonson jihoonson commented Nov 18, 2019

Description

This PR is to add support of the new interfaces (InputSource and InputFormat) introduced in #8823 for sampler. The key changes are:

  • Renamed FirehoseSampler -> InputSourceSampler.
  • IndexTaskSamplerSpec now supports both old (parser and firehose) and new (inputFormat and inputSource) APIs. You cannot mix them though.
  • SeekableStreamSamplerSpec now supports both parser and inputFormat.
  • To give the same experience with both sampler and Kafka/Kinesis indexing service, KIS now supports both parser and inputFormat. The supervisor will always internally fill parser if it was set in the supervisor spec to create sub tasks. Since the sub task understands both new and old APIs, it should work during the rolling update.
  • SamplerCache is removed now. This is because it caches the raw byte array of each row which makes sense for row-oriented file formats.
  • SamplerResponse now contains a Map of raw data instead of byte[]. This map will be serialized into JSON and sent to the caller like UI.
  • The behavior of SeekableStreamSamplerFirehose has changed. It returned an InputRow of null if it couldn't fetch any data within POLL_TIMEOUT_MS. Now it waits until it fetches any data.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths.
  • added integration tests.
  • been tested in a test Druid cluster.
    • Been tested during rolling update by running 1) old overlord + new middleManager and 2) new overlord + old middleManager.


This change is Reviewable

@jihoonson
Copy link
Copy Markdown
Contributor Author

Travis is failing because of #8779.

null,
new KafkaSupervisorIOConfig(
TOPIC,
new JsonInputFormat(new JSONPathSpec(true, ImmutableList.of()), ImmutableMap.of()),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this, and other similar instantiations could just use JSONPathSpec.DEFAULT I think

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed.

public T next()
{
if (!hasNext()) {
throw new NoSuchElementException();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be kind of ugly if the transforming reader that will be wrapping this iterator from the sampler calls hasNext but this iterator is closed before it calls next. In practice I think this should be ok and handled by ExceptionThrowingIterator of CloseableIterator, but I wonder if there is a more graceful, less exception throwing way to handle this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good point. I fixed to cache the next item in hasNext().

Copy link
Copy Markdown
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm

* Caching the next item.
* Not volatile since {@link #hasNext()} and {@link #next()} are supposed to be called by the same thread.
*/
T next;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be slightly simplified you just check if next is null instead of validNext?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. next is never null here. Tidied up.

);
try (final CloseableIterator<InputRowListPlusRawValues> iterator = reader.sample();
final IncrementalIndex<Aggregator> index = buildIncrementalIndex(nonNullSamplerConfig, nonNullDataSchema);
final Closer closer1 = closer) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This with the closer seems kind of sad, but no great solutions since would be sort of funny to register the tmp dir inside the try after already using it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but we can tidy up later once we move to Java 11 (https://docs.oracle.com/javase/specs/jls/se9/html/jls-14.html#jls-14.20.3).

@fjy
Copy link
Copy Markdown
Contributor

fjy commented Nov 20, 2019

@jihoonson this has conflicts

@jihoonson
Copy link
Copy Markdown
Contributor Author

@fjy thanks, resolved conflicts now.

@gianm gianm added this to the 0.17.0 milestone Nov 20, 2019
@gianm gianm merged commit ac6d703 into apache:master Nov 20, 2019
@jihoonson jihoonson deleted the input-format-sampler branch November 21, 2019 02:12
jon-wei pushed a commit to jon-wei/druid that referenced this pull request Nov 26, 2019
* Support inputFormat and inputSource for sampler

* Cleanup javadocs and names

* fix style

* fix timed shutoff input source reader

* fix timed shutoff input source reader again

* tidy up timed shutoff reader

* unused imports

* fix tc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants