Skip to content

Add support for HttpFirehose#4297

Merged
himanshug merged 3 commits intoapache:masterfrom
jihoonson:http-firehose
May 25, 2017
Merged

Add support for HttpFirehose#4297
himanshug merged 3 commits intoapache:masterfrom
jihoonson:http-firehose

Conversation

@jihoonson
Copy link
Copy Markdown
Contributor

@jihoonson jihoonson commented May 19, 2017

This change is Reviewable

@fjy
Copy link
Copy Markdown
Contributor

fjy commented May 19, 2017

👍

@Override
protected InputStream openObjectStream(URI object) throws IOException
{
return object.toURL().openConnection().getInputStream();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will there be duplication in the case when this returns an InputStream ... which is partially read and then there is a temporary network glitch and consequently an IOException. parent class does the retry (which starts reading data from beginning again) and ends up introducing duplicates ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be fine.

  1. If prefetch is enabled,
    1.1) PrefetchableTextFilesFirehose can fetch http objects in background. When next() is called, it first checks that there are already fetched files in local storage. If exists, it simply chooses a fetched file and returns a LineIterator for that file.
    1.2) If there is no fetched files in local storage but some objects are still remained to be read, PrefetchableTextFilesFirehose fetches one of objects in background immediately. If an IOException occurs while downloading the object, it retries up to the maximum retry count. Finally, PrefetchableTextFilesFirehose returns a LineIterator only when the download operation is successfully finished.
  2. If prefetch is disabled, PrefetchableTextFilesFirehose returns a LineIterator which directly reads the stream opened by openObjectStream(). If there is an IOException, it will throw it and the read will fail.

Does it make sense?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I see from PrefetchTextFilesFirehoseFactory.download(..) that whole contents are downloaded irrespective of cache/prefetch capacity bytes , which is fair to keep implementation simpler . it will be nice to document it in some way though.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I'll add a document about the behavior of PrefetchableTextFilesFirehose somewhere.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the behavior of PrefetchableTextFilesFirehose to its java doc.

Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 on code & design

Comment thread docs/content/ingestion/firehose.md Outdated
|property|description|default|
|--------|-----------|-------|
|maxCacheCapacityBytes|Maximum size of the cache space in bytes. 0 means disabling cache.|1073741824|
|maxFetchCapacityBytes|Maximum size of the fetch space in bytes. 0 means disabling prefetch.|1073741824|
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not clear why, as a user, i need to set 2 capacities. why would I not just have properties like...
long maxCacheCapacityBytes and boolean prefetchEnabled ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cache and fetch spaces are managed differently to achieve different goals. That is, cached files are not removed until the jvm terminates for future reads, while fetched files are removed when a connection on firehose is closed.

The goal of prefetching is to improve the throughput of the firehose, so maxFetchCapacityBytes should be tuned sometimes. For example, let me suppose an index task which reads data from an HttpFirehose and generates indexes.

  • If the indexing speed is so fast but http connection is so slow, reading from the firehose will be a bottleneck. In this case, increasing maxFetchCapacityBytes will be helpful.
  • Or, if the indexing speed is not bad but the amount of remaining disk is very little, decreasing maxFetchCapacityBytes might be helpful.

Maybe better to add a document about this?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah more documentation could help.

however, as a user I would know how much disk space I am willing to spend on this task and it would be super nice if system automatically decides on the division of how much to use for cache vs prefetch space.
currently user needs to do the tuning that can potentially be done by the system itself.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw: this is not a blocker but something to be thought about and considered before finalizing the current model.

Copy link
Copy Markdown
Contributor Author

@jihoonson jihoonson May 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. It would be great if we can make the PrefetchableTextFilesFirehose smarter in the future.

I added explanations on maxFetchCapacityBytes and maxCacheCapacityBytes.

Copy link
Copy Markdown
Member

@nishantmonu51 nishantmonu51 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@himanshug himanshug merged commit 11b7b1b into apache:master May 25, 2017
@fjy fjy added this to the 0.10.1 milestone May 25, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants