Add support for HttpFirehose#4297
Conversation
|
👍 |
| @Override | ||
| protected InputStream openObjectStream(URI object) throws IOException | ||
| { | ||
| return object.toURL().openConnection().getInputStream(); |
There was a problem hiding this comment.
will there be duplication in the case when this returns an InputStream ... which is partially read and then there is a temporary network glitch and consequently an IOException. parent class does the retry (which starts reading data from beginning again) and ends up introducing duplicates ?
There was a problem hiding this comment.
I think it should be fine.
- If prefetch is enabled,
1.1)PrefetchableTextFilesFirehosecan fetch http objects in background. Whennext()is called, it first checks that there are already fetched files in local storage. If exists, it simply chooses a fetched file and returns aLineIteratorfor that file.
1.2) If there is no fetched files in local storage but some objects are still remained to be read,PrefetchableTextFilesFirehosefetches one of objects in background immediately. If an IOException occurs while downloading the object, it retries up to the maximum retry count. Finally,PrefetchableTextFilesFirehosereturns aLineIteratoronly when the download operation is successfully finished. - If prefetch is disabled,
PrefetchableTextFilesFirehosereturns a LineIterator which directly reads the stream opened byopenObjectStream(). If there is an IOException, it will throw it and the read will fail.
Does it make sense?
There was a problem hiding this comment.
yeah, I see from PrefetchTextFilesFirehoseFactory.download(..) that whole contents are downloaded irrespective of cache/prefetch capacity bytes , which is fair to keep implementation simpler . it will be nice to document it in some way though.
There was a problem hiding this comment.
Thanks. I'll add a document about the behavior of PrefetchableTextFilesFirehose somewhere.
There was a problem hiding this comment.
I added the behavior of PrefetchableTextFilesFirehose to its java doc.
| |property|description|default| | ||
| |--------|-----------|-------| | ||
| |maxCacheCapacityBytes|Maximum size of the cache space in bytes. 0 means disabling cache.|1073741824| | ||
| |maxFetchCapacityBytes|Maximum size of the fetch space in bytes. 0 means disabling prefetch.|1073741824| |
There was a problem hiding this comment.
not clear why, as a user, i need to set 2 capacities. why would I not just have properties like...
long maxCacheCapacityBytes and boolean prefetchEnabled ?
There was a problem hiding this comment.
Cache and fetch spaces are managed differently to achieve different goals. That is, cached files are not removed until the jvm terminates for future reads, while fetched files are removed when a connection on firehose is closed.
The goal of prefetching is to improve the throughput of the firehose, so maxFetchCapacityBytes should be tuned sometimes. For example, let me suppose an index task which reads data from an HttpFirehose and generates indexes.
- If the indexing speed is so fast but http connection is so slow, reading from the firehose will be a bottleneck. In this case, increasing
maxFetchCapacityByteswill be helpful. - Or, if the indexing speed is not bad but the amount of remaining disk is very little, decreasing
maxFetchCapacityBytesmight be helpful.
Maybe better to add a document about this?
There was a problem hiding this comment.
yeah more documentation could help.
however, as a user I would know how much disk space I am willing to spend on this task and it would be super nice if system automatically decides on the division of how much to use for cache vs prefetch space.
currently user needs to do the tuning that can potentially be done by the system itself.
There was a problem hiding this comment.
btw: this is not a blocker but something to be thought about and considered before finalizing the current model.
There was a problem hiding this comment.
Sounds good. It would be great if we can make the PrefetchableTextFilesFirehose smarter in the future.
I added explanations on maxFetchCapacityBytes and maxCacheCapacityBytes.
This change is