Skip to content

Fix races in LookupSnapshotTaker, CoordinatorPollingBasicAuthenticatorCacheManager#5344

Merged
b-slim merged 2 commits intoapache:masterfrom
gianm:fix-lookups-snap-taker
Feb 6, 2018
Merged

Fix races in LookupSnapshotTaker, CoordinatorPollingBasicAuthenticatorCacheManager#5344
b-slim merged 2 commits intoapache:masterfrom
gianm:fix-lookups-snap-taker

Conversation

@gianm
Copy link
Copy Markdown
Contributor

@gianm gianm commented Feb 5, 2018

Both were susceptible to the following conditions:

  1. Two JVMs on the same machine (perhaps two peons) could conflict by one reading while the
    other was writing, or by writing to the file at the same time.
  2. One JVM could partially write a file, then crash, leaving a truncated file.

…rCacheManager.

Both were susceptible to the following conditions:

1. Two JVMs on the same machine (perhaps two peons) could conflict by one reading while the
   other was writing, or by writing to the file at the same time.
2. One JVM could partially write a file, then crash, leaving a truncated file.
@gianm gianm added the Bug label Feb 5, 2018
*
* This method is not just thread-safe, but is also safe to use from multiple processes on the same machine.
*/
public static void writeAtomically(final File file, OutputStreamConsumer f) throws IOException
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is no such utility already in Apache Commons IO, Guava, or JDK itself?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked (briefly) but did not find one.

@himanshug
Copy link
Copy Markdown
Contributor

👍 for this PR.

some extra thoughts:
In newer versions of Druid code ... all lookup nodes check with coordinator at startup on what lookups they are supposed to load and lookup snapshots are not as important anymore. they were added originally so that historicals at startup know what lookups to load without waiting for coordinator to tell them later.
only major utility of snapshots at this point is for the corner case where historical process is starting but coordinator is down/non-responding and in that case historical falls back to locally persisted snapshot. however, I'm not sure if that is important enough to justify existence of lookup snapshots feature.

@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Feb 5, 2018

only major utility of snapshots at this point is for the corner case where historical process is starting but coordinator is down/non-responding and in that case historical falls back to locally persisted snapshot. however, I'm not sure if that is important enough to justify existence of lookup snapshots feature.

I think it is still important. I have a couple of customers where lookups are critical and in that case, coordinator downtime is not a good excuse for them not being available. (Especially since sometimes rolling upgrade documentation says that all coordinators must be taken down at once)

In fact, to help them out more we may also be looking to add more aggressive on disk caching features, so a server can restore all of its lookup data on boot (not just configuration) without access to any external services.

@himanshug
Copy link
Copy Markdown
Contributor

@gianm thanks for clarifying the usefulness of snapshot of configuration, let us keep it then.

In fact, to help them out more we may also be looking to add more aggressive on disk caching features, so a server can restore all of its lookup data on boot (not just configuration) without access to any external services.

We do that with some of our lookups which are custom extensions and use disk backed caches. In fact, to support improved cleanup of persisted data, I had added #5287 .

glad to know that lookups are getting more adoption :)

@b-slim b-slim merged commit 8c738c7 into apache:master Feb 6, 2018
gianm added a commit to implydata/druid-public that referenced this pull request Feb 6, 2018
…rCacheManager (apache#5344)

* Fix races in LookupSnapshotTaker, CoordinatorPollingBasicAuthenticatorCacheManager.

Both were susceptible to the following conditions:

1. Two JVMs on the same machine (perhaps two peons) could conflict by one reading while the
   other was writing, or by writing to the file at the same time.
2. One JVM could partially write a file, then crash, leaving a truncated file.

* Use StringUtils.format
@himanshug
Copy link
Copy Markdown
Contributor

@gianm while thinking more about it, Druid processes should have different snapshotDirectory configured so that there is never an actual conflict ..as two processes might have different lookup configuration and you wouldn't want one to override another.
for peon processes, I think , snapshotDirectory should be some place within their task specific directories.

@gianm gianm deleted the fix-lookups-snap-taker branch February 6, 2018 18:49
@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Feb 6, 2018

@himanshug,

for peon processes, I think , snapshotDirectory should be some place within their task specific directories.

But this would defeat the purpose of the snapshot directory, since it means peons cannot start up properly if the coordinator is unavailable.

What makes the lookup configuration different between different nodes? Is it just the tier? If so- maybe we should have one machine-wide lookup snapshot per tier?

@himanshug
Copy link
Copy Markdown
Contributor

@gianm yes , they can be different for different lookup tier and "technically", different peons can have different lookup tier (e.g. overriding druid properties for peon via task json).

so we either would eventually need to force all peons to be in same lookup tier, this is probably alright as I haven't seen any one exploiting the system to have different lookup tier for different peons or else each of them need their own snapshot directory.

third option of course is to leave things in current state and wait for users hitting specific limitations and then evolve things based on that feedback :)

gianm added a commit to gianm/druid that referenced this pull request Feb 6, 2018
Similar to apache#5344 but for the authorizer instead of the authenticator.
@gianm gianm added this to the 0.12.0 milestone Feb 6, 2018
@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Feb 6, 2018

@himanshug raised a follow up with separate files for each lookup tier: #5358

jon-wei pushed a commit to jon-wei/druid that referenced this pull request Feb 7, 2018
…rCacheManager (apache#5344)

* Fix races in LookupSnapshotTaker, CoordinatorPollingBasicAuthenticatorCacheManager.

Both were susceptible to the following conditions:

1. Two JVMs on the same machine (perhaps two peons) could conflict by one reading while the
   other was writing, or by writing to the file at the same time.
2. One JVM could partially write a file, then crash, leaving a truncated file.

* Use StringUtils.format
jon-wei pushed a commit that referenced this pull request Feb 7, 2018
Similar to #5344 but for the authorizer instead of the authenticator.
jon-wei added a commit that referenced this pull request Feb 7, 2018
…rCacheManager (#5344) (#5360)

* Fix races in LookupSnapshotTaker, CoordinatorPollingBasicAuthenticatorCacheManager.

Both were susceptible to the following conditions:

1. Two JVMs on the same machine (perhaps two peons) could conflict by one reading while the
   other was writing, or by writing to the file at the same time.
2. One JVM could partially write a file, then crash, leaving a truncated file.

* Use StringUtils.format
jon-wei pushed a commit to jon-wei/druid that referenced this pull request Feb 7, 2018
Similar to apache#5344 but for the authorizer instead of the authenticator.
jon-wei added a commit that referenced this pull request Feb 7, 2018
…5361)

Similar to #5344 but for the authorizer instead of the authenticator.
jon-wei added a commit to implydata/druid-public that referenced this pull request Feb 7, 2018
…) (apache#5361)

Similar to apache#5344 but for the authorizer instead of the authenticator.
Copy link
Copy Markdown

@crodier crodier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I try to be more specific here ty vm again.


private static OutputStream uncloseable(final OutputStream out) throws IOException
{
return new FilterOutputStream(out)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In unit tests locally with 15mb files, using a BufferedOutputStream here, takes 1 second vs. ~16 seconds. FilterOutputStream isn't buffered, is too slow for practical use?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants