Skip to content

KAFKA-6878 NPE when querying global state store not in READY state#4978

Merged
guozhangwang merged 3 commits intoapache:trunkfrom
tedyu:trunk
May 9, 2018
Merged

KAFKA-6878 NPE when querying global state store not in READY state#4978
guozhangwang merged 3 commits intoapache:trunkfrom
tedyu:trunk

Conversation

@tedyu
Copy link
Copy Markdown
Contributor

@tedyu tedyu commented May 8, 2018

Check whether cache is null before retrieving from cache.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@mjsax mjsax requested review from guozhangwang and mjsax May 8, 2018 17:27
@mjsax mjsax added the streams label May 8, 2018
@mjsax
Copy link
Copy Markdown
Member

mjsax commented May 8, 2018

Call for review @bbejeck @vvcephei

@guozhangwang
Copy link
Copy Markdown
Contributor

guozhangwang commented May 8, 2018

Should we consider two other CachingWindowed and CachingSessioned store classes as well?

A meta comment: instead of checking on cache object not null, should we consider including the init function to be protected by the write lock as well (@tedyu you improved on the locking mechanism here so I'd like to have your feedback)

@tedyu
Copy link
Copy Markdown
Contributor Author

tedyu commented May 8, 2018

I can apply similar change to CachingWindowStore and CachingSessionStore if people think the check is beneficial.

w.r.t. additional locking, I tend to leave that to future JIRA. Additional locking increases the complexity of the code.
Also, with current change, client code can detect the completion of initialization, making the additional locking unnecessary (at least for now).

@tedyu
Copy link
Copy Markdown
Contributor Author

tedyu commented May 8, 2018

With additional locking, client query may take arbitrarily long time.
This is change in behavior which I hesitate in introducing.

@bbejeck
Copy link
Copy Markdown
Member

bbejeck commented May 8, 2018

Java 8 error caused by out of memory.

retest this please.

Copy link
Copy Markdown
Member

@bbejeck bbejeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tedyu, overall LGTM
+1 for adding CachingWindowStore and CachingSessionStore
Also, can we add a test to verify behavior?

@tedyu
Copy link
Copy Markdown
Contributor Author

tedyu commented May 8, 2018

Looking at CachingKeyValueStoreTest, store.init is called in setUp method.

Copy link
Copy Markdown
Contributor

@guozhangwang guozhangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, with current change, client code can detect the completion of initialization, making the additional locking unnecessary (at least for now).

I'm a bit unclear how the client can detect if the initialization is completed? If get returns null it could mean 1) init has done, but the key / window / session does not exist, 2) init is not done. How to distinguish these two cases?

validateStoreOpen();
final Bytes bytesKey = WindowKeySchema.toStoreKeyBinary(key, timestamp, 0);
final Bytes cacheKey = cacheFunction.cacheKey(bytesKey);
if (cache == null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In CachingSessionStore#findSessions, we could also check if cache is null.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at CachingSessionStore#findSessions, an Iterator is returned.

        final ThreadCache.MemoryLRUCacheBytesIterator cacheIterator = cache.range(cacheName, cacheKeyFrom, cacheKeyTo);

When cache is null, should a special MemoryLRUCacheBytesIterator be returned ?

@tedyu
Copy link
Copy Markdown
Contributor Author

tedyu commented May 9, 2018

Updated change in CachingKeyValueStore so that we rely on underlying for query when cache is null.

@tedyu
Copy link
Copy Markdown
Contributor Author

tedyu commented May 9, 2018

With current change, client is not concerned with completion of initialization.
If initialization hasn't completed, the query would take a bit longer.

@guozhangwang
Copy link
Copy Markdown
Contributor

Thanks @tedyu , the change lgtm. Merging to trunk.

@guozhangwang guozhangwang merged commit e32dcb9 into apache:trunk May 9, 2018
guozhangwang pushed a commit that referenced this pull request May 9, 2018
…4978)

Check whether cache is null before retrieving from cache.

Reviewers: Guozhang Wang <guozhang@confluent.io>, Bill Bejeck <bill@confluent.io>
@guozhangwang
Copy link
Copy Markdown
Contributor

Also cherry-picked to 1.1

@tedyu I realized that in CachingWindowStore there are a couple of other fetch functions that need this safety check as well:

fetch(final Bytes key, final long timeFrom, final long timeTo)

fetch(final Bytes from, final Bytes to, final long timeFrom, final long timeTo)

If the cache is not available, we could skip returning a wrapper iterator that merges the two iterators from underlying and the cache, but just return the underlying iterator directly. Could you submit another PR for this?

@mjsax
Copy link
Copy Markdown
Member

mjsax commented May 9, 2018

Does it make sense to delegate the query to the underlying store? The issue about the non-initialed cache is, that the store itself was not initialized yet. Thus, the underlying store is not initialized either.

From my point of view, as long as as a store is not initialized and fully recovered, it should not be possible to query the store. We have some check in place that validate that stores are opened but we are hitting a race condition for this case. Overall, I think that we should throw an InvalidStateStoreException -- or something more precise (cf. https://cwiki.apache.org/confluence/display/KAFKA/KIP-216%3A+IQ+should+throw+different+exceptions+for+different+errors)

@tedyu
Copy link
Copy Markdown
Contributor Author

tedyu commented May 9, 2018

Since underlying is passed to CachingKeyValueStore ctor, it is usable before initInternal finishes.

@guozhangwang
Copy link
Copy Markdown
Contributor

The ordering of initialization is the following:

  1. underlying.init where underlying.isOpen is set to true;
  2. assign cache

The ordering of get is the following:

  1. check underlying.isOpen;
  2. try to access cache.

So I think the race condition that @mjsax was talking about is around 1 -> 3 -> 2 -> 4.

I think to fix this issue, in init we could consider switching the steps of 1 and 2:

initInternal(context);
underlying.init(context, root);

since

volatile boolean open = false;

it should be sufficient. In this case the check on step 3) will fail if underlying.init is not completed and we will throw InvalidStateStoreException.

@tedyu
Copy link
Copy Markdown
Contributor Author

tedyu commented May 9, 2018

Opened #4988

guozhangwang pushed a commit that referenced this pull request May 10, 2018
This is continuation of #4978.
From Guozhang:

I think to fix this issue, in init we could consider switching the steps of 1 and 2:

initInternal(context);
underlying.init(context, root);

since

volatile boolean open = false;

it should be sufficient. In this case the check on step 3) will fail if underlying.init is not completed and we will throw InvalidStateStoreException.

Reviewers: Guozhang Wang <wangguoz@gmail.com>
guozhangwang pushed a commit that referenced this pull request May 10, 2018
This is continuation of #4978.
From Guozhang:

I think to fix this issue, in init we could consider switching the steps of 1 and 2:

initInternal(context);
underlying.init(context, root);

since

volatile boolean open = false;

it should be sufficient. In this case the check on step 3) will fail if underlying.init is not completed and we will throw InvalidStateStoreException.

Reviewers: Guozhang Wang <wangguoz@gmail.com>
ijuma added a commit to ijuma/kafka that referenced this pull request May 11, 2018
…-record-version

* apache-github/trunk:
  KAFKA-6894: Improve err msg when connecting processor with global store (apache#5000)
  KAFKA-6893; Create processors before starting acceptor in SocketServer (apache#4999)
  MINOR: Fix typo in ConsumerRebalanceListener JavaDoc (apache#4996)
  MINOR: Remove deprecated valueTransformer.punctuate (apache#4993)
  MINOR: Update dynamic broker configuration doc for truststore update (apache#4954)
  KAFKA-6870 Concurrency conflicts in SampledStat (apache#4985)
  KAFKA-6361: Fix log divergence between leader and follower after fast leader fail over (apache#4882)
  KAFKA-6813: Remove deprecated APIs in KIP-182, Part II (apache#4976)
  KAFKA-6878 Switch the order of underlying.init and initInternal (apache#4988)
  KAFKA-6299; Fix AdminClient error handling when metadata changes (apache#4295)
  KAFKA-6878: NPE when querying global state store not in READY state (apache#4978)
  KAFKA 6673: Implemented missing override equals method (apache#4745)
  KAFKA-6834: Handle compaction with batches bigger than max.message.bytes (apache#4953)
ying-zheng pushed a commit to ying-zheng/kafka that referenced this pull request Jul 6, 2018
…pache#4978)

Check whether cache is null before retrieving from cache.

Reviewers: Guozhang Wang <guozhang@confluent.io>, Bill Bejeck <bill@confluent.io>
ying-zheng pushed a commit to ying-zheng/kafka that referenced this pull request Jul 6, 2018
…he#4988)

This is continuation of apache#4978.
From Guozhang:

I think to fix this issue, in init we could consider switching the steps of 1 and 2:

initInternal(context);
underlying.init(context, root);

since

volatile boolean open = false;

it should be sufficient. In this case the check on step 3) will fail if underlying.init is not completed and we will throw InvalidStateStoreException.

Reviewers: Guozhang Wang <wangguoz@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants