KAFKA-3522: Add RocksDBTimestampedStore (#6149)
Conversation
Call for review @guozhangwang @bbejeck @vvcephei
The purpose of this test is to catch interface changes if we upgrade RocksDB. Using reflection, we make sure that RocksDBGenericOptionsToDbOptionsColumnFamilyOptionsFacade maps all methods from DBOptions and ColumnFamilyOptions to/from Options correctly.
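A minimal, self-contained sketch of this reflection technique (the `Wrapped` and `Facade` classes and the `setFoo`/`foo` methods below are hypothetical stand-ins, not the actual RocksDB classes): enumerate the public methods of the wrapped class and assert the facade declares all of them, so a library upgrade that adds a method makes the check fail.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.HashSet;
import java.util.Set;

public class FacadeCoverageCheck {

    static class Wrapped {                 // stand-in for e.g. rocksdb.Options
        public void setFoo(final int v) { }
        public int foo() { return 0; }
    }

    static class Facade {                  // stand-in for the ...Facade class
        public void setFoo(final int v) { }
        public int foo() { return 0; }
    }

    // collect names of all public methods declared by a class
    static Set<String> publicMethodNames(final Class<?> c) {
        final Set<String> names = new HashSet<>();
        for (final Method m : c.getDeclaredMethods()) {
            if (Modifier.isPublic(m.getModifiers())) {
                names.add(m.getName());
            }
        }
        return names;
    }

    // true iff the facade covers every public method of the wrapped class
    static boolean facadeCovers(final Class<?> wrapped, final Class<?> facade) {
        return publicMethodNames(facade).containsAll(publicMethodNames(wrapped));
    }

    public static void main(final String[] args) {
        System.out.println(facadeCovers(Wrapped.class, Facade.class)); // true
    }
}
```

Removing a method from `Facade` makes `facadeCovers` return false, which is exactly the "remove one method and watch the test fail" check discussed below.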
very nice!
Actually, can you add that statement as a class javadoc, for posterity?
Was it possible for you to test the test? I.e., how do you know the test does what it's supposed to do?
I tested it "manually" by removing one method in RocksDBGenericOptionsToDbOptionsColumnFamilyOptionsFacade -- this makes the test fail. Is this sufficient for you?
Yep! Sorry, I lost your reply in the shuffle.
That's sufficient. It looked right to me, just wanted to make sure we've seen it fail at least once.
In contrast to RocksDBStore, we use column families. Thus, we cannot pass in the Options object that users use to specify custom RocksDB options via StreamsConfig; instead we need to use DBOptions and ColumnFamilyOptions -- thus, we need to translate from Options into those two classes via this helper class.
We try to open the DB with both column families -- this might fail if only one exists. For this case, we create the missing CF and retry opening the DB afterwards.
All operations are defined over both CFs; i.e., we do dual put/get operations to migrate data from the default CF (that does not store any timestamps) to the new CF.
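The dual put/get pattern described above can be sketched with two in-memory maps standing in for the column families (class and method names and the surrogate-timestamp marker are illustrative, not the actual store code):

```java
import java.util.TreeMap;

public class DualCfSketch {
    // TreeMaps stand in for the old (no-timestamp) and new (with-timestamp) CFs
    final TreeMap<String, String> oldCf = new TreeMap<>();
    final TreeMap<String, String> newCf = new TreeMap<>();

    // put always writes to the new CF and deletes any stale copy in the old CF
    void put(final String key, final String value) {
        newCf.put(key, value);
        oldCf.remove(key);
    }

    // get checks the new CF first; a hit in the old CF is migrated on the fly
    String get(final String key) {
        String v = newCf.get(key);
        if (v != null) {
            return v;
        }
        final String old = oldCf.remove(key);
        if (old != null) {
            v = "<unknown-ts>" + old;   // surrogate timestamp prepended
            newCf.put(key, v);
        }
        return v;                       // null indicates "not found"
    }

    public static void main(final String[] args) {
        final DualCfSketch store = new DualCfSketch();
        store.oldCf.put("k1", "v1");               // pre-existing old-format data
        System.out.println(store.get("k1"));       // prints <unknown-ts>v1, migrating the record
        System.out.println(store.oldCf.isEmpty()); // prints true
    }
}
```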
This class will get more helper methods later...
This and other changes in this class are only small code cleanups and reordering of methods. There is no actual change in behavior.
nit: I personally don't like the double brace initialization for Java collections, but since this is subjective, feel free to ignore this comment.
null indicates not found and should be correct. Let me know if you disagree.
Yeah, I agree -- I forgot that returning null from the store indicates not found.
why two calls here to db.compactRange? Do we need to make a call for each column family?
An oversight on my side. I think we can just call it once (will remove the duplicate line).
The docs are not very specific about whether this triggers compaction for all CFs, but I believe yes. There is also an API to compact a single CF, but I don't see any reason to use it. Thoughts?
Thinking about this once more, I actually believe that this would only compact the default CF -- I'll add this to the RocksDBAccessor interface and pass in the CF handles.
For the operations involving the noTimestampColumnFamily, maybe we could put a guard condition around each operation, if (noTimestampColumnFamilyNotEmpty) {..}, where the boolean flag is set when opening the DB by checking the size of the old column family with something like approximateNumEntries, but only for the noTimestampColumnFamily.
Addressed this with the new RocksDBAccessor interface.
Will this work with callers? Will we inspect each value and insert an unknown timestamp for records from the default column family?
EDIT: NM, I didn't see the RocksDbIterator class below when writing this comment.
Ok. This is an actual change. We need to make this class static to reuse it in the new RocksDBTimestampedStore class. For this reason, we need to pass in openIterators via the constructor now.
Makes sense. If it's static, then we could also extract it to a separate file, which might be better than having one of the siblings define a class that both siblings use.
Also, it would reduce the LOC in this class, which is nice for readability.
Force-pushed from 2102fef to 391caa9
Updated this.
bbejeck left a comment:
Thanks for the updates @mjsax. I've made another pass, and I like the approach of determining when to stop using both column families.
Regarding a test for RocksDBTimestampedStore I have a proposal:

1. Open a plain `RocksDB`, give it the two expected column families, and populate the `RocksDB.DEFAULT_COLUMN_FAMILY` with a known number of records, then close the plain `RocksDB`.
2. Open an instance of `RocksDBTimestampedStore`, then do a series of gets on the known keys and assert all are returned with the unknown timestamp, then close.
3. Open the plain `RocksDB` again and assert that `RocksDB.DEFAULT_COLUMN_FAMILY` is empty.
WDYT?
@bbejeck Added a test as requested :)
restoreAllInternal() is only called here, so I think it's better to inline it.
Now I see why we need to make name package-private; I think it is okay.
This test made checkstyle fail because the method was too long. Extracted some parts into prepareOldStore() to fix checkstyle.
Similar to above: extracted into its own method to avoid the checkstyle issue.
Neat!
Although, it seems like more of an adapter than a facade ;)
I couldn't find the reference that makes it necessary to make this public...
This comment seems misplaced. IIUC, the class already restricts the key type to Bytes.
c&p issue (from RocksDBStore). At some point the class had generics, and it seems that while refactoring this we missed updating this JavaDoc. Fixing it in both classes.
I don't think it matters. This implementation is pretty straightforward; it doesn't seem like the bytebuffer really has any advantage.
If the others agree, we should get rid of the todo.
It's the same question as #6151 (comment)
I would like to get a uniform agreement/policy on what to use and will update the code accordingly after we've made a decision. \cc @guozhangwang
I'd say just remove the TODO; using either is fine here.
As discussed in person, I updated the code to use ByteBuffer, which should be our default API for modifying byte arrays.
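A hedged sketch of what such a ByteBuffer-based prepend can look like (NO_TIMESTAMP = -1 mirrors ConsumerRecord.NO_TIMESTAMP; the [8-byte timestamp][value] layout and the method name are assumptions for illustration, not the actual store code):

```java
import java.nio.ByteBuffer;

public class TimestampPrepend {
    static final long NO_TIMESTAMP = -1L;   // mirrors ConsumerRecord.NO_TIMESTAMP

    // prepend a surrogate timestamp to a plain (old-format) value
    static byte[] valueWithUnknownTimestamp(final byte[] plainValue) {
        return ByteBuffer.allocate(Long.BYTES + plainValue.length)
            .putLong(NO_TIMESTAMP)
            .put(plainValue)
            .array();
    }

    public static void main(final String[] args) {
        final byte[] withTs = valueWithUnknownTimestamp(new byte[] {7});
        System.out.println(withTs.length);                      // prints 9
        System.out.println(ByteBuffer.wrap(withTs).getLong());  // prints -1
    }
}
```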
Force-pushed from f9a5a3b to 110cbf1
Updated this.
guozhangwang left a comment:
Made a pass on non-testing code.
Out of scope of this PR: we should consider having a KIP in the next release to refactor the APIs of the config setters as:
```java
public interface RocksDBConfigSetter {
    /**
     * Set the RocksDB options for the provided storeName.
     *
     * @param storeName the name of the store being configured
     * @param dbOptions the RocksDB DBOptions
     * @param cfOptions the RocksDB ColumnFamilyOptions
     * @param configs   the configuration supplied to {@link org.apache.kafka.streams.StreamsConfig}
     */
    void setConfig(final String storeName, final DBOptions dbOptions, final ColumnFamilyOptions cfOptions, final Map<String, Object> configs);
}
```
And then we can choose to call the other constructor:

```java
public Options(final DBOptions dbOptions,
               final ColumnFamilyOptions columnFamilyOptions) {..}
```
Sure. Can you please create a Jira for tracking? Otherwise it might get dropped.
Looked at https://github.com/facebook/rocksdb/blob/master/options/options.cc#L367 -- there seems to be no code for setting the memtable config here.
I'd suggest we just mimic the code there instead of following the FAQ, since that page may well be obsolete.
Ack. Will update the comment accordingly.
Meta: this and the existing class still share a lot of common code; I'm wondering if it is possible to just add a flag to the existing class, based on which we can decide whether to use Options vs. DBOptions / CFOptions, and use RocksDBAccessor.
Yes. That works. We can share more code.
Just following up on my other idea about collapsing into a single class here: maybe instead of naming it keyValueWithTimestamp, we just name it as:

- "default" -> version 2.1-
- "2.2" -> version 2.2 to now.

And the flag can just indicate if it is 1) or 2) above; in the future, if we need to do this again, we can then have:

- "default" -> version 2.1-
- "2.2" -> version 2.2 - 2.5 (just made that up).
- "3.0" -> version 3.0 - now.

etc.
We cannot rename the default column family (and we cannot delete it either) -- it's there all the time. I can still rename the new CF to 2.2 if you want. Let me know.
I'd suggest try-catching each line separately since the underlying RocksDBException would not tell you which line actually went wrong, and this piece of info would be very useful for troubleshooting; ditto below.
I think we can still use write(batch); here for efficiency?
write(batch) should be done by the caller -- that we call it in the other accessor is a bug -- I will rename the method to prepareBatch().
Not sure I can follow this: could you elaborate a bit more? I was thinking to use RocksDB's

```java
void write(final WriteOptions writeOpts, final WriteBatch updates)
```

API for the putAll call; is that possible?
That is exactly what is used. Compare RocksDBStore#putAll():

```java
@Override
public void putAll(final List<KeyValue<Bytes, byte[]>> entries) {
    try (final WriteBatch batch = new WriteBatch()) {
        dbAccessor.prepareBatch(entries, batch);
        write(batch);
    } catch (final RocksDBException e) {
        throw new ProcessorStateException("Error while batch writing to store " + name, e);
    }
}
```

It calls dbAccessor.prepareBatch(entries, batch); (already renamed from putAll -> prepareBatch), which is this method. We do the call "outside" because only how the batch is prepared differs.
Does this make sense?
Which do you think is more accurate, this or db.getLongProperty("rocksdb.estimate-num-keys"); directly?
If no CF is specified, not all CFs are accessed but only the default CF -- thus, we need to add both.
I extended RocksDBTimestampedStoreTest accordingly to test approximateNumEntries(), too.
We can probably save this class if we consolidate on one RocksDBStore.
UPDATE: can we just modify RocksDbKeyValueBytesStoreSupplier to allow passing in a flag, based on which we will construct RocksDB vs. TRocksDB?
Force-pushed from 300dcc2 to 8d74b82
Updated this. Reworked and rebased to get the BloomFilter changes that got merged recently.
Force-pushed from 8d74b82 to c854ee8
guozhangwang left a comment:
Made another pass over the PR.
```java
@Override
public synchronized KeyValue<Bytes, byte[]> next() {
```
Though this is not introduced in this PR: this function seems not to be needed?
Yes. Maybe we should add a check if the iterator is open? Similar to hasNext()?
Never mind. super.next() calls hasNext() anyway. This part is covered. Removing this method.
```diff
     final String name;
     private final String parentDir;
-    private final Set<KeyValueIterator> openIterators = Collections.synchronizedSet(new HashSet<>());
+    final Set<KeyValueIterator<Bytes, byte[]>> openIterators = Collections.synchronizedSet(new HashSet<>());
```
Why does openIterators need to be package-private now?
```java
public synchronized byte[] get(final Bytes key) {
    validateStoreOpen();
    try {
        return dbAccessor.get(key.get());
```
For put / putAll etc. the exception is captured inside the dbAccessor, while for get it is captured in the caller; is that intentional?
If not, we should move the exception capturing logic inside the dbAccessor as well.
Fair question. If we move all try-catch into the dbAccessor, we need to duplicate more code. For example, db.flush() could throw, and catching the exception in RocksDB#flush() is a single place -- if we move it into the dbAccessor, we need to try-catch in both implementations.
Thoughts?
put is different, because you requested to catch put() and get() within DualAccessor separately -- thus, I moved the code inside.
putAll() does also throw (the main pattern is that the caller catches).
```java
final byte[] value = get(key);
final byte[] oldValue;
try {
    oldValue = dbAccessor.getOnly(key.get());
```
```java
private class DualColumnFamilyAccessor implements RocksDBAccessor {
```
nit: maybe we can make it just a general accessor that takes two parameters: oldCF and newCF? Or we can do this generalization in the future if you'd like to hard-code for now.
If we make one, how does it know that it needs to upgrade stuff or not? Seems to imply a check for each operation whether both or only one CF should be accessed? That would imply runtime overhead. Also, I think the code would be a little harder to read. Thoughts?
We will create the DualColumnFamilyAccessor only at construction time if we find there are two CFs and the old CF is not empty, right?
So that means we do not need to do an extra check inside this accessor impl, right?
+1 from me as well for putting both accessors into separate classes
I looked at it closer. I still think it's better to split them out, but I also don't think it's a correctness issue right now, so I'd be fine with merging what you have.
Let's do some refactoring as follow-ups. It's internal anyway. I would like to keep the PRs focused to get stuff in and merged :)
Actually, I was only suggesting to make the current Dual accessor more general:
currently it assumes the old CF to be default and the new one to be withTimestamp.
What I was suggesting is only to make these two parameterized, so that in the future we only have two accessor impls:

- XX-CF only; which we already do in this PR.
- XX-to-YY upgrade: old XX CF to YY CF upgrade accessor.

All that being said, I'm okay with such refactoring as follow-ups.
This refactoring is already done: DualAccessor has a constructor with two CF parameters.
There is one missing piece: the conversion function is hard-coded. After the TimestampedByteStore PR is merged, I will refactor this to pass in a RecordConverter and will pass in TimestampedByteStore#convertValueToNewFormat().
```java
nextNoTimestamp = null;
iterNoTimestamp.next();
} else {
    if (comparator.compare(nextNoTimestamp.key.get(), nextWithTimestamp.key.get()) <= 0) {
```
If comparator.compare(nextNoTimestamp.key.get(), nextWithTimestamp.key.get()) == 0, we need to advance on both ends while only returning the one from the with-timestamp iterator; otherwise we may get duplicates returned.
This should never happen, because each key should be stored only once. Either in old or new CF. Or do I miss something?
I think it is still possible. Here's one scenario:

1. last checkpoint at offset 100; all writes go to the old CF.
2. writes continue to the old CF until offset 110, but no checkpoint is written yet.
3. a non-graceful shutdown happens, and upon restarting the new CF is used.
4. we start restoring from offset 100 to log-end-offset 110, into the new CF.

Now we end up with data for offsets 100-110 in both CFs.
In step (4), we also delete from the old CF. Compare https://github.com/apache/kafka/pull/6149/files#diff-be43584f61033a47b5422775f5c5efdaR154
I see. Do we guarantee that concurrent IQ will not see duplicated results with the db-accessor updating logic as well? If yes, we can save this check.
That should be safe. Note, that IQ is "disabled" if we are not in RUNNING state.
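The merge logic discussed in this thread can be sketched with two sorted maps standing in for the per-CF iterators (names are illustrative, not the actual store code): iterate both in key order and, on a key collision, return the with-timestamp entry while advancing both sides so no duplicate is emitted.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MergedIterSketch {

    // merge the keys of the no-timestamp and with-timestamp CFs in sorted order
    static List<String> mergedKeys(final TreeMap<String, String> noTs,
                                   final TreeMap<String, String> withTs) {
        final List<String> out = new ArrayList<>();
        final Iterator<Map.Entry<String, String>> a = noTs.entrySet().iterator();
        final Iterator<Map.Entry<String, String>> b = withTs.entrySet().iterator();
        Map.Entry<String, String> na = a.hasNext() ? a.next() : null;
        Map.Entry<String, String> nb = b.hasNext() ? b.next() : null;
        while (na != null || nb != null) {
            if (nb == null) {
                out.add(na.getKey());
                na = a.hasNext() ? a.next() : null;
            } else if (na == null) {
                out.add(nb.getKey());
                nb = b.hasNext() ? b.next() : null;
            } else {
                final int cmp = na.getKey().compareTo(nb.getKey());
                if (cmp < 0) {
                    out.add(na.getKey());
                    na = a.hasNext() ? a.next() : null;
                } else if (cmp > 0) {
                    out.add(nb.getKey());
                    nb = b.hasNext() ? b.next() : null;
                } else {                      // same key present in both CFs:
                    out.add(nb.getKey());     // prefer the with-timestamp entry
                    na = a.hasNext() ? a.next() : null;  // advance BOTH sides
                    nb = b.hasNext() ? b.next() : null;  // to avoid duplicates
                }
            }
        }
        return out;
    }

    public static void main(final String[] args) {
        final TreeMap<String, String> oldCf = new TreeMap<>();
        oldCf.put("a", "1");
        oldCf.put("c", "stale");
        final TreeMap<String, String> newCf = new TreeMap<>();
        newCf.put("b", "2");
        newCf.put("c", "3");
        System.out.println(mergedKeys(oldCf, newCf)); // prints [a, b, c]
    }
}
```

Without the advance-both branch in the equal-keys case, the overlapping key "c" would be emitted twice, which is exactly the duplicate scenario described above.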
```java
private KeyValue<Bytes, byte[]> getKeyValueNoTimestamp() {
    return new KeyValue<>(new Bytes(iterNoTimestamp.key()), getValueWithUnknownTimestamp(iterNoTimestamp.value()));
```
I'd suggest one optimization here, in order to save array-copying every time we scan over the old CF: keep the original key and value bytes for the no-timestamp iterator, and only when we've decided to assign next from this iterator call getValueWithUnknownTimestamp to do the byte-array copying.
```java
// iterating should not migrate any data, but return all keys over both CF (plus surrogate timestamps for old CF)
final KeyValueIterator<Bytes, byte[]> it = rocksDBStore.all();
```
Could we also test for range(from, to) queries?
```java
private ColumnFamilyHandle noTimestampColumnFamily;
private ColumnFamilyHandle withTimestampColumnFamily;

RocksDBTimestampedStore(final String name) {
```
👍 It looks like the two-arg constructor is unused.
```java
final ColumnFamilyOptions columnFamilyOptions) {
    final List<ColumnFamilyDescriptor> columnFamilyDescriptors = asList(
        new ColumnFamilyDescriptor(RocksDB.DEFAULT_COLUMN_FAMILY, columnFamilyOptions),
        new ColumnFamilyDescriptor("keyValueWithTimestamp".getBytes(StandardCharsets.UTF_8), columnFamilyOptions));
```
Did @guozhangwang suggest renaming this CF to 2.2? I actually think the descriptive name might be better. It seems like it'll be less work in the long run to remember what exactly is different about the different CFs.
He did suggest but there was no final agreement. I personally don't care too much.
what about naming it keyValueWithTimestamp_2.2?
I don't feel strongly about my suggestion either :) Will leave it to anyone who has a strong feeling here.
```java
private class DualColumnFamilyAccessor implements RocksDBAccessor {
```
Personally, I like the existing design. The DualColumnFamilyAccessor has to do a lot of extra checking that isn't necessary if there's just one cf to deal with. If we collapse them into one class with a flag, we pay for it with a lot of branching.
One thing I did find confusing was reasoning about the fact that the Dual accessor is embedded in this (Timestamped) class, and the Single accessor is embedded in the parent (non-timestamped) class. But, we're using it as an accessor for this (the child) class. This seems unnecessarily convoluted, and it's a little hard to see if it's actually ok, or just coincidentally ok, since the parent and child APIs are only semantically, rather than actually, different.
It seems simpler to understand if we pull both accessors out into separate classes that take db, name, options, etc as constructor arguments, rather than closing over protected state.
bbejeck left a comment:
Thanks for the updates, @mjsax. Overall looks good to me; I want to take another pass over the unit test later this evening.
```java
public class RocksDBTimestampedStoreTest extends RocksDBStoreTest {
    final String unknownTimestampString = new String(new LongSerializer().serialize(null, ConsumerRecord.NO_TIMESTAMP));

    String getTimestampPrefix() {
```
bbejeck left a comment:
One minor nit, otherwise LGTM.
```java
// approx: 4 entries on old CF, 1 in new CF
assertThat(rocksDBStore.approximateNumEntries(), is(5L));

// should add new key10 to new CF
```
nit: the comment states key10 is added, but the actual key put is key8
```java
}

if (nextNoTimestamp == null) {
    if (nextWithTimestamp == null) {
```
should both iterators also be reporting !isValid here as well? I'm finding the rocksdb iterator api a little confusing...
I guess if we never allow a null key into the store, then this is an effective way to check for the end of the iteration.
null keys are not allowed -- that is the assumption in this check. We can add the !isValid check as a safeguard if you want.
I don't feel strongly about it. If we enforce the "no null keys" invariant, then they are equivalent.
It seems mildly confusing that we essentially have two different methods of determining when the iterator has run out of data. I leave it up to you.
Updated this. Will merge when Jenkins is green.
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/19162/ Retest this please.
* AK/trunk:
  - fix typo (apache#5150)
  - MINOR: Reduce replica.fetch.backoff.ms in ReassignPartitionsClusterTest (apache#5887)
  - KAFKA-7766: Fail fast PR builds (apache#6059)
  - KAFKA-7798: Expose embedded clientIds (apache#6107)
  - KAFKA-7641; Introduce "group.max.size" config to limit group sizes (apache#6163)
  - KAFKA-7433; Introduce broker options in TopicCommand to use AdminClient (KIP-377)
  - MINOR: Fix some field definitions for ListOffsetReponse (apache#6214)
  - KAFKA-7873; Always seek to beginning in KafkaBasedLog (apache#6203)
  - KAFKA-7719: Improve fairness in SocketServer processors (KIP-402) (apache#6022)
  - MINOR: fix checkstyle suppressions for generated RPC code to work on Windows
  - KAFKA-7859: Use automatic RPC generation in LeaveGroups (apache#6188)
  - KAFKA-7652: Part II; Add single-point query for SessionStore and use for flushing / getter (apache#6161)
  - KAFKA-3522: Add RocksDBTimestampedStore (apache#6149)
  - KAFKA-3522: Replace RecordConverter with TimestampedBytesStore (apache#6204)
Reviewers: Bill Bejeck <bill@confluent.io>, John Roesler <john@confluent.io>, Guozhang Wang <guozhang@confluent.io>
Part of KIP-258. Adds only internal classes (we can merge right away).