BP-47 (task6): Direct I/O entrylogger support#3263
Conversation
| int ledgersMapSize = sizeBuffer.get().readInt(); | ||
| if (ledgersMapSize <= 0) { |
why are we adding this check ?
If there are no ledgers in the ledgersMap, we just skip the following reads.
so this case was not possible before this change ?
btw, it does no harm.
thanks for clarification
| <forkedProcessTimeoutInSeconds>1800</forkedProcessTimeoutInSeconds> | ||
| <rerunFailingTestsCount>${testRetryCount}</rerunFailingTestsCount> | ||
| <useSystemClassLoader>false</useSystemClassLoader> | ||
| <argLine>-Djava.library.path=target/natives</argLine> |
this should be useless, shouldn't it?
| <groupId>org.apache.maven.plugins</groupId> | ||
| <artifactId>maven-failsafe-plugin</artifactId> | ||
| <version>${maven-failsafe-plugin.version}</version> | ||
| <executions> |
you don't need to add this "executions" section, otherwise we are going to execute failsafe every time you run Maven.
It is automatically pre-configured; it makes sense to declare it only in order to set a version.
eolivelli
left a comment
I did another pass.
The patch is very big, we need more eyes
@dlg99 @merlimat @rdhabalia @RaulGracia @reddycharan @Ghatage
I think this calls for adding backwards-compat tests covering the upgrade/downgrade scenario (data written with these changes can be read by the previous version of BK and vice versa): https://github.com/apache/bookkeeper/tree/master/tests/backward-compat
dlg99
left a comment
some comments, plus needs backwards compat tests.
| /** | ||
| * O(nlogn) algorithm to find largest contiguous gap between | ||
| * integers in a passed list. n should be relatively small. | ||
| * Entry logs should be about 1GB in size, so even if the node |
not guaranteed in case of entry log per ledger.
Yes, it can only be guaranteed by EntryLogManagerForSingleEntryLog
| * Entry logs should be about 1GB in size, so even if the node | ||
| * stores a PB, there should be only 1000000 entry logs. | ||
| */ | ||
| static int[] findLargestGap(List<Integer> currentIds) { |
use Pair<> instead of int[] because that's what it is
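For reference, a minimal sketch of the kind of O(n log n) gap search the comment describes: sort the ids, then scan adjacent pairs for the widest gap. The exact semantics of the patch's method may differ; this version returns the two existing ids bounding the widest gap as an `int[]`, which is exactly the pair-shaped result the reviewer suggests modeling with a `Pair<>`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class GapFinder {
    // Sort the ids (O(n log n)), then scan adjacent pairs for the widest gap.
    // Returns {lowerId, upperId}: the two existing ids bounding the gap.
    static int[] findLargestGap(List<Integer> currentIds) {
        List<Integer> ids = new ArrayList<>(currentIds);
        Collections.sort(ids);
        int bestLow = 0;
        int bestHigh = 0;
        int bestWidth = -1;
        for (int i = 0; i + 1 < ids.size(); i++) {
            int width = ids.get(i + 1) - ids.get(i);
            if (width > bestWidth) {
                bestWidth = width;
                bestLow = ids.get(i);
                bestHigh = ids.get(i + 1);
            }
        }
        return new int[] { bestLow, bestHigh };
    }
}
```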
| private void addOutstandingWrite(Future<?> toAdd) throws IOException { | ||
| synchronized (outstandingWrites) { | ||
| outstandingWrites.add(toAdd); |
There is no limit on number of outstandingWrites.
Typically we start with that and end up adding some kind of limiter or throttler to prevent system overload.
Does it make sense to do it right now?
Good idea. We can add a limiter for it in a follow-up PR. This PR is already huge; I have tried my best to keep it simple.
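A limiter along the lines discussed could be as simple as a semaphore gating submissions. This is only an illustrative sketch; `WriteLimiter` and its method names are invented here, not part of the patch.

```java
import java.util.concurrent.Semaphore;

// Illustrative sketch only: bound the number of in-flight writes with a
// semaphore so the write executor's queue cannot grow without limit.
public class WriteLimiter {
    private final Semaphore permits;

    WriteLimiter(int maxOutstanding) {
        this.permits = new Semaphore(maxOutstanding);
    }

    // Called by the submitting thread before queueing a write; blocks
    // (applying backpressure) once maxOutstanding writes are in flight.
    void beforeSubmit() {
        permits.acquireUninterruptibly();
    }

    // Called when a write future completes, freeing a slot.
    void afterComplete() {
        permits.release();
    }

    int available() {
        return permits.availablePermits();
    }
}
```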
| try { | ||
| if (bytesToWrite <= 0) { | ||
| return null; | ||
| } | ||
| int ret = nativeIO.pwrite(fd, bufferToFlush.pointer(), bytesToWrite, offsetToWrite); | ||
| if (ret != bytesToWrite) { | ||
| throw new IOException(exMsg("Incomplete write") | ||
| .kv("filename", filename) | ||
| .kv("pointer", bufferToFlush.pointer()) | ||
| .kv("offset", offsetToWrite) | ||
| .kv("writeSize", bytesToWrite) | ||
| .kv("bytesWritten", ret).toString()); | ||
| } | ||
| } catch (NativeIOException ne) { | ||
| throw new IOException(exMsg(ne.getMessage()) | ||
| .kv("filename", filename) | ||
| .kv("offset", offsetToWrite) | ||
| .kv("writeSize", bytesToWrite) | ||
| .kv("pointer", bufferToFlush.pointer()) | ||
| .kv("errno", ne.getErrno()).toString(), ne); | ||
| } finally { | ||
| bufferPool.release(bufferToFlush); | ||
| } | ||
| return null; |
same as lines 120+? consider refactoring
The max line length is 112, so no need to refactor.
I meant the code at
looks similar, can probably refactor; this will also guarantee the "if (bytesToWrite <= 0)" check in both places.
| final BufferPool bufferPool; | ||
| final ExecutorService writeExecutor; | ||
| final Object bufferLock = new Object(); | ||
| final List<Future<?>> outstandingWrites = new ArrayList<Future<?>>(); |
can it be ConcurrentLinkedQueue to avoid synchronization later?
Maybe not. Two methods, addOutstandingWrite and waitForOutstandingWrites, use outstandingWrites. They don't just add and remove elements but also iterate over the list contents, so we need synchronization to keep the operations ordered.
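To illustrate the point, here is a sketch (with assumed names, not the patch's actual code) of the add/drain pattern that needs the lock: the waiter must snapshot and clear the list atomically with respect to concurrent adds, so every submitted write is awaited exactly once and in submission order — something a lock-free queue alone does not provide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;

// Sketch with assumed names: the lock lets the waiter snapshot and clear
// the list atomically with respect to concurrent adds.
public class OutstandingWrites {
    private final List<Future<?>> outstanding = new ArrayList<>();

    void add(Future<?> f) {
        synchronized (outstanding) {
            outstanding.add(f);
        }
    }

    void waitForAll() throws Exception {
        List<Future<?>> toWait;
        synchronized (outstanding) {
            toWait = new ArrayList<>(outstanding); // ordered snapshot
            outstanding.clear();
        }
        for (Future<?> f : toWait) {
            f.get(); // rethrows if the write failed
        }
    }
}
```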
| try { | ||
| fd = nativeIO.open(filename, | ||
| NativeIO.O_CREAT | NativeIO.O_WRONLY | NativeIO.O_DIRECT, | ||
| 00755); |
00755 as in permissions? why 755 and not e.g. 600 or 640?
"755 means read and execute access for everyone and also write access for the owner of the file."
Why are we granting execute access to this file?
Why is read access needed for everyone?
644 is better. The current entry log files are created with permission 644, so we can keep it the same as before.
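For readers unfamiliar with octal modes, a small sketch decoding what 0755 vs. 0644 actually grant; the `describe` helper below is illustrative only, not part of the patch.

```java
// Decode a POSIX octal mode into the familiar rwx string.
// 0755 grants group/other execute access, which a data file never needs;
// 0644 (rw-r--r--) matches how existing entry log files are created.
public class ModeBits {
    static String describe(int mode) {
        final String rwx = "rwxrwxrwx";
        char[] out = new char[9];
        for (int i = 0; i < 9; i++) {
            // Bit 8 is owner-read, bit 0 is other-execute.
            out[i] = ((mode >> (8 - i)) & 1) == 1 ? rwx.charAt(i) : '-';
        }
        return new String(out);
    }
}
```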
| flushBuffer(); | ||
| } | ||
| } | ||
I think you should do waitForOutstandingWrites(); here or call flush() instead of flushBuffer() to confirm that the last outstanding write (submitted e.g. by flushBuffer()) actually succeeded.
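The ordering the comment asks for can be sketched as follows. The names mirror the discussion, but the async write is simulated with a queue of runnables; in the real code, waiting would mean blocking on the executor's Futures so a failed pwrite is surfaced.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch: flush() must first submit the tail buffer and then wait on every
// outstanding write, so an error in the final write is surfaced, not dropped.
public class FlushOrder {
    final Queue<Runnable> submitted = new ArrayDeque<>();
    int flushed = 0;

    void flushBuffer() {
        submitted.add(() -> flushed++); // stand-in for an async pwrite
    }

    void waitForOutstandingWrites() {
        Runnable r;
        while ((r = submitted.poll()) != null) {
            r.run(); // stand-in for Future.get(), which rethrows failures
        }
    }

    void flush() {
        flushBuffer();               // submit the partial tail buffer first
        waitForOutstandingWrites();  // then confirm every write completed
    }
}
```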
Ok, I will add the backwards-compat tests.
| help = "Operation stats of reading entries from the entry log", | ||
| parent = BOOKIE_READ_ENTRY | ||
| ) | ||
| private final ThreadLocal<OpStatsLogger> readEntryStats; |
ThreadLocal must be static, because it is actually not bound to a specific instance.
Those stats which use ThreadLocal need the stats parameter passed in, so it is hard to make them static.
https://stackoverflow.com/a/2784059/2237794
You are getting a per-thread, per-instance value in this case instead of a per-thread one.
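A tiny demo of the distinction the linked answer makes: a non-static ThreadLocal gives one value per (instance, thread) pair, while a static one gives one value per thread. `TlDemo` is a made-up class for illustration only.

```java
// A non-static ThreadLocal is stored per instance: two TlDemo objects on
// the same thread see independent values. A static ThreadLocal is shared
// by all instances on that thread.
public class TlDemo {
    private final ThreadLocal<int[]> perInstance = ThreadLocal.withInitial(() -> new int[1]);
    private static final ThreadLocal<int[]> perThread = ThreadLocal.withInitial(() -> new int[1]);

    int bumpInstance() { return ++perInstance.get()[0]; }
    int bumpStatic()   { return ++perThread.get()[0]; }
}
```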
| // DirectEntryLogger, DefaultEntryLogger and DirectEntryLogger. | ||
| // DirectEntryLogger -> DefaultEntryLogger -> DirectEntryLogger. | ||
| @Test | ||
| public void testCompatFromDirectToDefaultToDirectLogger() throws Exception { |
@dlg99 I have added the compat test here
@dlg99 I have added all the compat tests in |
All the comments have been addressed, please help take a look. This PR is huge, we need more eyes. @merlimat @eolivelli @dlg99 @ivankelly @rdhabalia @RaulGracia @reddycharan @Ghatage
force-pushed 3c93aed to 44f0f1a
dlg99
left a comment
Overall LGTM.
I think the non-static ThreadLocal is not doing what you wanted it to do, please check if there are ways to solve this
|
rerun failure checks |
The implementation uses JNI to do direct I/O to files via POSIX syscalls. Fallocate is used if running on Linux; otherwise this is skipped (at the cost of more filesystem operations during writing).

There are two calls to write, writeAt and writeDelimited. I expect writeAt to be used for the entrylog headers, while entries will go through writeDelimited. In both cases, the calls may return before the syscalls occur. #flush() needs to be called to ensure things are actually written.

The entry log format isn't much changed from what is used by the existing entrylogger. The biggest difference is the padding. Direct I/O must write in aligned blocks. The size of the alignment varies by machine configuration, but 4K is a safe bet on most. As it is unlikely that entry data will land exactly on the alignment boundary, we need to add padding to writes. The existing entry logger has been changed to take this padding into account. When read as a signed int/long/byte, the padding will always parse to a negative value, which distinguishes it from valid entry data (the entry size will always be positive) and also from preallocated space (which is always 0).

Another difference in the format is that the header is now 4K rather than 1K. Again, this is to allow aligned writes. No changes are necessary to allow the existing entry logger to deal with the header change, as we create a dummy entry in the extra header space that the existing entry logger already knows to ignore.

To enable, set dbStorage_directIOEntryLogger=true in the configuration.

(cherry picked from commit 09a1c81)
force-pushed 8adf1ba to 726d359
rerun failure checks
merlimat
left a comment
Thanks for carrying on this huge set of changes. Great work!
Squashed commits (the full commit message repeats the description above, cherry picked from commit 09a1c81):

* Direct I/O entry log support
* format code
* format code
* fix spotbugs check
* fix jnilib not found
* fix jnilib not found
* address comments
* fix so lib not found
* address comments
* format code
* add compat test
* add compat test
* address comments
* fix findbugs failed
* format code

Co-authored-by: Ivan Kelly <ikelly@splunk.com>
Motivation
Task 6 of BP-47, Direct I/O entrylogger support.
This is a standalone module for direct I/O entry logger support.
Modification
The implementation uses JNI to do direct I/O to files via POSIX syscalls. Fallocate is used if running on Linux; otherwise this is skipped (at the cost of more filesystem operations during writing).
There are two calls to write, writeAt and writeDelimited. I expect writeAt to be used for the entrylog headers, while entries will go through writeDelimited. In both cases, the calls may return before the syscalls occur. #flush() needs to be called to ensure things are actually written.
The entry log format isn't much changed from what is used by the existing entrylogger. The biggest difference is the padding. Direct I/O must write in aligned blocks. The size of the alignment varies by machine configuration, but 4K is a safe bet on most. As it is unlikely that entry data will land exactly on the alignment boundary, we need to add padding to writes. The existing entry logger has been changed to take this padding into account. When read as a signed int/long/byte, the padding will always parse to a negative value, which distinguishes it from valid entry data (the entry size will always be positive) and also from preallocated space (which is always 0).
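The alignment padding described above amounts to rounding each write up to the block size; for a power-of-two alignment such as 4K, that is a one-line bit trick. This is a sketch, not the patch's actual code.

```java
public class Align {
    // Round size up to the next multiple of alignment.
    // Assumes alignment is a power of two (e.g. 4096 for 4K direct I/O).
    static long alignUp(long size, long alignment) {
        return (size + alignment - 1) & ~(alignment - 1);
    }
}
```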
Another difference in the format is that the header is now 4K rather than 1K. Again, this is to allow aligned writes. No changes are necessary to allow the existing entry logger to deal with the header change, as we create a dummy entry in the extra header space that the existing entry logger already knows to ignore.
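The sign-based discrimination described above (positive = entry length, zero = preallocated space, negative = padding) can be illustrated with a small sketch. The specific padding byte value here is an assumption chosen only so the high bit is set; the patch may use a different marker.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Classify the next 4 bytes of an entry log region by reading them as a
// signed int: positive means an entry length, zero means preallocated
// space, negative means padding.
public class Framing {
    // Assumed marker byte; any value with the high bit set parses negative.
    static final byte PADDING_BYTE = (byte) 0xF0;

    static byte[] padding(int len) {
        byte[] b = new byte[len];
        Arrays.fill(b, PADDING_BYTE);
        return b;
    }

    static int classify(ByteBuffer buf) {
        int v = buf.getInt();
        if (v > 0) {
            return 1;   // start of a real entry (its length)
        }
        if (v == 0) {
            return 0;   // preallocated, never-written space
        }
        return -1;      // padding
    }
}
```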
Others
The commit was made by @ivankelly. This is a sub-task of #2932, which was pushed out by @mauricebarnum. However, that PR contains too many changes, the number of code lines reaches 8K+, and it is hard to review. Following the suggestion in #2932 (comment), and after communicating with @mauricebarnum, we are planning to divide the PR into 6 PRs; please refer to #2943 (comment).
However, @mauricebarnum doesn't have enough time to deal with those issues, and we desperately need this feature. After communicating with @mauricebarnum and @merlimat, I am helping to divide the PRs and push them out. We are hoping to include this feature in BookKeeper 4.16.0.
However, @mauricebarnum doesn't have enough time to deal with those issues, and we desperately need this feature. After communicated with @mauricebarnum and @merlimat , I help to work on divide the PRs and push them out. We are hoping to contain this feature in BookKeeper 4.16.0