BP-47 (task6): Direct I/O entrylogger support by hangc0276 · Pull Request #3263 · apache/bookkeeper

hangc0276 · 2022-05-09T02:15:44Z

Motivation

Task 6 of BP-47, Direct I/O entrylogger support.

This is a standalone module for direct I/O entry logger support.

Modification

The implementation uses JNI to do direct I/O to files via POSIX syscalls. fallocate is used if running on Linux; otherwise it is skipped (at the cost of more filesystem operations during writing).

There are two calls to write, writeAt and writeDelimited. I expect writeAt to be used for the entry log headers, while entries will go through writeDelimited. In both cases, the calls may return before the syscalls occur. #flush() needs to be called to ensure things are actually written.

The entry log format isn't much changed from what is used by the existing entry logger. The biggest difference is the padding. Direct I/O must write in aligned blocks. The size of the alignment varies by machine configuration, but 4K is a safe bet on most. As it is unlikely that entry data will land exactly on the alignment boundary, we need to add padding to writes. The existing entry logger has been changed to take this padding into account. When read as a signed int/long/byte, the padding will always parse to a negative value, which distinguishes it from valid entry data (the entry size will always be positive) and also from preallocated space (which is always 0).
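The alignment arithmetic described above can be sketched as follows. This is a minimal illustration and not the PR's code: the class name `AlignmentSketch`, the 4096-byte alignment, and the `0xF0` padding byte are assumptions chosen for the example. The only property the sketch relies on is that the padding byte has its high bit set, so any signed value read from a run of padding is negative.

```java
import java.nio.ByteBuffer;

final class AlignmentSketch {
    // Assumed values, for illustration only.
    static final int ALIGNMENT = 4096;
    static final byte PADDING_BYTE = (byte) 0xF0; // high bit set

    // Round size up to the next multiple of alignment (alignment must be a power of two).
    static long alignUp(long size, int alignment) {
        return (size + alignment - 1) & ~((long) alignment - 1);
    }

    // Number of padding bytes needed to reach the next aligned boundary.
    static int paddingFor(long size, int alignment) {
        return (int) (alignUp(size, alignment) - size);
    }

    // Read four padding bytes as a signed int, the way a reader
    // looking for the next entry size would.
    static int paddingAsInt() {
        ByteBuffer buf = ByteBuffer.allocate(4);
        for (int i = 0; i < 4; i++) {
            buf.put(PADDING_BYTE);
        }
        buf.flip();
        return buf.getInt();
    }
}
```

With these assumptions, a reader can tell the three cases apart by sign alone: a positive size is an entry, zero is preallocated space, and a negative value is padding.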

Another difference in the format is that the header is now 4K rather than 1K. Again, this is to allow aligned writes. No changes are necessary to allow the existing entry logger to deal with the header change, as we create a dummy entry in the extra header space that the existing entry logger already knows to ignore.

Others

The commit was made by @ivankelly. This is a sub-task of #2932, which was pushed out by @mauricebarnum. However, that PR contains too many changes (8K+ lines of code) and is hard to review. Following your suggestion in #2932 (comment), and after communicating with @mauricebarnum, we are planning to divide the PR into 6 PRs. Please refer to #2943 (comment).

However, @mauricebarnum doesn't have enough time to deal with those issues, and we desperately need this feature. After communicating with @mauricebarnum and @merlimat, I am helping to divide the PR and push the pieces out. We hope to include this feature in BookKeeper 4.16.0.

eolivelli · 2022-05-09T08:24:35Z


                int ledgersMapSize = sizeBuffer.get().readInt();
-
+                if (ledgersMapSize <= 0) {


why are we adding this check?

If there are no ledgers in the ledgers map, we just skip the reads that follow.

so this case was not possible before this change ?
btw, it does no harm.
thanks for clarification

eolivelli · 2022-05-09T08:30:44Z

          <forkedProcessTimeoutInSeconds>1800</forkedProcessTimeoutInSeconds>
          <rerunFailingTestsCount>${testRetryCount}</rerunFailingTestsCount>
+          <useSystemClassLoader>false</useSystemClassLoader>
+          <argLine>-Djava.library.path=target/natives</argLine>


this should be unnecessary, shouldn't it?

eolivelli · 2022-05-09T08:31:33Z

+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-failsafe-plugin</artifactId>
+        <version>${maven-failsafe-plugin.version}</version>
+        <executions>


you don't need to add this "executions" flag, otherwise we are going to execute failsafe every time you run Maven.

it is automatically pre-configured, it makes sense to declare it only in order to set a version

eolivelli

I did another pass.
The patch is very big, we need more eyes

@dlg99 @merlimat @rdhabalia @RaulGracia @reddycharan @Ghatage

eolivelli · 2022-05-10T06:32:47Z


                int ledgersMapSize = sizeBuffer.get().readInt();
-
+                if (ledgersMapSize <= 0) {


so this case was not possible before this change ?
btw, it does no harm.
thanks for clarification

dlg99 · 2022-05-10T18:59:54Z

The biggest difference is the padding. Direct I/O must write in aligned blocks.
Another difference in the format is that the header is now 4K rather than 1K. Again, this is to allow aligned writes. No changes are necessary to allow the existing entry logger to deal with the header change, as we create a dummy entry in the extra header space that the existing entry logger already knows to ignore.

I think this calls for adding backwards-compat tests that cover the upgrade/downgrade scenario (data written with these changes can be read by the previous version of BK and vice versa)

https://github.com/apache/bookkeeper/tree/master/tests/backward-compat

dlg99

some comments, plus needs backwards compat tests.

dlg99 · 2022-05-10T20:25:40Z

+    /**
+     * O(nlogn) algorithm to find largest contiguous gap between
+     * integers in a passed list. n should be relatively small.
+     * Entry logs should be about 1GB in size, so even if the node


not guaranteed in case of entry log per ledger.

Yes, it can only be guaranteed by EntryLogManagerForSingleEntryLog

dlg99 · 2022-05-10T23:19:24Z

+     * Entry logs should be about 1GB in size, so even if the node
+     * stores a PB, there should be only 1000000 entry logs.
+     */
+    static int[] findLargestGap(List<Integer> currentIds) {


use Pair<> instead of int[] because that's what it is
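For readers following the thread, the O(n log n) gap search described in the Javadoc above can be sketched like this. It is an illustrative reimplementation under stated assumptions, not the code from the PR: ids are assumed to be non-negative and smaller than `Integer.MAX_VALUE`, and the result is a half-open range `{firstFreeId, nextUsedId}`. It keeps the `int[]` return type from the diff; the class name `LargestGapSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class LargestGapSketch {
    // Returns {start, end}: the largest run of unused ids, start inclusive, end exclusive.
    // Assumes ids are non-negative and smaller than Integer.MAX_VALUE.
    static int[] findLargestGap(List<Integer> currentIds) {
        if (currentIds.isEmpty()) {
            return new int[] {0, Integer.MAX_VALUE};
        }
        List<Integer> sorted = new ArrayList<>(currentIds);
        Collections.sort(sorted); // the O(n log n) step

        // Start with the gap in front of the smallest id.
        int bestStart = 0;
        int bestEnd = sorted.get(0);
        for (int i = 0; i < sorted.size(); i++) {
            int gapStart = sorted.get(i) + 1;
            int gapEnd = (i + 1 < sorted.size()) ? sorted.get(i + 1) : Integer.MAX_VALUE;
            // Compare as long so the subtraction cannot overflow.
            if ((long) gapEnd - gapStart > (long) bestEnd - bestStart) {
                bestStart = gapStart;
                bestEnd = gapEnd;
            }
        }
        return new int[] {bestStart, bestEnd};
    }
}
```

The two-element array is positional, which is the readability concern behind the suggestion to return a named pair type instead.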

dlg99 · 2022-05-10T23:53:46Z

+
+    private void addOutstandingWrite(Future<?> toAdd) throws IOException {
+        synchronized (outstandingWrites) {
+            outstandingWrites.add(toAdd);


There is no limit on number of outstandingWrites.
Typically we start with that and end up adding some kind of limiter or throttler to prevent system overload.
Does it make sense to do it right now?

Good idea. We can add a limiter for it in a follow-up PR. This PR is already huge, and I have tried my best to keep it simple.
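One possible shape for such a limiter, sketched here purely as an assumption about what a follow-up could look like, is a semaphore that caps the number of in-flight writes. The class `BoundedWriteSubmitter` and its methods are hypothetical names and are not part of this PR.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;

final class BoundedWriteSubmitter {
    private final ExecutorService executor;
    private final Semaphore permits;

    BoundedWriteSubmitter(ExecutorService executor, int maxOutstanding) {
        this.executor = executor;
        this.permits = new Semaphore(maxOutstanding);
    }

    // Blocks the caller while maxOutstanding writes are already in flight.
    <T> Future<T> submit(Callable<T> write) throws InterruptedException {
        permits.acquire();
        try {
            return executor.submit(() -> {
                try {
                    return write.call();
                } finally {
                    permits.release(); // free the slot once the write finishes or fails
                }
            });
        } catch (RuntimeException e) {
            permits.release(); // e.g. RejectedExecutionException: the task never ran
            throw e;
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Blocking the submitter applies back-pressure to writers instead of letting the list of futures grow without bound; whether blocking is acceptable on the bookie write path is a design question this sketch does not answer.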

dlg99 · 2022-05-10T23:59:41Z

+                        try {
+                            if (bytesToWrite <= 0) {
+                                return null;
+                            }
+                            int ret = nativeIO.pwrite(fd, bufferToFlush.pointer(), bytesToWrite, offsetToWrite);
+                            if (ret != bytesToWrite) {
+                                throw new IOException(exMsg("Incomplete write")
+                                                      .kv("filename", filename)
+                                                      .kv("pointer", bufferToFlush.pointer())
+                                                      .kv("offset", offsetToWrite)
+                                                      .kv("writeSize", bytesToWrite)
+                                                      .kv("bytesWritten", ret).toString());
+                            }
+                        } catch (NativeIOException ne) {
+                            throw new IOException(exMsg(ne.getMessage())
+                                                  .kv("filename", filename)
+                                                  .kv("offset", offsetToWrite)
+                                                  .kv("writeSize", bytesToWrite)
+                                                  .kv("pointer", bufferToFlush.pointer())
+                                                  .kv("errno", ne.getErrno()).toString(), ne);
+                        } finally {
+                            bufferPool.release(bufferToFlush);
+                        }
+                        return null;


same as lines 120+? consider refactoring

The max line length is 112, so no need to refactor.

I meant the code at

bookkeeper/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/storage/directentrylogger/DirectWriter.java

Lines 118 to 138 in 44f0f1a

    Future<?> f = writeExecutor.submit(() -> {
            try {
                int ret = nativeIO.pwrite(fd, tmpBuffer.pointer(), bytesToWrite, offset);
                if (ret != bytesToWrite) {
                    throw new IOException(exMsg("Incomplete write")
                                          .kv("filename", filename)
                                          .kv("writeSize", bytesToWrite)
                                          .kv("bytesWritten", ret)
                                          .kv("offset", offset).toString());
                }
            } catch (NativeIOException ne) {
                throw new IOException(exMsg("Write error")
                                      .kv("filename", filename)
                                      .kv("writeSize", bytesToWrite)
                                      .kv("errno", ne.getErrno())
                                      .kv("offset", offset).toString());
            } finally {
                bufferPool.release(tmpBuffer);
            }
            return null;
        });

looks similar, can probably refactor, this will also guarantee "if (bytesToWrite <= 0)" check in both places
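A rough sketch of the kind of shared helper this comment points toward is shown below. It is not taken from the PR: `PositionalWriter` is a hypothetical stand-in for the native pwrite call, `releaseBuffer` stands in for returning the buffer to the pool, and the plain string message replaces the `exMsg(...).kv(...)` builder to keep the example self-contained.

```java
import java.io.IOException;

final class WriteTaskSketch {
    /** Hypothetical stand-in for the native positional write used in the PR. */
    interface PositionalWriter {
        int pwrite(long pointer, int count, long offset) throws IOException;
    }

    // Single code path for both write tasks: size check, write,
    // short-write detection, and buffer release in every case.
    static void writeAndRelease(PositionalWriter io, String filename, long pointer,
                                int bytesToWrite, long offset, Runnable releaseBuffer)
            throws IOException {
        try {
            if (bytesToWrite <= 0) {
                return; // nothing to write; the finally block still releases the buffer
            }
            int ret = io.pwrite(pointer, bytesToWrite, offset);
            if (ret != bytesToWrite) {
                throw new IOException("Incomplete write: filename=" + filename
                        + " offset=" + offset
                        + " writeSize=" + bytesToWrite
                        + " bytesWritten=" + ret);
            }
        } finally {
            releaseBuffer.run();
        }
    }
}
```

Both submitted tasks could then call the same method, so the `bytesToWrite <= 0` guard and the error fields cannot drift apart between the two call sites.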

dlg99 · 2022-05-11T00:07:19Z

+    final BufferPool bufferPool;
+    final ExecutorService writeExecutor;
+    final Object bufferLock = new Object();
+    final List<Future<?>> outstandingWrites = new ArrayList<Future<?>>();


can it be ConcurrentLinkedQueue to avoid synchronization later?

Maybe not. Two methods, addOutstandingWrite and waitForOutstandingWrites, use outstandingWrites. The code does not just add and remove elements but also traverses the list with an iterator. We need synchronization to keep it in order.

dlg99 · 2022-05-11T00:29:59Z

+        try {
+            fd = nativeIO.open(filename,
+                               NativeIO.O_CREAT | NativeIO.O_WRONLY | NativeIO.O_DIRECT,
+                               00755);


00755 as in permissions? why 755 and not e.g. 600 or 640?
"755 means read and execute access for everyone and also write access for the owner of the file."
Why are we granting execute access to this file?
Why is read access needed for everyone?

644 is better. The current ledger log file's permission is 644, so we can keep it the same as before.
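To make the modes in this exchange concrete: a Java integer literal with a leading zero is octal, so `00755` and `0755` denote the same bit pattern. The helper below is a small illustrative sketch (the class `FileModeSketch` is a hypothetical name) that prints the lower nine permission bits in `ls -l` style.

```java
final class FileModeSketch {
    // Render the lower nine permission bits of a POSIX mode as "rwxrwxrwx"-style text.
    static String modeToString(int mode) {
        char[] flags = {'r', 'w', 'x'};
        StringBuilder sb = new StringBuilder(9);
        for (int bit = 8; bit >= 0; bit--) {
            boolean set = (mode & (1 << bit)) != 0;
            sb.append(set ? flags[(8 - bit) % 3] : '-');
        }
        return sb.toString();
    }
}
```

Here `modeToString(00755)` yields `rwxr-xr-x`, while `modeToString(0644)` yields `rw-r--r--`, which drops the execute bits and matches the permission the reply says existing entry log files use.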

dlg99 · 2022-05-11T00:38:53Z

+                flushBuffer();
+            }
+        }
+


I think you should do waitForOutstandingWrites(); here or call flush() instead of flushBuffer() to confirm that the last outstanding write (submitted e.g. by flushBuffer()) actually succeeded.
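The concern here is that a failed asynchronous write goes unnoticed unless something waits on its future. The sketch below shows one way such a wait could surface failures. It is an assumption about the general pattern, not the PR's implementation, and it reuses the method names from the diff only for readability.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

final class OutstandingWritesSketch {
    private final List<Future<?>> outstandingWrites = new ArrayList<>();

    void addOutstandingWrite(Future<?> toAdd) {
        synchronized (outstandingWrites) {
            outstandingWrites.add(toAdd);
        }
    }

    // Block until every submitted write has finished; rethrow the first failure.
    // Note that the lock is held while waiting, so adders block for the duration.
    void waitForOutstandingWrites() throws IOException {
        synchronized (outstandingWrites) {
            Iterator<Future<?>> it = outstandingWrites.iterator();
            while (it.hasNext()) {
                Future<?> f = it.next();
                try {
                    f.get();
                } catch (ExecutionException e) {
                    throw new IOException("Write failed", e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IOException("Interrupted while waiting for writes", e);
                } finally {
                    it.remove(); // drop the future whether it succeeded or failed
                }
            }
        }
    }

    int pending() {
        synchronized (outstandingWrites) {
            return outstandingWrites.size();
        }
    }
}
```

Calling a method like this after the final buffer flush is what turns "the write was submitted" into "the write is known to have succeeded".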

hangc0276 · 2022-06-12T11:58:09Z

Ok, I will add the backwards-compat tests.

eolivelli · 2022-06-15T07:02:43Z

+              help = "Operation stats of reading entries from the entry log",
+              parent = BOOKIE_READ_ENTRY
+    )
+    private final ThreadLocal<OpStatsLogger> readEntryStats;


ThreadLocal must be static, because it is actually not bound to a specific instance.

The stats that use ThreadLocal depend on the stats parameter that is passed in, so it is hard to make them static.

https://stackoverflow.com/a/2784059/2237794
You are getting per thread/per instance in this case instead of per thread
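The distinction being made can be demonstrated with a small, self-contained sketch. The class names are hypothetical and the payload is a plain `Object` rather than an `OpStatsLogger`; the point is only the scoping.

```java
final class ThreadLocalScopeSketch {
    // Instance field: every object gets its own ThreadLocal,
    // so values are per thread AND per instance.
    static final class PerInstance {
        final ThreadLocal<Object> slot = ThreadLocal.withInitial(Object::new);
    }

    // Static field: one ThreadLocal for the class, so values are per thread only.
    static final class Shared {
        static final ThreadLocal<Object> SLOT = ThreadLocal.withInitial(Object::new);
    }

    // True when two instances used from the same thread see different values.
    static boolean instancesSeeDistinctValues() {
        PerInstance a = new PerInstance();
        PerInstance b = new PerInstance();
        return a.slot.get() != b.slot.get();
    }

    // True when repeated reads on the same thread return the same value.
    static boolean staticIsStableWithinThread() {
        return Shared.SLOT.get() == Shared.SLOT.get();
    }
}
```

Both methods return `true` when called from a single thread, which illustrates the "per thread/per instance" behaviour described above. Whether that is a problem depends on how many logger instances a process creates.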

hangc0276 · 2022-06-19T06:51:36Z

+    // DirectEntryLogger, DefaultEntryLogger and DirectEntryLogger.
+    // DirectEntryLogger -> DefaultEntryLogge -> DirectEntryLogger.
+    @Test
+    public void testCompatFromDirectToDefaultToDirectLogger() throws Exception {


@dlg99 I have added the compat test here

hangc0276 · 2022-06-19T06:54:40Z

@dlg99 I have added all the compat tests in TestDirectEntryLoggerCompat to ensure that switching between DefaultEntryLogger and DirectEntryLogger doesn't affect entry reads and writes.

hangc0276 · 2022-06-19T07:25:14Z

All the comments have been addressed, please help take a look. This PR is huge, so we need more eyes.

@merlimat @eolivelli @dlg99 @ivankelly @rdhabalia @RaulGracia @reddycharan @Ghatage

eolivelli

LGTM

dlg99

Overall LGTM.
I think the non-static ThreadLocal is not doing what you wanted it to do, please check if there are ways to solve this

hangc0276 · 2022-06-22T07:13:43Z

rerun failure checks

hangc0276 · 2022-06-22T15:28:05Z

rerun failure checks

hangc0276 · 2022-06-23T00:12:42Z

rerun failure checks

merlimat

Thanks for carrying on this huge set of changes. Great work!

* Direct I/O entry log support

The implementation uses JNI to do direct I/O to files via POSIX syscalls. fallocate is used if running on Linux; otherwise it is skipped (at the cost of more filesystem operations during writing).

There are two calls to write, writeAt and writeDelimited. I expect writeAt to be used for the entry log headers, while entries will go through writeDelimited. In both cases, the calls may return before the syscalls occur. #flush() needs to be called to ensure things are actually written.

The entry log format isn't much changed from what is used by the existing entry logger. The biggest difference is the padding. Direct I/O must write in aligned blocks. The size of the alignment varies by machine configuration, but 4K is a safe bet on most. As it is unlikely that entry data will land exactly on the alignment boundary, we need to add padding to writes. The existing entry logger has been changed to take this padding into account. When read as a signed int/long/byte, the padding will always parse to a negative value, which distinguishes it from valid entry data (the entry size will always be positive) and also from preallocated space (which is always 0).

Another difference in the format is that the header is now 4K rather than 1K. Again, this is to allow aligned writes. No changes are necessary to allow the existing entry logger to deal with the header change, as we create a dummy entry in the extra header space that the existing entry logger already knows to ignore.

To enable, set dbStorage_directIOEntryLogger=true in the configuration.

(cherry picked from commit 09a1c81)

* format code
* format code
* fix spotbugs check
* fix jnilib not found
* fix jnilib not found
* address comments
* fix so lib not found
* address comments
* format code
* add compat test
* add compat test
* address comments
* fix findbugs failed
* format code

Co-authored-by: Ivan Kelly <ikelly@splunk.com>

eolivelli reviewed May 9, 2022

eolivelli reviewed May 10, 2022

dlg99 requested changes May 11, 2022

hangc0276 requested review from dlg99 and eolivelli June 13, 2022 07:41

eolivelli reviewed Jun 15, 2022

hangc0276 commented Jun 19, 2022

hangc0276 force-pushed the chenhang/directio-part6-direct-entrylogger-support branch from 3c93aed to 44f0f1a Compare June 20, 2022 01:34

eolivelli approved these changes Jun 20, 2022

dlg99 approved these changes Jun 21, 2022

Ivan Kelly and others added 15 commits June 22, 2022 22:07

format code

73aae68

format code

0266494

fix spotbugs check

284d502

fix jnilib not found

4b27a8c

fix jnilib not found

f0ab4fb

address comments

ae9e0e4

fix so lib not found

65e5bf7

address comments

46d61af

format code

e197963

add compat test

660aee4

add compat test

2f758b7

address comments

b221412

fix findbugs failed

8a4d0da

format code

726d359

hangc0276 force-pushed the chenhang/directio-part6-direct-entrylogger-support branch from 8adf1ba to 726d359 Compare June 22, 2022 14:08

merlimat approved these changes Jun 23, 2022

merlimat merged commit 677ccec into apache:master Jun 23, 2022

hangc0276 self-assigned this Jul 25, 2022

hangc0276 added area/bookie BP type/improvement labels Jul 25, 2022

hangc0276 added this to the 4.16.0 milestone Jul 25, 2022

hezhangjian mentioned this pull request Jun 1, 2024

Error in parsing entrylog file causes GC failure #3607

Closed

