
BP-47 (task6): Direct I/O entrylogger support #3263

Merged
merlimat merged 15 commits into apache:master from
hangc0276:chenhang/directio-part6-direct-entrylogger-support
Jun 23, 2022

Conversation

@hangc0276
Contributor

Motivation

Task 6 of BP-47, Direct I/O entrylogger support.

This is a standalone module for direct I/O entry logger support.

Modification

The implementation uses JNI to do direct I/O to files via POSIX syscalls. Fallocate is used if running on Linux; otherwise this is skipped (at the cost of more filesystem operations during writing).

There are two calls to write, writeAt and writeDelimited. I expect writeAt to be used for the entrylog headers, while entries will go through writeDelimited. In both cases, the calls may return before the syscalls occur. #flush() needs to be called to ensure things are actually written.

The entry log format isn't much changed from what is used by the existing entrylogger. The biggest difference is the padding. Direct I/O must write in aligned blocks. The size of the alignment varies by machine configuration, but 4K is a safe bet on most. As it is unlikely that entry data will land exactly on the alignment boundary, we need to add padding to writes. The existing entry logger has been changed to take this padding into account. When read as a signed int/long/byte the padding will always parse to a negative value, which distinguishes it from valid entry data (the entry size will always be positive) and also from preallocated space (which is always 0).
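The padding scheme can be illustrated with a minimal sketch. It assumes a 4096-byte alignment and a padding byte of 0xF0; both constants and all names below are illustrative assumptions, not taken from the patch. Any byte with its high bit set has the property described above: a run of such bytes reads as a negative signed byte, int, or long, while preallocated space reads as 0.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class AlignmentPaddingSketch {
    // Assumed alignment; the description calls 4K a safe bet on most machines.
    static final int ALIGNMENT = 4096;
    // Hypothetical padding byte; the high bit is set, so it is negative when signed.
    static final byte PADDING_BYTE = (byte) 0xF0;

    /** Rounds size up to the next multiple of ALIGNMENT. */
    static int alignUp(int size) {
        return (size + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    }

    /** Copies the payload into an aligned-size block and fills the tail with padding. */
    static ByteBuffer pad(byte[] payload) {
        byte[] block = new byte[alignUp(payload.length)];
        System.arraycopy(payload, 0, block, 0, payload.length);
        Arrays.fill(block, payload.length, block.length, PADDING_BYTE);
        return ByteBuffer.wrap(block);
    }

    public static void main(String[] args) {
        ByteBuffer block = pad(new byte[5000]);
        System.out.println(block.capacity());        // aligned block size
        System.out.println(block.get(5000) < 0);     // padding read as a byte
        System.out.println(block.getInt(5000) < 0);  // padding read as an int
        System.out.println(block.getLong(5000) < 0); // padding read as a long
    }
}
```

A reader that expects a positive entry size at some offset can therefore tell the three cases apart by sign alone: positive means an entry, zero means preallocated space, negative means padding.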

Another difference in the format is that the header is now 4K rather than 1K. Again, this is to allow aligned writes. No changes are necessary to allow the existing entry logger to deal with the header change, as we create a dummy entry in the extra header space that the existing entry logger already knows to ignore.

Others

The commit is made by @ivankelly. This is a sub-task of #2932, which was pushed out by @mauricebarnum. However, that PR contains too many changes, with more than 8K lines of code, and is hard to review. Following your suggestion in #2932 (comment), and after communicating with @mauricebarnum, we are planning to divide the PR into 6 PRs. Please refer to #2943 (comment).

However, @mauricebarnum doesn't have enough time to deal with those issues, and we desperately need this feature. After communicating with @mauricebarnum and @merlimat, I am helping to divide the PR and push the pieces out. We are hoping to include this feature in BookKeeper 4.16.0.


int ledgersMapSize = sizeBuffer.get().readInt();

if (ledgersMapSize <= 0) {
Contributor

why are we adding this check ?

Contributor Author

If there are no ledgers in the ledgers map, we just skip the subsequent read.

Contributor

so this case was not possible before this change ?
btw, it does no harm.
thanks for clarification

Comment thread pom.xml Outdated
<forkedProcessTimeoutInSeconds>1800</forkedProcessTimeoutInSeconds>
<rerunFailingTestsCount>${testRetryCount}</rerunFailingTestsCount>
<useSystemClassLoader>false</useSystemClassLoader>
<argLine>-Djava.library.path=target/natives</argLine>
Contributor

this should be unnecessary, shouldn't it?

Comment thread pom.xml Outdated
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-failsafe-plugin</artifactId>
<version>${maven-failsafe-plugin.version}</version>
<executions>
Contributor

you don't need to add this "executions" flag, otherwise we are going to execute failsafe every time you run Maven.

it is automatically pre-configured, it makes sense to declare it only in order to set a version

Contributor

@eolivelli left a comment

I did another pass.
The patch is very big, we need more eyes

@dlg99 @merlimat @rdhabalia @RaulGracia @reddycharan @Ghatage


@dlg99
Contributor

dlg99 commented May 10, 2022

The biggest difference is the padding. Direct I/O must write in aligned blocks.
Another difference in the format is that the header is now 4K rather than 1K. Again, this is to allow aligned writes. No changes are necessary to allow the existing entry logger to deal with the header change, as we create a dummy entry in the extra header space that the existing entry logger already knows to ignore.

I think this calls for the addition of backwards-compat tests covering the upgrade/downgrade scenario (data written with these changes can be read by the previous version of BK and vice versa).

https://github.com/apache/bookkeeper/tree/master/tests/backward-compat

Contributor

@dlg99 left a comment

some comments, plus needs backwards compat tests.

/**
* O(nlogn) algorithm to find largest contiguous gap between
* integers in a passed list. n should be relatively small.
* Entry logs should be about 1GB in size, so even if the node
Contributor

not guaranteed in case of entry log per ledger.

Contributor Author

Yes, it can only be guaranteed by EntryLogManagerForSingleEntryLog

* Entry logs should be about 1GB in size, so even if the node
* stores a PB, there should be only 1000000 entry logs.
*/
static int[] findLargestGap(List<Integer> currentIds) {
Contributor

use Pair<> instead of int[] because that's what it is

Contributor Author

done
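
For readers following the thread, here is a minimal sketch of an O(n log n) largest-gap search that returns a pair instead of an int[]. It is not the implementation from the patch: the `IntPair` holder, the half-open interpretation of the gap, and the assumption that IDs are non-negative are all illustrative choices.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class LargestGapSketch {
    /** Hypothetical pair type: first free ID (inclusive) and the gap's end (exclusive). */
    static final class IntPair {
        final int start;
        final int end;
        IntPair(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Sorts a copy of the IDs, then scans neighbours once for the widest gap. */
    static IntPair findLargestGap(List<Integer> currentIds) {
        List<Integer> sorted = new ArrayList<>(currentIds);
        Collections.sort(sorted); // the O(n log n) step
        int bestStart = 0;
        int bestEnd = sorted.isEmpty() ? Integer.MAX_VALUE : sorted.get(0);
        long bestSize = (long) bestEnd - bestStart;
        for (int i = 0; i < sorted.size(); i++) {
            // long arithmetic avoids overflow when an ID equals Integer.MAX_VALUE
            long gapStart = (long) sorted.get(i) + 1;
            long gapEnd = (i + 1 < sorted.size()) ? sorted.get(i + 1) : Integer.MAX_VALUE;
            long gapSize = gapEnd - gapStart;
            if (gapSize > bestSize) {
                bestSize = gapSize;
                bestStart = (int) gapStart;
                bestEnd = (int) gapEnd;
            }
        }
        return new IntPair(bestStart, bestEnd);
    }

    public static void main(String[] args) {
        IntPair gap = findLargestGap(Arrays.asList(5, 2_000_000_000));
        System.out.println(gap.start + " .. " + gap.end);
    }
}
```

Named fields make the meaning of the two values explicit at the call site, which is the point of the review comment about int[].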


private void addOutstandingWrite(Future<?> toAdd) throws IOException {
synchronized (outstandingWrites) {
outstandingWrites.add(toAdd);
Contributor

There is no limit on number of outstandingWrites.
Typically we start with that and end up adding some kind of limiter or throttler to prevent system overload.
Does it make sense to do it right now?

Contributor Author

Good idea. We can add a limiter for it in a follow-up PR. This PR is already huge, and I have tried my best to keep it simple.
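
One possible shape for such a limiter, sketched under the assumption that blocking the producer is acceptable: a `Semaphore` caps the number of in-flight writes, and each write returns its permit when it finishes. The class and method names are hypothetical and are not part of this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;

final class BoundedWritesSketch {
    private final ExecutorService writeExecutor = Executors.newSingleThreadExecutor();
    private final List<Future<?>> outstandingWrites = new ArrayList<>();
    private final Semaphore permits;

    BoundedWritesSketch(int maxOutstandingWrites) {
        this.permits = new Semaphore(maxOutstandingWrites);
    }

    /** Blocks the caller once maxOutstandingWrites writes are in flight. */
    void addWrite(Runnable write) throws InterruptedException {
        permits.acquire();
        Future<?> future;
        try {
            future = writeExecutor.submit(() -> {
                try {
                    write.run();
                } finally {
                    permits.release(); // free the slot whether or not the write succeeded
                }
            });
        } catch (RuntimeException e) {
            permits.release(); // the submission itself was rejected
            throw e;
        }
        synchronized (outstandingWrites) {
            outstandingWrites.add(future);
        }
    }

    /** Waits for every write submitted so far. */
    void waitForOutstandingWrites() throws InterruptedException, ExecutionException {
        List<Future<?>> pending;
        synchronized (outstandingWrites) {
            pending = new ArrayList<>(outstandingWrites);
            outstandingWrites.clear();
        }
        for (Future<?> future : pending) {
            future.get();
        }
    }

    int availableSlots() {
        return permits.availablePermits();
    }

    void close() {
        writeExecutor.shutdown();
    }
}
```

Blocking applies backpressure to the writer thread; a non-blocking variant could use `tryAcquire` and reject or queue instead.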

Comment on lines +283 to +306
try {
    if (bytesToWrite <= 0) {
        return null;
    }
    int ret = nativeIO.pwrite(fd, bufferToFlush.pointer(), bytesToWrite, offsetToWrite);
    if (ret != bytesToWrite) {
        throw new IOException(exMsg("Incomplete write")
                              .kv("filename", filename)
                              .kv("pointer", bufferToFlush.pointer())
                              .kv("offset", offsetToWrite)
                              .kv("writeSize", bytesToWrite)
                              .kv("bytesWritten", ret).toString());
    }
} catch (NativeIOException ne) {
    throw new IOException(exMsg(ne.getMessage())
                          .kv("filename", filename)
                          .kv("offset", offsetToWrite)
                          .kv("writeSize", bytesToWrite)
                          .kv("pointer", bufferToFlush.pointer())
                          .kv("errno", ne.getErrno()).toString(), ne);
} finally {
    bufferPool.release(bufferToFlush);
}
return null;
Contributor

same as lines 120+? consider refactoring

Contributor Author

The max line length is 112, so no need to refactor.

Contributor

I meant the code at

Future<?> f = writeExecutor.submit(() -> {
    try {
        int ret = nativeIO.pwrite(fd, tmpBuffer.pointer(), bytesToWrite, offset);
        if (ret != bytesToWrite) {
            throw new IOException(exMsg("Incomplete write")
                                  .kv("filename", filename)
                                  .kv("writeSize", bytesToWrite)
                                  .kv("bytesWritten", ret)
                                  .kv("offset", offset).toString());
        }
    } catch (NativeIOException ne) {
        throw new IOException(exMsg("Write error")
                              .kv("filename", filename)
                              .kv("writeSize", bytesToWrite)
                              .kv("errno", ne.getErrno())
                              .kv("offset", offset).toString());
    } finally {
        bufferPool.release(tmpBuffer);
    }
    return null;
});
looks similar, can probably refactor, this will also guarantee "if (bytesToWrite <= 0)" check in both places

Contributor Author

done
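
A rough sketch of what a shared helper for the two call sites could look like, so that the size check, the short-write check, and the buffer release live in one place. The `PositionalWriter` interface is a hypothetical stand-in for the native pwrite call, and the helper name is invented for illustration; the refactoring actually applied in the PR may differ.

```java
import java.io.IOException;

final class CheckedWriteSketch {
    /** Hypothetical stand-in for a positional native write such as pwrite. */
    interface PositionalWriter {
        int pwrite(long offset, int length) throws IOException;
    }

    /**
     * Writes bytesToWrite bytes at offset, failing on a short write.
     * The buffer is always released, including on the early return.
     */
    static void writeAndRelease(PositionalWriter writer, String filename,
                                long offset, int bytesToWrite,
                                Runnable releaseBuffer) throws IOException {
        try {
            if (bytesToWrite <= 0) {
                return; // nothing to write; the check now covers every caller
            }
            int ret = writer.pwrite(offset, bytesToWrite);
            if (ret != bytesToWrite) {
                throw new IOException("Incomplete write: filename=" + filename
                        + " offset=" + offset
                        + " writeSize=" + bytesToWrite
                        + " bytesWritten=" + ret);
            }
        } finally {
            releaseBuffer.run();
        }
    }
}
```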

final BufferPool bufferPool;
final ExecutorService writeExecutor;
final Object bufferLock = new Object();
final List<Future<?>> outstandingWrites = new ArrayList<Future<?>>();
Contributor

can it be ConcurrentLinkedQueue to avoid synchronization later?

Contributor Author

Maybe not. Two methods, addOutstandingWrite and waitForOutstandingWrites, use outstandingWrites. They don't just add and remove elements; they also use an iterator to traverse the list contents. We need synchronization to keep it in order.

try {
fd = nativeIO.open(filename,
NativeIO.O_CREAT | NativeIO.O_WRONLY | NativeIO.O_DIRECT,
00755);
Contributor

00755 as in permissions? Why 755 and not e.g. 600 or 640?
"755 means read and execute access for everyone and also write access for the owner of the file."
Why are we granting execute access to this file?
Why is read access needed for everyone?

Contributor Author

644 is better. The current ledger log file's permission is 644, so we can keep it the same.
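
To make the difference between the two modes concrete, here is a small sketch that decodes an octal mode into Java's `PosixFilePermission` set. The `fromMode` helper is hypothetical and written only for this illustration; note that a Java integer literal with a leading zero, such as 00755, is octal.

```java
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.EnumSet;
import java.util.Set;

final class FileModeSketch {
    // Permission bits from most significant (owner read, 0400) to least (others execute, 0001).
    private static final PosixFilePermission[] BITS = {
        PosixFilePermission.OWNER_READ, PosixFilePermission.OWNER_WRITE, PosixFilePermission.OWNER_EXECUTE,
        PosixFilePermission.GROUP_READ, PosixFilePermission.GROUP_WRITE, PosixFilePermission.GROUP_EXECUTE,
        PosixFilePermission.OTHERS_READ, PosixFilePermission.OTHERS_WRITE, PosixFilePermission.OTHERS_EXECUTE
    };

    /** Decodes the low nine bits of a POSIX mode into a permission set. */
    static Set<PosixFilePermission> fromMode(int mode) {
        Set<PosixFilePermission> perms = EnumSet.noneOf(PosixFilePermission.class);
        for (int i = 0; i < BITS.length; i++) {
            if ((mode & (1 << (8 - i))) != 0) {
                perms.add(BITS[i]);
            }
        }
        return perms;
    }

    public static void main(String[] args) {
        System.out.println(PosixFilePermissions.toString(fromMode(00755)));
        System.out.println(PosixFilePermissions.toString(fromMode(0644)));
    }
}
```

With 644 the execute bits are gone and only the owner can write, while group and others keep read access.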

flushBuffer();
}
}

Contributor

I think you should do waitForOutstandingWrites(); here or call flush() instead of flushBuffer() to confirm that the last outstanding write (submitted e.g. by flushBuffer()) actually succeeded.
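
The concern can be illustrated with a minimal sketch, which assumes (as the PR description says) that a write call may return before the syscall occurs. All names are hypothetical stand-ins rather than the patch's classes: a failed write submitted by `flushBuffer` only becomes visible when someone waits on its `Future`.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class FlushWaitSketch {
    private final ExecutorService writeExecutor = Executors.newSingleThreadExecutor();
    private final List<Future<?>> outstandingWrites = new ArrayList<>();

    /** Queues the write and returns immediately; errors are not visible here. */
    void flushBuffer(Callable<Void> write) {
        Future<?> future = writeExecutor.submit(write);
        synchronized (outstandingWrites) {
            outstandingWrites.add(future);
        }
    }

    /** Waits for every queued write and surfaces the first failure. */
    void flush() throws IOException {
        List<Future<?>> pending;
        synchronized (outstandingWrites) {
            pending = new ArrayList<>(outstandingWrites);
            outstandingWrites.clear();
        }
        for (Future<?> future : pending) {
            try {
                future.get();
            } catch (ExecutionException e) {
                throw new IOException("Write failed", e.getCause());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("Interrupted while waiting for writes", e);
            }
        }
    }

    void close() {
        writeExecutor.shutdown();
    }
}
```

If the final step of a close path only queues the write, a failure in that last write would go unnoticed; waiting on the futures is what turns it into an exception.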

@hangc0276
Contributor Author

The biggest difference is the padding. Direct I/O must write in aligned blocks.
Another difference in the format is that the header is now 4K rather than 1K. Again, this is to allow aligned writes. No changes are necessary to allow the existing entry logger to deal with the header change, as we create a dummy entry in the extra header space that the existing entry logger already knows to ignore.

I think this calls for the addition of backwards-compat tests covering the upgrade/downgrade scenario (data written with these changes can be read by the previous version of BK and vice versa).

https://github.com/apache/bookkeeper/tree/master/tests/backward-compat

OK, I will add the backwards-compat tests.

@hangc0276 hangc0276 requested review from dlg99 and eolivelli June 13, 2022 07:41
help = "Operation stats of reading entries from the entry log",
parent = BOOKIE_READ_ENTRY
)
private final ThreadLocal<OpStatsLogger> readEntryStats;
Contributor

ThreadLocal must be static, because it is actually not bound to a specific instance.

Contributor Author

The stats that use ThreadLocal depend on the stats parameter that is passed in, so it is hard to make them static.

Contributor

https://stackoverflow.com/a/2784059/2237794
You are getting per thread/per instance in this case instead of per thread

Contributor Author

done
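
The distinction raised in this thread can be shown with a small sketch (class names are invented for illustration): an instance-field ThreadLocal yields one value per thread per instance, while a static ThreadLocal yields one value per thread shared by all instances.

```java
final class ThreadLocalScopeSketch {
    /** Non-static ThreadLocal: each instance gets its own per-thread counter. */
    static final class PerInstance {
        private final ThreadLocal<int[]> counter = ThreadLocal.withInitial(() -> new int[1]);

        int increment() {
            return ++counter.get()[0];
        }
    }

    /** Static ThreadLocal: all instances share one per-thread counter. */
    static final class Shared {
        private static final ThreadLocal<int[]> COUNTER = ThreadLocal.withInitial(() -> new int[1]);

        int increment() {
            return ++COUNTER.get()[0];
        }
    }

    public static void main(String[] args) {
        PerInstance a = new PerInstance();
        PerInstance b = new PerInstance();
        System.out.println(a.increment() + " " + b.increment()); // independent counters

        Shared c = new Shared();
        Shared d = new Shared();
        System.out.println(c.increment() + " " + d.increment()); // same counter on this thread
    }
}
```

Which behaviour is wanted depends on whether the per-thread state is meant to be tied to a particular logger instance or to the thread alone.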

// DirectEntryLogger, DefaultEntryLogger and DirectEntryLogger.
// DirectEntryLogger -> DefaultEntryLogger -> DirectEntryLogger.
@Test
public void testCompatFromDirectToDefaultToDirectLogger() throws Exception {
Contributor Author

@dlg99 I have added the compat test here.

@hangc0276
Contributor Author

The biggest difference is the padding. Direct I/O must write in aligned blocks.
Another difference in the format is that the header is now 4K rather than 1K. Again, this is to allow aligned writes. No changes are necessary to allow the existing entry logger to deal with the header change, as we create a dummy entry in the extra header space that the existing entry logger already knows to ignore.

I think this calls for the addition of backwards-compat tests covering the upgrade/downgrade scenario (data written with these changes can be read by the previous version of BK and vice versa).

https://github.com/apache/bookkeeper/tree/master/tests/backward-compat

@dlg99 I have added all the compat tests in TestDirectEntryLoggerCompat to ensure that switching between DefaultEntryLogger and DirectEntryLogger doesn't affect entry reads and writes.

@hangc0276
Contributor Author

All the comments have been addressed; please help take a look. This PR is huge, so we need more eyes.

@merlimat @eolivelli @dlg99 @ivankelly @rdhabalia @RaulGracia @reddycharan @Ghatage

@hangc0276 hangc0276 force-pushed the chenhang/directio-part6-direct-entrylogger-support branch from 3c93aed to 44f0f1a Compare June 20, 2022 01:34
Contributor

@eolivelli left a comment

LGTM

Contributor

@dlg99 left a comment

Overall LGTM.
I think the non-static ThreadLocal is not doing what you wanted it to do, please check if there are ways to solve this

@hangc0276
Contributor Author

rerun failure checks

Ivan Kelly and others added 15 commits June 22, 2022 22:07
The implementation uses JNI to do direct I/O to files via POSIX
syscalls. Fallocate is used if running on Linux; otherwise this is
skipped (at the cost of more filesystem operations during writing).

There are two calls to write, writeAt and writeDelimited. I expect
writeAt to be used for the entrylog headers, while entries will go
through writeDelimited. In both cases, the calls may return before the
syscalls occur. #flush() needs to be called to ensure things are
actually written.

The entry log format isn't much changed from what is used by the
existing entrylogger. The biggest difference is the padding. Direct
I/O must write in aligned blocks. The size of the alignment varies by
machine configuration, but 4K is a safe bet on most. As it is unlikely
that entry data will land exactly on the alignment boundary, we need
to add padding to writes. The existing entry logger has been changed
to take this padding into account. When read as a signed int/long/byte
the padding will always parse to a negative value, which distinguishes
it from valid entry data (the entry size will always be positive) and
also from preallocated space (which is always 0).

Another difference in the format is that the header is now 4K rather
than 1K. Again, this is to allow aligned writes. No changes are
necessary to allow the existing entry logger to deal with the header
change, as we create a dummy entry in the extra header space that the
existing entry logger already knows to ignore.

To enable, set dbStorage_directIOEntryLogger=true in the configuration.

(cherry picked from commit 09a1c81)
@hangc0276 hangc0276 force-pushed the chenhang/directio-part6-direct-entrylogger-support branch from 8adf1ba to 726d359 Compare June 22, 2022 14:08
@hangc0276
Contributor Author

rerun failure checks

Contributor

@merlimat left a comment

Thanks for carrying on this huge set of changes. Great work!

@merlimat merlimat merged commit 677ccec into apache:master Jun 23, 2022
@hangc0276 hangc0276 self-assigned this Jul 25, 2022
@hangc0276 hangc0276 added this to the 4.16.0 milestone Jul 25, 2022
Ghatage pushed a commit to sijie/bookkeeper that referenced this pull request Jul 12, 2024
* Direct I/O entry log support

* format code

* format code

* fix spotbugs check

* fix jnilib not found

* fix jnilib not found

* address comments

* fix so lib not found

* address comments

* format code

* add compat test

* add compat test

* address comments

* fix findbugs failed

* format code

Co-authored-by: Ivan Kelly <ikelly@splunk.com>