-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-26093 Replication is stuck due to zero length wal file in oldWALs directory #3504
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
| try { | ||
| if (entryStream != null) { | ||
| entryStream.close(); | ||
| } | ||
| } catch (IOException ioe) { | ||
| LOG.warn("Exception while closing WALEntryStream", ioe); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be simplified as follows,
IOUtils.closeQuietly(entryStream, e -> LOG.warn("Exception while closing WALEntryStream", e));
| @Test | ||
| public void testEOFExceptionInOldWALsDirectory() throws Exception { | ||
| assertEquals(1, logQueue.getQueueSize(fakeWalGroupId)); | ||
| AbstractFSWAL abstractWAL = (AbstractFSWAL)log; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: We don't need this down cast, do we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bharathv It is needed. We are getting the current wal file name via AbstractFSWAL#getCurrentFileName so that we can truncate that wal file to create 0 size wal file. WAL class doesn't have this method.
|
🎊 +1 overall
This message was automatically generated. |
| } | ||
|
|
||
| private Path getArchivedLog(Path path) throws IOException { | ||
| Path getArchivedLog(Path path) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move this method to a util class so we do not need to change the modifier and also do not need to pass the WALEntryStream to the handleEofException?
| } catch (InterruptedException e) { | ||
| LOG.trace("Interrupted while sleeping between WAL reads or adding WAL batch to ship queue"); | ||
| Thread.currentThread().interrupt(); | ||
| } finally { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the advantage here to not use try-with-resources but a try finally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to use entryStream object in handleEofException method. If I use try-wth-resources, I don't have access to entryStream in catch block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so if we could move the main logic in WALEntryStream.getArchivedLog to a util class, then we do not need try finally then? In general, I do not think we should expose WALEntryStream.getArchivedLog directly. Let's add a public static method in AbstractFSWALProvider, just below the getWALArchiveDirectoryName method. I think this is the right place for holding a public method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Apache9 for the review. Didn't know a utility method AbstractFSWALProvider#getArchivedLogPath already exists.
Removed all the changes to make WALEntryStream class. Please review again.
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
| LOG.warn("Forcing removal of 0 length log in queue: {}", head); | ||
| if (!fs.exists(path)) { | ||
| // There is a chance that wal has moved to oldWALs directory, so look there also. | ||
| path = AbstractFSWALProvider.getArchivedLogPath(path, conf); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this is not very stable...
You need to try all the possible path, just like what we have done in WALEntryStream. So I suggest we move the main logic of WALEntryStream.getArchivedLog to AbstractFSWALProvider, as a util method, and we call this method in both places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Apache9 for the feedback. I added a new util method in AbstractFSWALProvider to look into both the oldWALs path if it can't find the log in WALs directory.
After the change, there are 2 methods in AbstractFSWALProvider now to get archiveLog path:
getArchivedLogThis is the method added by this patch.getArchivedLogPathThis was already present which looks into separate oldLogDir path based on config.
I think we can remove getArchivedLogPath method and replace all callers with getArchivedLog. WDYT ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, just saw this comment. Please see my new comment below, you need to keep getArchivedLogPath, as it is used when we want to archive a wal file.
It is OK to remove the getArchivedLog in WALEntryStream.
| * @return archived path if exists, path - otherwise | ||
| * @throws IOException exception | ||
| */ | ||
| public static Path getArchivedLog(Path path, Configuration conf) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better name it findArchivedLog.
The above getArchivedLogPath is used when we want to move a file into the archived directory, we need to check the configuration to see if we want to put the log into the separate directory or not.
And for this method, it is used when we want to read a wal file, so we need to try all the places to see if we can find it, as we may change the way on how to archive the log on the fly(not sure but in general we want to support onConfigurationChange for more and more configurations).
So name it findArchivedLog will be better, and maybe we could just return null, or return an Optional object when we can not find the wal file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above getArchivedLogPath is used when we want to move a file into the archived directory, we need to check the configuration to see if we want to put the log into the separate directory or not.
In production code, these 3 places are where we are calling getArchivedLogPath util method.
- AbstractFSWALProvider#openReader
- ReplicationSource#getFileSize
- WALInputFormat.WALRecordReader#nextKeyValue
@Apache9 All of the above calls are trying to find the log in archive path after they couldn't locate the wal in walsDir and they are not used for moving a log file to archive directory. Correct me if I am wrong.
AbstractFSWALProvider#getWALArchiveDirectoryName is the mehod we use to locate the wal archive directory to move the wal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good. I think you have found potential bugs in our code base. We should change them all to make use of the new method you introduced in this PR. Suggest opening a new PR to address the incorrect usages of getArchivedLogPath.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. Thank you !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest opening a new PR to address the incorrect usages of getArchivedLogPath.
Created https://issues.apache.org/jira/browse/HBASE-26106 for this.
|
🎊 +1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
TestReplicationKillMasterRSWithSeparateOldWALs is a valid failure. It is failing here with the following stack trace I printed the wal file name and it failed with NPE above. Interestingly the path contains oldWALs directory and it is trying to find archive log for oldWALs directory. Below is the method of findArchivedLog method @Apache9 could we exit |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
Ran both failing tests on local machine and succeeded @Apache9 could you please review. |
|
Those test failures do not seem related. |
|
Let's give 24 hours for more responses or activity here. Otherwise I will merge this and pick it back to branch-2.4 and proceed with the 2.4.5 RC tomorrow (Friday PST). |
…Ls directory (#3504) Signed-off-by: Andrew Purtell <apurtell@apache.org> Signed-off-by: Bharath Vissapragada <bharathv@apache.org> Signed-off-by: Duo Zhang <zhangduo@apache.org>
…Ls directory (#3504) Signed-off-by: Andrew Purtell <apurtell@apache.org> Signed-off-by: Bharath Vissapragada <bharathv@apache.org> Signed-off-by: Duo Zhang <zhangduo@apache.org>
No description provided.