@krconv krconv commented Jan 15, 2026

The RegionReplicaSinkWriter.append() method checks table descriptors to determine if a table has region replication enabled (to decide whether to bypass the location cache). When a table is dropped concurrently, tableDescriptors.get(tableName) returns null, and the subsequent call to getRegionReplication() throws a NullPointerException.

This race condition can occur in the following scenario:

  1. WAL entries for a table are queued for replication to region replicas
  2. The table is dropped (via disable + drop or other means)
  3. Before the dropped table is added to the disabledAndDroppedTables cache (which happens when TableNotFoundException is caught during location lookup), the code attempts to read the table descriptor
  4. tableDescriptors.get() returns null for the now-deleted table
  5. NPE crashes the replication endpoint

Since RegionReplicaReplicationEndpoint handles replica updates for all tables on a RegionServer, a single dropped table crashes the entire endpoint. This stops replica updates for all regions (including those from unrelated tables) hosted by that RegionServer until it is restarted.
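
The failure mode and the guard can be illustrated outside HBase. The following is a minimal sketch, not the actual RegionReplicaSinkWriter code: `DescriptorGuardSketch`, its nested `Descriptor` class, and the map-backed store are hypothetical stand-ins. They assume only what is described above, namely that the descriptor lookup returns null once a table has been dropped.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified stand-in for the descriptor lookup; this is NOT the HBase API. */
final class DescriptorGuardSketch {

  /** Hypothetical minimal descriptor that carries only the replica count. */
  static final class Descriptor {
    final int regionReplication;

    Descriptor(int regionReplication) {
      this.regionReplication = regionReplication;
    }
  }

  // Hypothetical descriptor store: get() returns null for a table that was dropped.
  static final Map<String, Descriptor> DESCRIPTORS = new ConcurrentHashMap<>();

  /** Unguarded check: dereferences the lookup result and throws an NPE for a dropped table. */
  static boolean shouldBypassCacheUnsafe(String table) {
    return DESCRIPTORS.get(table).regionReplication > 1;
  }

  /** Guarded check: a missing descriptor is treated as "no region replication". */
  static boolean shouldBypassCache(String table) {
    Descriptor td = DESCRIPTORS.get(table);
    return td != null && td.regionReplication > 1;
  }

  public static void main(String[] args) {
    DESCRIPTORS.put("t1", new Descriptor(3));
    System.out.println("before drop: " + shouldBypassCache("t1"));

    DESCRIPTORS.remove("t1"); // simulate the concurrent drop
    System.out.println("after drop (guarded): " + shouldBypassCache("t1"));

    try {
      shouldBypassCacheUnsafe("t1");
    } catch (NullPointerException e) {
      System.out.println("after drop (unguarded): NullPointerException");
    }
  }
}
```

In this sketch the guarded variant simply falls through to the normal path when the descriptor is missing, leaving the dropped table to be handled by whatever logic already deals with missing tables.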

@charlesconnell charlesconnell changed the title from "Fix for NPE in region replication" to "HBASE-29831: Fix for NPE in region replication" Jan 15, 2026
@charlesconnell charlesconnell self-requested a review January 15, 2026 12:56

  if (useCache && locations.size() == 1) {
-   if (tableDescriptors.get(tableName).getRegionReplication() > 1 && retries <= 3) {
+   TableDescriptor td = tableDescriptors.get(tableName);
+   if (td != null && td.getRegionReplication() > 1 && retries <= 3) {
Member

I noticed through IDEA's smart suggestions that retries <= 3 seems unnecessary.

I analyzed it, and the check is indeed redundant.

After removing it, there are 3 main cases, and none lead to an infinite loop:

  • case 1

First loop: useCache && locations.size() == 1 && RegionReplication > 1 is true.
Set useCache = false and continue.
Second loop: The logic will proceed and eventually return or break.

  • case 2

First loop: useCache && locations.size() == 1 is true but RegionReplication > 1 is false.
Go to subsequent logic.
If !Bytes.equals(primaryLocation.getRegionInfo().getEncodedNameAsBytes(), encodedRegionName) is false: break (loop ends).
If !Bytes.equals(primaryLocation.getRegionInfo().getEncodedNameAsBytes(), encodedRegionName) is true and useCache is true: set useCache = false and continue. Second loop will then return or break.

  • case 3

First loop: useCache && locations.size() == 1 is false.
Go to subsequent logic.
If useCache is already false: return or break.
If useCache is true: similar to case 2, it will either break or retry once (setting useCache = false), then finish.
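
The three cases above can be checked with a small model of the loop's control flow. This is a sketch reconstructed from the cases described in this comment, not the real append() code: the class `RetryLoopSketch`, its boolean parameters, and the iteration counter are assumptions made for illustration. The model relies on one observation: every `continue` requires useCache to be true and sets it to false, so at most one retry can happen.

```java
/** Control-flow model of the retry loop described above; NOT the HBase code. */
final class RetryLoopSketch {

  /**
   * Runs the modeled loop once and returns how many iterations it executed.
   *
   * @param singleCachedLocation whether the cached lookup yields exactly one location
   * @param regionReplication    replica count from the table descriptor
   * @param cachedNameMatches    whether the cached primary location matches the region name
   * @param freshNameMatches     whether the uncached primary location matches the region name
   */
  static int iterations(boolean singleCachedLocation, int regionReplication,
      boolean cachedNameMatches, boolean freshNameMatches) {
    boolean useCache = true;
    int count = 0;
    while (true) {
      count++;
      // Case 1: a single cached location for a replicated table forces an uncached retry.
      if (useCache && singleCachedLocation && regionReplication > 1) {
        useCache = false;
        continue;
      }
      boolean nameMatches = useCache ? cachedNameMatches : freshNameMatches;
      if (!nameMatches) {
        if (useCache) {
          // Cases 2 and 3: retry exactly once without the cache.
          useCache = false;
          continue;
        }
        // Mismatch even without the cache: the loop ends here (return or break in the comment).
        return count;
      }
      break;
    }
    return count;
  }

  /** Largest iteration count over every combination of the modeled inputs. */
  static int maxIterations() {
    boolean[] flags = {false, true};
    int[] replications = {1, 3};
    int max = 0;
    for (boolean single : flags) {
      for (int replication : replications) {
        for (boolean cached : flags) {
          for (boolean fresh : flags) {
            max = Math.max(max, iterations(single, replication, cached, fresh));
          }
        }
      }
    }
    return max;
  }

  public static void main(String[] args) {
    System.out.println("max iterations without a retries guard: " + maxIterations());
  }
}
```

Under these assumptions no input combination needs a third iteration, which is consistent with dropping the retries <= 3 condition.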

Author

Thanks for walking through the logic here; I agree it is unneeded, just removed it

@guluo2016
Member

Is it possible to add a unit test for this? Thanks

Contributor

@chandrasekhar-188k chandrasekhar-188k left a comment

LGTM

@krconv krconv force-pushed the HBASE-29831-read-replicas-npe branch from d95de47 to 1f43067 January 20, 2026 17:15
@krconv
Author

krconv commented Jan 20, 2026

Thanks for the reviews! Added a new unit test that catches the original problem, and removed the unneeded retries variable. Also, last week we hit this problem on hundreds of hosts across all of the data centers where we use HBase; hopefully this fix helps others.

@Apache-HBase

🎊 +1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:--------|
| +0 🆗 | reexec | 1m 15s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 💚 | hbaseanti | 0m 0s | | Patch does not have any anti-patterns. |
| | | | | _ branch-2 Compile Tests _ |
| +1 💚 | mvninstall | 4m 29s | | branch-2 passed |
| +1 💚 | compile | 3m 45s | | branch-2 passed |
| +1 💚 | checkstyle | 0m 46s | | branch-2 passed |
| +1 💚 | spotbugs | 1m 53s | | branch-2 passed |
| +1 💚 | spotless | 0m 55s | | branch has no errors when running spotless:check. |
| | | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 4m 6s | | the patch passed |
| +1 💚 | compile | 3m 49s | | the patch passed |
| +1 💚 | javac | 3m 49s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 💚 | checkstyle | 0m 47s | | the patch passed |
| +1 💚 | spotbugs | 2m 15s | | the patch passed |
| +1 💚 | hadoopcheck | 24m 35s | | Patch does not cause any errors with Hadoop 2.10.2 or 3.3.6 3.4.1. |
| +1 💚 | spotless | 1m 2s | | patch has no errors when running spotless:check. |
| | | | | _ Other Tests _ |
| +1 💚 | asflicense | 0m 13s | | The patch does not generate ASF License warnings. |
| | | 52m 11s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7629/3/artifact/yetus-general-check/output/Dockerfile |
| GITHUB PR | #7629 |
| JIRA Issue | HBASE-29831 |
| Optional Tests | dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless |
| uname | Linux f5e8bc2464c7 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | branch-2 / 1f43067 |
| Default Java | Eclipse Adoptium-11.0.23+9 |
| Max. process+thread count | 79 (vs. ulimit of 30000) |
| modules | C: hbase-server U: hbase-server |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7629/3/console |
| versions | git=2.34.1 maven=3.9.8 spotbugs=4.7.3 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |

This message was automatically generated.
