Skip to content

Conversation

@ccoffline
Copy link
Contributor

@ccoffline ccoffline commented Aug 10, 2021

Proposed changes

fix #5378 #5391 #5688 #5973 #6155 and all replay NPE. All replay method can now throw MetaNotFoundException and caught to log a warning for potential inconsistent metadata cases.

try to establish a clear notice for future developer to check null.

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)
  • Code refactor (Modify the code structure, format the code, etc...)
  • Optimization. Including functional usability improvements and performance improvements.
  • Dependency. Such as changes related to third-party components.
  • Other.

Checklist

  • I have created an issue on (Fix [Proposal] Enforce null check at Catalog.getDb and Database.getTable #6386) and described the bug/feature there in detail
  • Compiling and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • If these changes need document changes, I have updated the document
  • Any dependent changes have been merged

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

} finally {
olapTable.readUnlock();
}
} catch (MetaNotFoundException ignored) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some log?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is equals to continue

} finally {
tbl.writeUnlock();
} catch (MetaNotFoundException e) {
LOG.warn("[INCONSISTENT META] changing table status failed after rollup job done", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a INCONSISTENT META

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is called by replayAlterJobV2() and onJobDone(), which should both not throw this exception.

@morningman morningman added area/catalog Issues or PRs related to catalog management kind/improvement labels Aug 31, 2021
}

public void replayModifyComment(ModifyCommentOperationLog operation) {
public void replayModifyComment(ModifyCommentOperationLog operation) throws MetaNotFoundException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a special Exception to handle those "replay" logic.
I am worried that MetaNotFoundException will be thrown elsewhere in the replay logic, but it may be unexpected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is considerable. But "replay" throwing any exception is an extremely big risk, which will cause all FE crush and cannot recover. These MetaNotFoundException are mostly thrown by getDb and getTable, due to the lock inconsistence that makes editlogs out of order. Semantically, throwing this exception means some metadata the editlog want to edit on is missing during replay, which makes this replay unable to continue. Just like getDb/Table, This kind of inconsistence cannot recover anyway, and these editlogs are theoretically only affect lost metadata.
I prefer to use MetaNotFoundException that indicate metadata has lost rather than other exception which may cause confuse. The inconsistence can be checked from warning log.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ccoffline how about check db or table state after getting db or table lock , so that the edit log will always keep in order? we don't need some much check code?

Copy link
Contributor Author

@ccoffline ccoffline Aug 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@caiconghui we don't have any promise that the edit logs are in order. These check code is to prevent the worst from happening.
Edit logs that out of order may cause meta inconsistent, which has to be fixed sooner or later. We are exploring ways to ensure consistency and minimize the cost. Until then, we have to check all NPE or let all the FE to crash.

Copy link
Contributor

@caiconghui caiconghui Aug 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ccoffline so we only need to simply check null for getdb and get table in replay operation and just simply skip the following operation? now, the only way that fe may crash is with that replay operation when fe restart?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it better to split this PR into two PRs ? one for code refactor and another for db and table null check, I think it is easy to review and not-error-prone when PR is not so big? @ccoffline

// we log rollup job cancelled even if db is dropped.
// so check db != null here
Database db = catalog.getDbOrMetaException(alterJob.getDbId());
alterJob.replayCancel(db);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not same logic as before.
If db is dropped, addFinishedOrCancelledAlterJob() will not be called.
better using try ... finally ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix this.

break;
}
} catch (MetaNotFoundException e) {
LOG.warn("[INCONSISTENT META] replay rollup job failed {}", replayedJob.getJobId(), e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why catch here? It is already been caught in Editlog's loadJournal() method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@morningman morningman Aug 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

try {
db = Catalog.getCurrentCatalog().getDbOrMetaException(dbId);
} catch (MetaNotFoundException e) {
throw new IOException("error happens when parsing create materialized view stmt: " + origStmt, e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it continue?
If db does not exist, when this job is being executed, it will check it and cancel the job.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The origin code will throw NPE, I really don't know what to do here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, just keep it as before.

OlapTable olapTable;
try {
olapTable = (OlapTable) db.getTableOrThrowException(tableName, Table.TableType.OLAP);
olapTable = db.getTableOrMetaException(tableName, Table.TableType.OLAP);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just call getTableOrDdlException() ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, that make sense

olapTable = db.getTableOrMetaException(tblName, Table.TableType.OLAP);
} catch (MetaNotFoundException e) {
return ResponseEntityBuilder.okWithCommonError(e.getMessage());
return ResponseEntityBuilder.notFound(e.getMessage());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use okWithCommonError.
404 is not for I can't find object in Doris meta, it is just for url not found in http protocol.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

table = db.getTableOrMetaException(tblName, Table.TableType.OLAP);
} catch (MetaNotFoundException e) {
return ResponseEntityBuilder.okWithCommonError(e.getMessage());
return ResponseEntityBuilder.notFound(e.getMessage());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use okWithCommonError

try {
db = Catalog.getCurrentCatalog().getDbOrMetaException(fullDbName);
} catch (MetaNotFoundException e) {
return ResponseEntityBuilder.notFound(e.getMessage());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okWithCommonError

} finally {
db.readUnlock();
} catch (MetaNotFoundException e) {
return ResponseEntityBuilder.notFound(e.getMessage());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okWithCommonError

}
}
} catch (MetaNotFoundException e) {
LOG.warn("[INCONSISTENT META] replay failed {}", journal, e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add e.getMessage() so that it is easier to grep in one line.
And some of journal may not implement toString() method. So I think we can just print op code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered printing the original journal content, but gave up because of the complicated deserialization. JournalEntity has implemented toString() method and has already print the op code. It's better for all the Writable objects to implement toString() method.
I'll add e.getMessage() in one line and keep the exception stack trace.

}
if (dataDescription.getMergeType() != LoadTask.MergeType.APPEND
&& !((table instanceof OlapTable) && ((OlapTable) table).hasDeleteSign()) ) {
if (dataDescription.getMergeType() != LoadTask.MergeType.APPEND && table.hasDeleteSign()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (dataDescription.getMergeType() != LoadTask.MergeType.APPEND && table.hasDeleteSign()) {
if (dataDescription.getMergeType() != LoadTask.MergeType.APPEND && !table.hasDeleteSign()) {

Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions
Copy link
Contributor

github-actions bot commented Sep 3, 2021

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added approved Indicates a PR has been approved by one committer. reviewed labels Sep 3, 2021
@github-actions
Copy link
Contributor

github-actions bot commented Sep 3, 2021

PR approved by anyone and no changes requested.

@morningman morningman merged commit df54b34 into apache:master Sep 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. area/catalog Issues or PRs related to catalog management kind/improvement reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Proposal] Enforce null check at Catalog.getDb and Database.getTable

4 participants