-
Notifications
You must be signed in to change notification settings - Fork 3.7k
#6386 Enforce null check at Catalog.getDb and Database.getTable #6416
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| } finally { | ||
| olapTable.readUnlock(); | ||
| } | ||
| } catch (MetaNotFoundException ignored) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is equals to continue
| } finally { | ||
| tbl.writeUnlock(); | ||
| } catch (MetaNotFoundException e) { | ||
| LOG.warn("[INCONSISTENT META] changing table status failed after rollup job done", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a INCONSISTENT META
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is called by replayAlterJobV2() and onJobDone(), which should both not throw this exception.
| } | ||
|
|
||
| public void replayModifyComment(ModifyCommentOperationLog operation) { | ||
| public void replayModifyComment(ModifyCommentOperationLog operation) throws MetaNotFoundException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a special Exception to handle those "replay" logic.
I am worried that MetaNotFoundException will be thrown elsewhere in the replay logic, but it may be unexpected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is considerable. But "replay" throwing any exception is an extremely big risk, which will cause all FE crush and cannot recover. These MetaNotFoundException are mostly thrown by getDb and getTable, due to the lock inconsistence that makes editlogs out of order. Semantically, throwing this exception means some metadata the editlog want to edit on is missing during replay, which makes this replay unable to continue. Just like getDb/Table, This kind of inconsistence cannot recover anyway, and these editlogs are theoretically only affect lost metadata.
I prefer to use MetaNotFoundException that indicate metadata has lost rather than other exception which may cause confuse. The inconsistence can be checked from warning log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ccoffline how about check db or table state after getting db or table lock , so that the edit log will always keep in order? we don't need some much check code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@caiconghui we don't have any promise that the edit logs are in order. These check code is to prevent the worst from happening.
Edit logs that out of order may cause meta inconsistent, which has to be fixed sooner or later. We are exploring ways to ensure consistency and minimize the cost. Until then, we have to check all NPE or let all the FE to crash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ccoffline so we only need to simply check null for getdb and get table in replay operation and just simply skip the following operation? now, the only way that fe may crash is with that replay operation when fe restart?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it better to split this PR into two PRs ? one for code refactor and another for db and table null check, I think it is easy to review and not-error-prone when PR is not so big? @ccoffline
| // we log rollup job cancelled even if db is dropped. | ||
| // so check db != null here | ||
| Database db = catalog.getDbOrMetaException(alterJob.getDbId()); | ||
| alterJob.replayCancel(db); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not same logic as before.
If db is dropped, addFinishedOrCancelledAlterJob() will not be called.
better using try ... finally ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll fix this.
| break; | ||
| } | ||
| } catch (MetaNotFoundException e) { | ||
| LOG.warn("[INCONSISTENT META] replay rollup job failed {}", replayedJob.getJobId(), e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why catch here? It is already been caught in Editlog's loadJournal() method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is called by replayAlterJobV2. I'm afraid it would be different from the origin.
https://github.com/apache/incubator-doris/blob/138e7e896dee6842d8af7583d59288190371cd86/fe/fe-core/src/main/java/org/apache/doris/alter/AlterHandler.java#L460-L469
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
| try { | ||
| db = Catalog.getCurrentCatalog().getDbOrMetaException(dbId); | ||
| } catch (MetaNotFoundException e) { | ||
| throw new IOException("error happens when parsing create materialized view stmt: " + origStmt, e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it continue?
If db does not exist, when this job is being executed, it will check it and cancel the job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The origin code will throw NPE, I really don't know what to do here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, just keep it as before.
| OlapTable olapTable; | ||
| try { | ||
| olapTable = (OlapTable) db.getTableOrThrowException(tableName, Table.TableType.OLAP); | ||
| olapTable = db.getTableOrMetaException(tableName, Table.TableType.OLAP); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just call getTableOrDdlException() ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, that make sense
| olapTable = db.getTableOrMetaException(tblName, Table.TableType.OLAP); | ||
| } catch (MetaNotFoundException e) { | ||
| return ResponseEntityBuilder.okWithCommonError(e.getMessage()); | ||
| return ResponseEntityBuilder.notFound(e.getMessage()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use okWithCommonError.
404 is not for I can't find object in Doris meta, it is just for url not found in http protocol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
| table = db.getTableOrMetaException(tblName, Table.TableType.OLAP); | ||
| } catch (MetaNotFoundException e) { | ||
| return ResponseEntityBuilder.okWithCommonError(e.getMessage()); | ||
| return ResponseEntityBuilder.notFound(e.getMessage()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use okWithCommonError
| try { | ||
| db = Catalog.getCurrentCatalog().getDbOrMetaException(fullDbName); | ||
| } catch (MetaNotFoundException e) { | ||
| return ResponseEntityBuilder.notFound(e.getMessage()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okWithCommonError
| } finally { | ||
| db.readUnlock(); | ||
| } catch (MetaNotFoundException e) { | ||
| return ResponseEntityBuilder.notFound(e.getMessage()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okWithCommonError
| } | ||
| } | ||
| } catch (MetaNotFoundException e) { | ||
| LOG.warn("[INCONSISTENT META] replay failed {}", journal, e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add e.getMessage() so that it is easier to grep in one line.
And some of journal may not implement toString() method. So I think we can just print op code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered printing the original journal content, but gave up because of the complicated deserialization. JournalEntity has implemented toString() method and has already print the op code. It's better for all the Writable objects to implement toString() method.
I'll add e.getMessage() in one line and keep the exception stack trace.
| } | ||
| if (dataDescription.getMergeType() != LoadTask.MergeType.APPEND | ||
| && !((table instanceof OlapTable) && ((OlapTable) table).hasDeleteSign()) ) { | ||
| if (dataDescription.getMergeType() != LoadTask.MergeType.APPEND && table.hasDeleteSign()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if (dataDescription.getMergeType() != LoadTask.MergeType.APPEND && table.hasDeleteSign()) { | |
| if (dataDescription.getMergeType() != LoadTask.MergeType.APPEND && !table.hasDeleteSign()) { |
morningman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
morningman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
Proposed changes
fix #5378 #5391 #5688 #5973 #6155 and all replay NPE. All replay method can now throw
MetaNotFoundExceptionand caught to log a warning for potential inconsistent metadata cases.try to establish a clear notice for future developer to check null.
Types of changes
Checklist
Further comments
If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...