-
Notifications
You must be signed in to change notification settings - Fork 4.8k
HIVE-11495: Add aborted reason to transaction information #3908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
veghlaci05
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ticket says:
"Should add TXNS.COMMENT field or something..."
According to my understanding the abort reason should be presisted to the metastore, and should be shown using the show transactions command, or at least should be present in the TRANSACTIONS view
This part is completely missing from the PR. Do I missunderstand sth?
...tastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/txn/TxnErrorMsg.java
Outdated
Show resolved
Hide resolved
| return errorCode; | ||
| } | ||
|
|
||
| public String getTxnErrorMsg() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getTxnErrorMsg is not used at all. Where the error code translates to error message? How the users will know which code means what?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I will remove this function. The mapping of errorCode to TxnErrorMsg is done here -
https://github.com/apache/hive/pull/3908/files#diff-22cbee1c93bb4d0cedd41531cd2d8631483ec310a6727bbdb2d73f8c830b6d95R74-R81
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This link redirects to iteslf for some reason but I guess you wanted to point to line 1099 in TxnHandler. My main concern about the messages what I have also added as a general comment: The messages are not persisted and not shown to the users in any way (except in the logs).
| /** | ||
| * The following class represents all the error messages that are handled for aborts. | ||
| */ | ||
| public enum TxnErrorMsg { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some messages are not real reasons, rather indicate only the type of the aborted txn. Please try to add a real reason as well, whenever possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these messages are written based on where abortTxn()/abortTxns() is called. This represents the totality of how transactions are aborted from HiveServer. Anything else has not come from Hiveserver but rather from other services which might use transactional features of Hive (I have tried to represent them here as NONE). These are the only reasons where abort is done from Hiveserver.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
|
@veghlaci05 Based on a earlier conversation for the same ticket - #3656 (comment) |
In this please ignore my comments regarding the presistence. While I still think persisting would be better, I have to admit it would make this task significantly more complicated. (As @deniskuzZ pointed out, aborted TXNs are cleaned up quickly). |
veghlaci05
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, pending tests
* Source Iceberg PR - Core: Remove deprecated APIs up to 0.13.0 * Revert "HIVE-25563: Iceberg table operations hang a long time if metadata is missing/corrupted (Adam Szita, reviewed by Marton Bod)" - applying instead Hive: Limit number of retries when metadata file is missing (apache#3379) This reverts commit 7b600fe. * Source Iceberg PR - Hive: Limit number of retries when metadata file is missing (apache#3379) * Source Iceberg PR - Hive: Fix RetryingMetaStoreClient for Hive 2.1 (apache#3403) * Source Iceberg PR - Switch from new HashMap to Maps.newHashMap (apache#3648) * Source Iceberg PR - Hive: HiveCatalog should remove HMS stats for certain engines based on config (apache#3652) - Use the Iceberg config property * Source Iceberg PR - Core: If status check fails, commit should be unknown (apache#3717) * Source Iceberg PR - Build: Add checkstyle rule for instantiating HashMap, HashSet, ArrayList (apache#3689) * Source Iceberg PR - Test: Make sure to delete temp folders (apache#3790) * Source Iceberg PR - API: Register existing tables in Iceberg HiveCatalog (apache#3851) * Source Iceberg PR - Hive: Make Iceberg table filter optional in HiveCatalog (apache#3908) * Source Iceberg PR - Core: Add reserved UUID Table Property and Expose in HMS. (apache#3914) * Source Iceberg PR - Hive: Known exception should not become CommitStateUnknownException (apache#4261) * Source Iceberg PR - Build: Add missing @OverRide annotations (apache#3654)
| * @param errorCode Error code specifying the reason for abort. | ||
| * @throws TException | ||
| */ | ||
| void abortTxns(List<Long> txnids, long errorCode) throws TException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should make this method generic and use Request object here as well, see
CompactionResponse compact2(CompactionRequest request)
void dropDatabase(DropDatabaseRequest req)
etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented. Done.
| } else if (isHiveReplTxn) { | ||
| errorMsg = TxnErrorMsg.ABORT_DEFAULT_REPL_TXN; | ||
| } else { | ||
| errorMsg = TxnErrorMsg.ABORT_ROLLBACK; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't you set it in the var declaration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assigned TxnErrorMsg.NONE as default. Done.
| @RetrySemantics.Idempotent | ||
| public void abortTxns(AbortTxnsRequest rqst) throws MetaException { | ||
| List<Long> txnIds = rqst.getTxn_ids(); | ||
| TxnErrorMsg txnErrorMsg = TxnErrorMsg.NONE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the above method (abortTxn(AbortTxnRequest rqst)), default value was ABORT_ROLLBACK and here it's NONE. Is this expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed default values in all places to TxnErrorMsg.NONE. Done.
| if (txnids.isEmpty()) { | ||
| return 0; | ||
| } | ||
| LOG.info("Trying to abort txns due to : {}", errorMsg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe debug here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| if (MetastoreConf.getBoolVar(conf, MetastoreConf.ConfVars.METASTORE_ACIDMETRICS_EXT_ON)) { | ||
| Metrics.getOrCreateCounter(MetricsConstants.TOTAL_NUM_ABORTED_TXNS).inc(txnids.size()); | ||
| } | ||
| LOG.info("Txns are aborted successfully due to : {}", errorMsg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message is not informative: something aborted because of ...
Maybe print ids of aborted txns? Make sure we are not doing unnecessary logging. see
performTimeOuts()
LOG.info("Aborted the following transactions due to timeout: {}", batchToAbort.toString());
LOG.info("Aborted {} transactions due to timeout", numTxnsAborted);
and
commitTxn()
String msg = "Aborting [" + JavaUtils.txnIdToString(txnid) + "," + commitId + "]" + " due to a write conflict on " + resource +
" committed by " + committedTxn + " " + rs.getString(7) + "/" + rs.getString(8);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For performTimeOuts()
Removed this log - LOG.info("Aborted the following transactions due to timeout: {}", batchToAbort.toString());
For commitTxn()
Changed this log to include only the committed transaction details and not the ones that are going to be aborted.
Added aborted transaction IDs in the generic log.
| if (MetastoreConf.getBoolVar(conf, MetastoreConf.ConfVars.METASTORE_ACIDMETRICS_EXT_ON)) { | ||
| Metrics.getOrCreateCounter(MetricsConstants.TOTAL_NUM_ABORTED_TXNS).inc(txnids.size()); | ||
| } | ||
| LOG.info("Txns are aborted successfully due to : {}", errorMsg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we populate TXN_META_INFO with aborted reason as well?
see //todo: should make abortTxns() write something into TXNS.TXN_META_INFO about this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You had mentioned that we should avoid persisting due to the below reason -
#3656 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simhadri wanted to add a new column in TXNS table for that, but looks like we already have one TXN_META_INFO. In some situations that might be useful, but I'll leave it up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the message contents in this column. Done.
3c2be0b to
c76271a
Compare
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/Msck.java
Outdated
Show resolved
Hide resolved
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/Msck.java
Show resolved
Hide resolved
...tastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/txn/TxnErrorMsg.java
Outdated
Show resolved
Hide resolved
| errorMsg = TxnErrorMsg.ABORT_REPLAYED_REPL_TXN; | ||
| } else if (isHiveReplTxn) { | ||
| errorMsg = TxnErrorMsg.ABORT_DEFAULT_REPL_TXN; | ||
| } else if (rqst.isSetErrorCode() && rqst.getErrorCode() == TxnErrorMsg.ABORT_ROLLBACK.getErrorCode()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this? why not
if (rqst.isSetErrorCode() {
errorMsg = TxnErrorMsg.getErrorMsg(rqst.getErrorCode())
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should isReplayedReplTxn override rqst.getErrorCode()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the condition to -
if (rqst.isSetErrorCode() { errorMsg = TxnErrorMsg.getErrorMsg(rqst.getErrorCode()) }
As far as isReplayedReplTxn is concerned. There was no such intention of keeping it first in if else block.
Please note that isReplayedReplTxn & isHiveReplTxn is set only when request is passed through replRollbackTxn() call in the MetastoreClient.
...etastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/txn/TxnHandler.java
Outdated
Show resolved
Hide resolved
| numTxnsAborted += batchToAbort.size(); | ||
| //todo: add TXNS.COMMENT filed and set it to 'aborted by system due to timeout' | ||
| Collections.sort(batchToAbort);//easier to read logs | ||
| LOG.info("Aborted the following transactions due to timeout: {}", batchToAbort.toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
above sorting was only needed for the logging, once you removed it, it became redundant, maybe it should be moved to abort method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line 5790 prints the same message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added sort() in the generic logs section.
Line 5790 - This log prints the number of aborted txns. Included this info (number of aborted txns) in the generic logs as well.
streaming/src/java/org/apache/hive/streaming/TransactionBatch.java
Outdated
Show resolved
Hide resolved
| if (MetastoreConf.getBoolVar(conf, MetastoreConf.ConfVars.METASTORE_ACIDMETRICS_EXT_ON)) { | ||
| Metrics.getOrCreateCounter(MetricsConstants.TOTAL_NUM_ABORTED_TXNS).inc(txnids.size()); | ||
| } | ||
| LOG.warn("Txns are aborted successfully due to : {}. Aborted Transaction IDs : {}", errorMsg, txnids); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LOG.info("Aborted transactions {} due to {}", txnids, errorMsg);
errorMsg should be just text, like "heartbeat time-out"
either update the msg in TxnErrorMsg or use it as a pattern for the string formatter
ABORT_ROLLBACK(50005, "<<<Txn aborted due to>>> rollback"),
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used this format of logging as suggested -
LOG.info("Aborted transactions {} due to {}", txnids, errorMsg);
Updated the error msgs to make it concise in TxnErrorMsg.
| if (txnids.isEmpty()) { | ||
| return 0; | ||
| } | ||
| LOG.debug("Trying to abort txns due to : {}. Aborted Transaction IDs : {}", errorMsg, txnids); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LOG.debug("Trying to abort txns {} due to {}", txnids, errorMsg)
see above comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented as suggested.
deniskuzZ
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, pending tests
Please check that "Aborting {} transaction(s) {} due to {}", txnids.size(), txnids, errorMsg
prints the list properly.
...etastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/txn/TxnHandler.java
Outdated
Show resolved
Hide resolved
...tastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/txn/TxnErrorMsg.java
Outdated
Show resolved
Hide resolved
...tastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/txn/TxnErrorMsg.java
Outdated
Show resolved
Hide resolved
Just checked this. It displays in this way - |
|
Kudos, SonarCloud Quality Gate passed! |
…dhya, reviewed by Denys Kuzmenko, Laszlo Vegh) Closes apache#3908
…dhya, reviewed by Denys Kuzmenko, Laszlo Vegh) Closes apache#3908








What changes were proposed in this pull request?
Add aborted reason to transaction information
Why are the changes needed?
For better debuggability and understanding of why a transaction is aborted.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Most of the aborted reason messages are logging, hence was able to check that the appropriate messages are logged through existing tests.