Skip to content

Conversation

@SourabhBadhya
Copy link
Contributor

What changes were proposed in this pull request?

Add aborted reason to transaction information

Why are the changes needed?

For better debuggability and understanding of why a transaction is aborted.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Most of the aborted reason messages are logging, hence was able to check that the appropriate messages are logged through existing tests.

Copy link
Contributor

@veghlaci05 veghlaci05 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ticket says:

"Should add TXNS.COMMENT field or something..."

According to my understanding the abort reason should be presisted to the metastore, and should be shown using the show transactions command, or at least should be present in the TRANSACTIONS view

This part is completely missing from the PR. Do I missunderstand sth?

return errorCode;
}

public String getTxnErrorMsg() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getTxnErrorMsg is not used at all. Where the error code translates to error message? How the users will know which code means what?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I will remove this function. The mapping of errorCode to TxnErrorMsg is done here -
https://github.com/apache/hive/pull/3908/files#diff-22cbee1c93bb4d0cedd41531cd2d8631483ec310a6727bbdb2d73f8c830b6d95R74-R81

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This link redirects to iteslf for some reason but I guess you wanted to point to line 1099 in TxnHandler. My main concern about the messages what I have also added as a general comment: The messages are not persisted and not shown to the users in any way (except in the logs).

/**
* The following class represents all the error messages that are handled for aborts.
*/
public enum TxnErrorMsg {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some messages are not real reasons, rather indicate only the type of the aborted txn. Please try to add a real reason as well, whenever possible.

Copy link
Contributor Author

@SourabhBadhya SourabhBadhya Jan 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these messages are written based on where abortTxn()/abortTxns() is called. This represents the totality of how transactions are aborted from HiveServer. Anything else has not come from Hiveserver but rather from other services which might use transactional features of Hive (I have tried to represent them here as NONE). These are the only reasons where abort is done from Hiveserver.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

@SourabhBadhya
Copy link
Contributor Author

SourabhBadhya commented Jan 4, 2023

@veghlaci05 Based on a earlier conversation for the same ticket - #3656 (comment)
It was decided that we must opt for logging instead of persisting it. I have edited the same in the Jira as well.

@veghlaci05
Copy link
Contributor

@veghlaci05 Based on a earlier conversation for the same ticket - #3656 (comment) It was decided that we must opt for logging instead of persisting it. I have edited the same in the Jira as well.

In this please ignore my comments regarding the presistence. While I still think persisting would be better, I have to admit it would make this task significantly more complicated. (As @deniskuzZ pointed out, aborted TXNs are cleaned up quickly).

Copy link
Contributor

@veghlaci05 veghlaci05 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pending tests

warriersruthi pushed a commit to warriersruthi/hive that referenced this pull request Jan 4, 2023
* Source Iceberg PR - Core: Remove deprecated APIs up to 0.13.0

* Revert "HIVE-25563: Iceberg table operations hang a long time if metadata is missing/corrupted (Adam Szita, reviewed by Marton Bod)" - applying instead  Hive: Limit number of retries when metadata file is missing (apache#3379)

This reverts commit 7b600fe.

* Source Iceberg PR - Hive: Limit number of retries when metadata file is missing (apache#3379)

* Source Iceberg PR - Hive: Fix RetryingMetaStoreClient for Hive 2.1 (apache#3403)

* Source Iceberg PR - Switch from new HashMap to Maps.newHashMap (apache#3648)

* Source Iceberg PR - Hive: HiveCatalog should remove HMS stats for certain engines based on config (apache#3652) - Use the Iceberg config property

* Source Iceberg PR - Core: If status check fails, commit should be unknown (apache#3717)

* Source Iceberg PR - Build: Add checkstyle rule for instantiating HashMap, HashSet, ArrayList (apache#3689)

* Source Iceberg PR - Test: Make sure to delete temp folders (apache#3790)

* Source Iceberg PR - API: Register existing tables in Iceberg HiveCatalog (apache#3851)

* Source Iceberg PR - Hive: Make Iceberg table filter optional in HiveCatalog (apache#3908)

* Source Iceberg PR - Core: Add reserved UUID Table Property and Expose in HMS. (apache#3914)

* Source Iceberg PR - Hive: Known exception should not become CommitStateUnknownException (apache#4261)

* Source Iceberg PR - Build: Add missing @OverRide annotations (apache#3654)
* @param errorCode Error code specifying the reason for abort.
* @throws TException
*/
void abortTxns(List<Long> txnids, long errorCode) throws TException;
Copy link
Member

@deniskuzZ deniskuzZ Jan 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should make this method generic and use Request object here as well, see

CompactionResponse compact2(CompactionRequest request)
void dropDatabase(DropDatabaseRequest req)

etc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented. Done.

} else if (isHiveReplTxn) {
errorMsg = TxnErrorMsg.ABORT_DEFAULT_REPL_TXN;
} else {
errorMsg = TxnErrorMsg.ABORT_ROLLBACK;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't you set it in the var declaration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assigned TxnErrorMsg.NONE as default. Done.

@RetrySemantics.Idempotent
public void abortTxns(AbortTxnsRequest rqst) throws MetaException {
List<Long> txnIds = rqst.getTxn_ids();
TxnErrorMsg txnErrorMsg = TxnErrorMsg.NONE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the above method (abortTxn(AbortTxnRequest rqst)), default value was ABORT_ROLLBACK and here it's NONE. Is this expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed default values in all places to TxnErrorMsg.NONE. Done.

if (txnids.isEmpty()) {
return 0;
}
LOG.info("Trying to abort txns due to : {}", errorMsg);
Copy link
Member

@deniskuzZ deniskuzZ Jan 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe debug here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if (MetastoreConf.getBoolVar(conf, MetastoreConf.ConfVars.METASTORE_ACIDMETRICS_EXT_ON)) {
Metrics.getOrCreateCounter(MetricsConstants.TOTAL_NUM_ABORTED_TXNS).inc(txnids.size());
}
LOG.info("Txns are aborted successfully due to : {}", errorMsg);
Copy link
Member

@deniskuzZ deniskuzZ Jan 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message is not informative: something aborted because of ...
Maybe print ids of aborted txns? Make sure we are not doing unnecessary logging. see

performTimeOuts()
 LOG.info("Aborted the following transactions due to timeout: {}", batchToAbort.toString());
LOG.info("Aborted {} transactions due to timeout", numTxnsAborted);

and

commitTxn()
String msg = "Aborting [" + JavaUtils.txnIdToString(txnid) + "," + commitId + "]" + " due to a write conflict on " + resource +
                          " committed by " + committedTxn + " " + rs.getString(7) + "/" + rs.getString(8);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For performTimeOuts()
Removed this log - LOG.info("Aborted the following transactions due to timeout: {}", batchToAbort.toString());

For commitTxn()
Changed this log to include only the committed transaction details and not the ones that are going to be aborted.

Added aborted transaction IDs in the generic log.

if (MetastoreConf.getBoolVar(conf, MetastoreConf.ConfVars.METASTORE_ACIDMETRICS_EXT_ON)) {
Metrics.getOrCreateCounter(MetricsConstants.TOTAL_NUM_ABORTED_TXNS).inc(txnids.size());
}
LOG.info("Txns are aborted successfully due to : {}", errorMsg);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we populate TXN_META_INFO with aborted reason as well?
see //todo: should make abortTxns() write something into TXNS.TXN_META_INFO about this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You had mentioned that we should avoid persisting due to the below reason -
#3656 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simhadri wanted to add a new column in TXNS table for that, but looks like we already have one TXN_META_INFO. In some situations that might be useful, but I'll leave it up to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the message contents in this column. Done.

errorMsg = TxnErrorMsg.ABORT_REPLAYED_REPL_TXN;
} else if (isHiveReplTxn) {
errorMsg = TxnErrorMsg.ABORT_DEFAULT_REPL_TXN;
} else if (rqst.isSetErrorCode() && rqst.getErrorCode() == TxnErrorMsg.ABORT_ROLLBACK.getErrorCode()) {
Copy link
Member

@deniskuzZ deniskuzZ Jan 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this? why not

if (rqst.isSetErrorCode() {
  errorMsg = TxnErrorMsg.getErrorMsg(rqst.getErrorCode())
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should isReplayedReplTxn override rqst.getErrorCode()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the condition to -
if (rqst.isSetErrorCode() { errorMsg = TxnErrorMsg.getErrorMsg(rqst.getErrorCode()) }

As far as isReplayedReplTxn is concerned. There was no such intention of keeping it first in if else block.
Please note that isReplayedReplTxn & isHiveReplTxn is set only when request is passed through replRollbackTxn() call in the MetastoreClient.

numTxnsAborted += batchToAbort.size();
//todo: add TXNS.COMMENT filed and set it to 'aborted by system due to timeout'
Collections.sort(batchToAbort);//easier to read logs
LOG.info("Aborted the following transactions due to timeout: {}", batchToAbort.toString());
Copy link
Member

@deniskuzZ deniskuzZ Jan 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

above sorting was only needed for the logging, once you removed it, it became redundant, maybe it should be moved to abort method

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 5790 prints the same message

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added sort() in the generic logs section.
Line 5790 - This log prints the number of aborted txns. Included this info (number of aborted txns) in the generic logs as well.

if (MetastoreConf.getBoolVar(conf, MetastoreConf.ConfVars.METASTORE_ACIDMETRICS_EXT_ON)) {
Metrics.getOrCreateCounter(MetricsConstants.TOTAL_NUM_ABORTED_TXNS).inc(txnids.size());
}
LOG.warn("Txns are aborted successfully due to : {}. Aborted Transaction IDs : {}", errorMsg, txnids);
Copy link
Member

@deniskuzZ deniskuzZ Jan 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOG.info("Aborted transactions {} due to {}", txnids, errorMsg);

errorMsg should be just text, like "heartbeat time-out"
either update the msg in TxnErrorMsg or use it as a pattern for the string formatter

ABORT_ROLLBACK(50005, "<<<Txn aborted due to>>> rollback"),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used this format of logging as suggested -
LOG.info("Aborted transactions {} due to {}", txnids, errorMsg);

Updated the error msgs to make it concise in TxnErrorMsg.

if (txnids.isEmpty()) {
return 0;
}
LOG.debug("Trying to abort txns due to : {}. Aborted Transaction IDs : {}", errorMsg, txnids);
Copy link
Member

@deniskuzZ deniskuzZ Jan 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOG.debug("Trying to abort txns {} due to {}", txnids, errorMsg)
see above comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented as suggested.

Copy link
Member

@deniskuzZ deniskuzZ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pending tests
Please check that "Aborting {} transaction(s) {} due to {}", txnids.size(), txnids, errorMsg
prints the list properly.

@SourabhBadhya
Copy link
Contributor Author

SourabhBadhya commented Jan 10, 2023

LGTM, pending tests Please check that "Aborting {} transaction(s) {} due to {}", txnids.size(), txnids, errorMsg prints the list properly.

Just checked this. It displays in this way -
Aborted 2 transaction(s) [1234, 5678] due to abort by query command

@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 12 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

@deniskuzZ deniskuzZ merged commit 3cba64f into apache:master Jan 10, 2023
yeahyung pushed a commit to yeahyung/hive that referenced this pull request Jul 20, 2023
…dhya, reviewed by Denys Kuzmenko, Laszlo Vegh)

Closes apache#3908
tarak271 pushed a commit to tarak271/hive-1 that referenced this pull request Dec 19, 2023
…dhya, reviewed by Denys Kuzmenko, Laszlo Vegh)

Closes apache#3908
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants