MLE-28401: Add missing guard for logging retry message#591
MLE-28401: Add missing guard for logging retry message#591yunzvanessa wants to merge 294 commits intomainfrom
Conversation
…rnal Entity Polaris medium severity issues (#577) * MLE-27554 MLE-27556: Fix XXE Injection - XML External Entity related to XMLInputFactory * MLE-27553 MLE-27555: Fix XXE Injection - XML External Entity related to TransformerFactory * MLE-27553 MLE-27554 MLE-27555 MLE-27556: Split XXE-prevention try/catch blocks and improve warning messages
…ontentReader.java
MLE-27557: Fix weak crypto - replace insecure TLSv1 with TLSv1.3 in ContentReader.java
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ss spurious wakeup vulnerability Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
MLE-26865, MLE-26867: Fixes for issues identified by Polaris
…om-develop PDP-1182: Remove per-repo pr-workflow.yaml
There was a problem hiding this comment.
Pull request overview
This PR addresses misleading “Retrying committing batch” debug logging by adding a commitRetry > 0 guard, and it also includes a broad set of additional robustness/security/concurrency updates across mlcp and the MapReduce connector.
Changes:
- Guard retry-specific debug logging so it only appears on actual retry attempts.
- Add defensive checks (null/empty config values, filesystem edge cases, safer iteration over server responses) and harden XML processing against XXE.
- Refactor local multithreaded execution coordination (wait/notify →
CountDownLatch) and bump project/dependency versions (plus CI/workflow/config updates).
Reviewed changes
Copilot reviewed 33 out of 34 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/main/java/com/marklogic/tree/ExpandedTree.java | Avoid repeated string computation in trace logging path (minor refactor). |
| src/main/java/com/marklogic/mapreduce/utilities/InternalUtilities.java | More defensive handling of INPUT_HOST configuration. |
| src/main/java/com/marklogic/mapreduce/test/FCheck.java | Guard against File.listFiles() returning null. |
| src/main/java/com/marklogic/mapreduce/examples/HelloWorld.java | Null-safe handling of document element when extracting text. |
| src/main/java/com/marklogic/mapreduce/examples/ContentReader.java | Update SSL example defaults (protocols/ciphers) and validate output path. |
| src/main/java/com/marklogic/mapreduce/JSONDocument.java | Configure TransformerFactory to restrict external DTD/stylesheet access (XXE hardening). |
| src/main/java/com/marklogic/mapreduce/ForestReader.java | Null-check parent directory path before building related file paths. |
| src/main/java/com/marklogic/mapreduce/DOMDocument.java | Configure TransformerFactory to restrict external DTD/stylesheet access (XXE hardening) and add null guards. |
| src/main/java/com/marklogic/mapreduce/ContentOutputFormat.java | Validate output hosts and add safer result iteration (hasNext() checks). |
| src/main/java/com/marklogic/dom/NodeImpl.java | Prevent NPE when parent node is null; make isEqualNode local-name compare null-safe. |
| src/main/java/com/marklogic/dom/DocumentImpl.java | Null-safe namespace lookup/prefix lookup when document element is missing. |
| src/main/java/com/marklogic/dom/CommentImpl.java | Null-safe namespace lookup delegation when parent is missing. |
| src/main/java/com/marklogic/dom/CharacterDataImpl.java | Null-safe length/substring handling for character data. |
| src/main/java/com/marklogic/dom/AttrImpl.java | Null-safe namespace lookup delegation when owner element is missing. |
| src/main/java/com/marklogic/contentpump/utilities/PermissionUtil.java | Validate OUTPUT_HOST presence before creating output ContentSource. |
| src/main/java/com/marklogic/contentpump/utilities/JSONDocBuilder.java | Guard against null exception messages when filtering “missing value”. |
| src/main/java/com/marklogic/contentpump/TransformWriter.java | Null-safe handling of exception message when detecting “Module Not Found”. |
| src/main/java/com/marklogic/contentpump/TransformOutputFormat.java | Validate OUTPUT_HOST presence before querying mimetypes. |
| src/main/java/com/marklogic/contentpump/ThreadManager.java | Replace wait/notify coordination with latch; validate output hosts; minor control-flow guard. |
| src/main/java/com/marklogic/contentpump/SingleDocumentWriter.java | Null-check parent directory before mkdirs(). |
| src/main/java/com/marklogic/contentpump/RDFReader.java | Safer permissions initialization; validate hosts; add result iteration guards. |
| src/main/java/com/marklogic/contentpump/OutputArchive.java | Null-check parent directory before mkdirs(). |
| src/main/java/com/marklogic/contentpump/MultithreadedMapper.java | Introduce latch-based signaling when runners are created. |
| src/main/java/com/marklogic/contentpump/LocalJobRunner.java | Wire latch into MultithreadedMapper tasks; ensure latch is released on failure. |
| src/main/java/com/marklogic/contentpump/InputType.java | Validate output hosts; add ResultSequence hasNext() guards for role map query. |
| src/main/java/com/marklogic/contentpump/DatabaseTransformWriter.java | Add commitRetry > 0 guard to retry log message. |
| src/main/java/com/marklogic/contentpump/DatabaseContentReader.java | Add ResultSequence hasNext() guards; null-check W3C permission element conversion. |
| src/main/java/com/marklogic/contentpump/ContentPump.java | Guard against listFiles() returning null when building lib jar list. |
| src/main/java/com/marklogic/contentpump/CompressedAggXMLReader.java | Disable external entities/DTDs on XMLInputFactory (XXE hardening). |
| src/main/java/com/marklogic/contentpump/AggregateXMLReader.java | Disable external entities/DTDs on XMLInputFactory (XXE hardening). |
| pom.xml | Bump project version and Avro dependency version. |
| .github/workflows/pr-workflow.yaml | Remove JIRA ID validator workflow. |
| .copyrightconfig | Add copyright validation config and exclusions. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| String atomStr = atomString(textReps[index]); | ||
| LOG.trace("atom " + textReps[index] + " [" + | ||
| atomString(textReps[index]) + "] length " + | ||
| atomString(textReps[index]).length()); | ||
| atomStr + "] length " + | ||
| (atomStr != null ? atomStr.length() : "0")); | ||
| } | ||
| buf.append(atomString(textReps[index++])); | ||
| } |
There was a problem hiding this comment.
atomString(textReps[index]) is now computed into atomStr for logging, but buf.append(atomString(...)) still recomputes it. Reuse atomStr for the append too to avoid duplicate work and to ensure the trace output matches the actual appended value if atomString is non-trivial/non-deterministic.
| String[] hosts = conf.getStrings(INPUT_HOST); | ||
| if (hosts == null || hosts.length == 0) { | ||
| throw new IllegalArgumentException(INPUT_HOST + | ||
| " is not specified."); | ||
| } | ||
|
|
||
| return getInputContentSource(conf, host); | ||
| return getInputContentSource(conf, hosts[0]); | ||
| } |
There was a problem hiding this comment.
This change now checks for hosts == null || hosts.length == 0, but it no longer validates that hosts[0] is non-empty/blank. If INPUT_HOST is set to an empty string, getInputContentSource(conf, hosts[0]) will still be called and will likely fail later with a less clear error; consider restoring an explicit empty/blank check for hosts[0] here.
| } | ||
| String data = getData(); | ||
| if (data == null) { | ||
| throw new DOMException(DOMException.DOMSTRING_SIZE_ERR, null); |
There was a problem hiding this comment.
substringData now throws DOMSTRING_SIZE_ERR when getData() returns null, but getLength() treats null data as length 0. This makes behavior inconsistent and introduces a new exception path for callers; consider treating null data as an empty string here (or throwing a more appropriate DOMException like INVALID_STATE_ERR), consistent with getLength().
| throw new DOMException(DOMException.DOMSTRING_SIZE_ERR, null); | |
| data = ""; |
| if (LOG.isDebugEnabled()) { | ||
| LOG.debug(getFormattedBatchId() + | ||
| "Retrying committing batch, attempts: " + | ||
| commitRetry + "/" + MAX_RETRIES); |
There was a problem hiding this comment.
The retry log message prints commitRetry + "/" + MAX_RETRIES, but the loop condition is commitRetry < commitRetryLimit. If commitRetryLimit is configurable and diverges from MAX_RETRIES, this debug message becomes misleading; consider logging against commitRetryLimit here to match the actual retry limit being enforced.
| commitRetry + "/" + MAX_RETRIES); | |
| commitRetry + "/" + commitRetryLimit); |
| @@ -17,7 +17,7 @@ | |||
| </licenses> | |||
|
|
|||
| <properties> | |||
| <versionString>12.0</versionString> | |||
| <versionString>12.1</versionString> | |||
There was a problem hiding this comment.
PR description/title focuses on adding a logging guard for retry messages, but this PR also bumps the project version (and includes many other changes across the codebase). If these additional changes are intentional, please reflect them in the PR description or consider splitting into separate PRs to keep scope focused and simplify review/release notes.
The "Retrying committing batch" debug message was logged on every batch, even the first attempt, producing misleading output that suggested retries were happening when they weren't. Added a
commitRetry > 0guard to match the existing pattern in the parent classTransformWriter.This issue surfaced when investigating and mlcp regression failure: https://progresssoftware.atlassian.net/browse/MLE-28401