Extend TransactionalBulkWriter with Additional Write Strategies#48422

Open
dibahlfi wants to merge 23 commits into main from users/dibahl/transactional-bulk-support
Conversation

@dibahlfi dibahlfi commented Mar 16, 2026

Until now, TransactionalBulkWriter in the Cosmos DB Spark Connector only worked with a single write strategy — ItemOverwrite. Every document was sent as an upsert, and if the batch failed, there was nothing nuanced to reconstruct. That simplicity broke down the moment we needed deletes, conditional writes, or patches inside a transactional batch.
This PR extends TransactionalBulkWriter to support the full set of write strategies.

A Cosmos transactional batch is all-or-nothing: if any single operation in the batch fails, the entire batch is rolled back and none of the operations take effect. But the failure of one operation doesn't always mean the batch is broken — sometimes it just means the document's state changed between when we read it and when we tried to write it. A document we tried to create already exists, or a document we tried to delete is already gone. The remaining operations in the batch are still perfectly valid.

Reconstruction is how the writer deals with this. When a batch fails because of one operation's conflict with reality, the writer rebuilds the batch — swapping the offending operation for something harmless (like a read) or dropping it entirely — and resubmits. The rest of the operations get their chance to execute. Without reconstruction, the entire batch would fail permanently even though only one operation was the problem.

Multi-strategy transactional batches:

Each write strategy now maps to a specific Cosmos transactional batch operation and defines its own reconstruction behavior when a batch partially fails:

| Strategy | Batch operation | What happens on failure |
| --- | --- | --- |
| ItemOverwrite | upsertItemOperation | No reconstruction needed |
| ItemAppend | createItemOperation | 409 Conflict -> reconstruct as Read (item already exists) |
| ItemDelete | deleteItemOperation | 404/0 Not Found -> reconstruct as Remove (item already gone) |
| ItemDeleteIfNotModified | deleteItemOperation (with If-Match when an ETag is present) | 404/0 or 412 Precondition Failed -> reconstruct as Remove |
| ItemOverwriteIfNotModified | replaceItemOperation with If-Match (or createItemOperation when no ETag) | 409 or 412 -> Read; 404/0 -> Remove |
| ItemPatch | patchItemOperation | No reconstruction needed |
| ItemPatchIfExists | patchItemOperation | 404/0 -> Remove |

ItemBulkUpdate is not supported in transactional mode and is now rejected at configuration time (see below).
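The strategy-to-operation mapping and the per-status reconstruction rules can be modeled as small lookup tables. The sketch below is illustrative Python, not the connector's Scala code; the names `BATCH_OPERATION`, `RECONSTRUCTION_RULES`, and `reconstruction_action` are hypothetical:

```python
# Illustrative model of the table above; keys are write-strategy names,
# values are the Cosmos batch operation each strategy is sent as.
BATCH_OPERATION = {
    "ItemOverwrite": "upsertItemOperation",
    "ItemAppend": "createItemOperation",
    "ItemDelete": "deleteItemOperation",
    "ItemDeleteIfNotModified": "deleteItemOperation",      # If-Match when an ETag is present
    "ItemOverwriteIfNotModified": "replaceItemOperation",  # createItemOperation when no ETag
    "ItemPatch": "patchItemOperation",
    "ItemPatchIfExists": "patchItemOperation",
}

# (status, sub_status) -> reconstruction action per strategy.
# "read" swaps the failing operation for a read; "remove" drops it from the batch.
RECONSTRUCTION_RULES = {
    "ItemAppend": {(409, 0): "read"},
    "ItemDelete": {(404, 0): "remove"},
    "ItemDeleteIfNotModified": {(404, 0): "remove", (412, 0): "remove"},
    "ItemOverwriteIfNotModified": {(409, 0): "read", (412, 0): "read", (404, 0): "remove"},
    "ItemPatchIfExists": {(404, 0): "remove"},
}

def reconstruction_action(strategy, status, sub_status=0):
    """Return the reconstruction action for a failed operation, or None."""
    return RECONSTRUCTION_RULES.get(strategy, {}).get((status, sub_status))
```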

Strategy examples:
ItemOverwrite — Upsert (fire-and-forget)
Batch contains: [upsert A, upsert B, upsert C]. If the batch fails with a transient error (e.g., 408 or 503), the entire batch is retried as-is — no reconstruction is needed because upsert is idempotent.

ItemAppend — Create, tolerate existing items
Batch contains: [create A, create B, create C]. Document B was already created by a prior attempt or an external process. The batch fails because the operation on B returns 409 Conflict. The other operations return 424 Failed Dependency (not attempted). Reconstruction changes B's operation from createItemOperation to readItemOperation, and the batch is resubmitted as [create A, read B, create C]. The read is a harmless no-op that keeps the batch structurally valid so A and C can be created.

ItemDelete — Delete, tolerate missing items
Batch contains: [delete A, delete B]. Document A has already been deleted. The batch fails because the operation on A returns 404/0 Not Found. Reconstruction removes A from the batch entirely, and the batch is resubmitted as [delete B]. If the reconstructed batch is empty (all items were already gone), it is treated as a trivial success.

ItemDeleteIfNotModified — Conditional delete with ETag
Batch contains: [delete A (ETag: "e1"), delete B (ETag: "e2")]. Document A was modified since we read it — the delete returns 412 Precondition Failed. Reconstruction removes A from the batch (we intentionally skip the delete since the precondition was not met), and the batch is resubmitted as [delete B (ETag: "e2")]. A 404/0 on a conditional delete is handled identically — the item is already gone.

ItemOverwriteIfNotModified — Conditional replace / create hybrid
Items with an ETag are sent as replaceItemOperation with If-Match; items without an ETag are sent as createItemOperation. Batch: [replace A (ETag: "e1"), create B].
• If A returns 412 Precondition Failed (modified since read) -> reconstruct A as Read.
• If B returns 409 Conflict (created externally) -> reconstruct B as Read.
• If A returns 404/0 (deleted between read and write) -> reconstruct A as Remove.
Resubmitted batch after a 412 on A: [read A, create B].
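The replace-or-create choice for this hybrid strategy depends only on whether the item carries an ETag. A minimal sketch (illustrative Python; the tuple shape and helper name are hypothetical, and `_etag` follows Cosmos DB's system property name):

```python
def overwrite_if_not_modified_operation(item):
    """Pick the batch operation for ItemOverwriteIfNotModified (sketch).

    Items carrying an ETag become conditional replaces with If-Match;
    items without one become creates.
    """
    etag = item.get("_etag")
    if etag is not None:
        return ("replace", item["id"], {"If-Match": etag})
    return ("create", item["id"], {})
```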

ItemPatch — Partial update (fire-and-forget)
Batch contains: [patch A (set /color = "red"), patch B (increment /count)]. No reconstruction is needed — if the batch fails with a transient error it is retried as-is.

ItemPatchIfExists — Patch only if the document exists
Batch contains: [patch A (set /color = "red"), patch B (set /color = "blue")]. Document A doesn't exist — the operation returns 404/0. Reconstruction removes A from the batch (missing documents are a no-op success), and the batch is resubmitted as [patch B (set /color = "blue")].
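Across all the strategies above, reconstruction amounts to rebuilding the operation list around the one failing index. A minimal Python sketch of that behavior (illustrative only; the real writer operates on SDK operation objects, and `reconstruct_batch` is a hypothetical helper):

```python
def reconstruct_batch(operations, failing_index, action):
    """Rebuild a failed batch: swap the offending op for a read, or drop it."""
    rebuilt = []
    for i, op in enumerate(operations):
        if i != failing_index:
            rebuilt.append(op)
        elif action == "read":
            # Keep the batch structurally valid with a harmless no-op.
            rebuilt.append(("read", op[1]))
        # action == "remove": drop the operation entirely
    return rebuilt

batch = [("create", "A"), ("create", "B"), ("create", "C")]
# B returned 409 Conflict -> reconstruct as Read
assert reconstruct_batch(batch, 1, "read") == [("create", "A"), ("read", "B"), ("create", "C")]
# ItemDelete: the only item was already gone -> an empty batch is a trivial success
assert reconstruct_batch([("delete", "A")], 0, "remove") == []
```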

Transient errors vs. semantic errors:

Not every batch failure triggers reconstruction. The writer distinguishes between two categories of errors:
Semantic errors (reconstruction-eligible)
These indicate a logical conflict between the operation and the current state of the document. The batch cannot succeed by simply retrying — the offending operation must be modified.

| Status code | Meaning | Strategies that reconstruct |
| --- | --- | --- |
| 409 Conflict | Item already exists | ItemAppend, ItemOverwriteIfNotModified (create path) |
| 404/0 Not Found | Item does not exist | ItemDelete, ItemDeleteIfNotModified, ItemOverwriteIfNotModified (replace path), ItemPatchIfExists |
| 412 Precondition Failed | ETag mismatch (item was modified) | ItemDeleteIfNotModified, ItemOverwriteIfNotModified |

Reconstruction does not consume a retry attempt — it fixes the batch shape and resubmits immediately.

Transient errors (retry-eligible)
These indicate an infrastructure or throttling issue that may resolve on its own. The batch is resubmitted unchanged (with its current reconstruction state preserved):

| Status code | Meaning |
| --- | --- |
| 408 Request Timeout | Request took too long — retried with a random back-off delay |
| 410 Gone | Partition has moved — SDK refreshes routing and retries |
| 500 Internal Server Error | Transient backend failure |
| 503 Service Unavailable | Backend overloaded or temporarily down |
| 404/1002 | Partition key range gone (split in progress) — routing refreshes and retries |
| 0 (status code zero) | Gateway mode PoolAcquirePendingLimitException — connection pool exhausted |

For ItemOverwrite only, 404/0 is also treated as transient (rare race condition with TTL expiration).
Transient retries do consume the retry counter (attemptNumber). If transient retries exhaust maxRetryCount, the writer throws BulkOperationFailedException. Errors that are neither transient nor reconstruction-eligible — for example, a 400 Bad Request caused by a malformed document — fail immediately on the first attempt without retrying.
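The classification described above can be sketched as follows (illustrative Python working on bare status/sub-status pairs; the real writer inspects SDK exception and result objects, and the function names are hypothetical):

```python
# Status/sub-status pairs the writer treats as transient, per the table above.
TRANSIENT = {(408, 0), (410, 0), (500, 0), (503, 0), (404, 1002), (0, 0)}

def is_transient(status, sub_status, strategy):
    """True if the failure may resolve on its own and the batch can be retried as-is."""
    if (status, sub_status) in TRANSIENT:
        return True
    # ItemOverwrite only: 404/0 is also treated as transient (TTL expiration race).
    return strategy == "ItemOverwrite" and (status, sub_status) == (404, 0)

def should_fail_fast(status, sub_status, strategy, reconstruction_eligible):
    """Neither transient nor reconstruction-eligible (e.g. 400 Bad Request):
    fail immediately on the first attempt without retrying."""
    return not is_transient(status, sub_status, strategy) and not reconstruction_eligible
```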

Decision flow on batch failure

```mermaid
flowchart TD
    A[Batch fails]
    A --> B{Per item results available}

    B -->|Yes| C[Find first non 424 result]
    C --> D{Reconstruction eligible?}
    D -->|Yes| E[Reconstruct & resubmit - no retry consumed]
    D -->|No| H[Fall through to transient retry]

    B -->|No| F{Exception only with no per item results}
    F -->|Yes| G{Fallback reconstruction possible?}
    G -->|Yes| E
    G -->|No| H

    H --> I{Transient and retries remaining?}
    I -->|Yes| J[Resubmit same batch increment attempt]
    I -->|No| K[BulkOperationFailedException]
```

A batch can go through both paths across retries. For example: first attempt hits a 409 (reconstructed), second attempt of the reconstructed batch hits a 503 (transient retry), third attempt succeeds.

How reconstruction works:
When a transactional batch fails, the writer needs to figure out which operation caused the failure and what to do about it before resubmitting.
There are two paths depending on what the SDK returns:
Path A — Per-item results available (preferred). The CosmosBatchResponse includes individual status codes for each operation. The writer finds the first non-success result (skipping 424 Failed Dependency responses, which are just downstream casualties), maps its status code and strategy to a reconstruction action (Read or Remove), and rebuilds the batch. This path is deterministic and safe.
Path B — Exception-only fallback. In some SDK code paths, the batch returns only a CosmosException with no per-item breakdown. Here the writer has to infer which operation(s) could have caused that status code, factoring in each operation's strategy and whether it carried an ETag. This is inherently less precise — and that's where the safety fix below comes in.
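Path A's "first non-success result, skipping 424" selection can be sketched as (illustrative Python over bare status codes; the real code walks per-operation results from CosmosBatchResponse):

```python
def first_failure(statuses):
    """Find the first per-item result that actually failed, skipping
    424 Failed Dependency responses (downstream casualties)."""
    for index, status in enumerate(statuses):
        if status >= 400 and status != 424:
            return (index, status)
    return None

# The 424s before and after the real failure are ignored.
assert first_failure([424, 409, 424]) == (1, 409)
assert first_failure([200, 201]) is None
```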

Handling ambiguity in the exception-only fallback:
In the exception-only fallback path (Path B above), the writer knows the status code but not which operation caused it. When multiple operations in the batch could plausibly have produced that status code, the writer has to decide how to proceed.

Consider a batch with two delete operations on the same partition key, both carrying ETags:

op0 = delete document A (ETag: "abc")
op1 = delete document B (ETag: "xyz")
The batch fails with 404/0 (Not Found), but the SDK only gives us the exception — no per-item results. Both op0 and op1 are valid candidates for that status code.

If the writer guessed wrong and reconstructed the wrong operation as Remove (meaning: "this item is already gone, drop it from the batch"), the other operation — the one that actually caused the failure — would silently disappear from all future retries. That's data loss by omission.

To prevent this, the writer distinguishes between destructive and non-destructive reconstruction actions:
Remove (destructive): When multiple candidates match and the reconstruction action is Remove, the writer skips reconstruction entirely and lets the normal retry/failure policy handle it. This avoids the risk of dropping the wrong operation.
Read (non-destructive): When multiple candidates match and the reconstruction action is Read, the writer uses deterministic tie-breaking (picks the first candidate). Guessing wrong here is harmless — it just adds an extra read to the batch without losing any operation.
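The destructive vs. non-destructive rule above can be sketched as (illustrative; `fallback_decision` and its argument shapes are hypothetical):

```python
def fallback_decision(candidate_indices, action):
    """Path B sketch: with no per-item results, only reconstruct when it is safe.

    - Unique candidate: reconstruct it.
    - Multiple candidates, destructive Remove: return None (skip reconstruction
      and let the normal retry/failure policy handle it).
    - Multiple candidates, non-destructive Read: deterministic tie-break on the
      first candidate; a wrong guess only adds a harmless extra read.
    """
    if len(candidate_indices) == 1:
        return candidate_indices[0]
    if action == "remove":
        return None  # never guess when guessing wrong could drop an operation
    return candidate_indices[0]

assert fallback_decision([0, 1], "remove") is None  # ambiguous delete: don't guess
assert fallback_decision([0, 1], "read") == 0       # harmless extra read
assert fallback_decision([2], "remove") == 2        # unambiguous: safe either way
```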

@dibahlfi dibahlfi requested review from a team and kirankumarkolli as code owners March 16, 2026 01:31
Copilot AI review requested due to automatic review settings March 16, 2026 01:31

Copilot AI left a comment


Pull request overview

This PR extends the Cosmos Spark connector’s transactional bulk support by adding more write strategies beyond pure upsert, and introduces a marker-document mechanism to disambiguate ambiguous retry outcomes for transactional batches. It also adjusts patch immutability checks and expands test coverage (unit + integration/e2e).

Changes:

  • Add transactional support for additional write strategies (append/create, delete, conditional delete/replace, patch) and enhance retry/ignore handling.
  • Introduce per-batch marker documents (with TTL + best-effort cleanup) to verify commit vs rollback on ambiguous retries.
  • Update/expand integration and E2E tests to cover the newly supported strategies and marker behavior.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/TransactionalBulkWriter.scala | Adds multi-strategy transactional batch construction, marker-based commit verification, recovery changes, and retry/ignore logic updates. |
| sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/CosmosWriterBase.scala | Passes PartitionKeyDefinition into TransactionalBulkWriter for marker construction/patch behavior. |
| sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/CosmosPatchHelper.scala | Changes PK-path immutability check to use exact match rather than substring match. |
| sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/CosmosConfig.scala | Adds config for marker TTL and relaxes transactional write-strategy validation. |
| sdk/cosmos/azure-cosmos-spark_3/src/test/scala/com/azure/cosmos/spark/TransactionalBatchITest.scala | Updates integration tests to assert transactional acceptance for ItemAppend/Delete/OverwriteIfNotModified. |
| sdk/cosmos/azure-cosmos-spark_3/src/test/scala/com/azure/cosmos/spark/SparkE2ETransactionalBulkWriterITest.scala | Adds E2E transactional tests per strategy + atomicity/error cases + marker cleanup verification. |
| sdk/cosmos/azure-cosmos-spark_3/src/test/scala/com/azure/cosmos/spark/TransactionalBulkWriterSpec.scala | Adds a large unit-test-style spec covering strategy mapping, retry/ignore patterns, marker patterns, and PK keying behavior assumptions. |

Comment on lines +488 to +491

```scala
log.logError(s"Partition key value '$partitionKeyString' has already been scheduled in this writer instance. " +
  s"This indicates a bug in the data distribution or ordering pipeline. " +
  s"Atomicity guarantee may be violated for this partition key value. " +
  s"Context: ${operationContext.toString} $getThreadInfo")
```
Member Author

fixed

@dibahlfi

@sdkReviewAgent

@xinlian12

sdkReviewAgent | Status: ⏳ Queued

Review requested by @dibahlfi. I'll start shortly.

@xinlian12

sdkReviewAgent | Status: 🔍 Reviewing

I'm reviewing this PR now. I'll post my findings as comments when done.

@xinlian12

@sdkReviewAgent

1 similar comment
@xinlian12

@sdkReviewAgent

@xinlian12

sdkReviewAgent | Status: ⏳ Queued

Review requested by @xinlian12. I'll start shortly.

@xinlian12

sdkReviewAgent | Status: 🔍 Reviewing

I'm reviewing this PR now. I'll post my findings as comments when done.

@xinlian12

@sdkReviewAgent-2


xinlian12 commented Mar 20, 2026

The current design requires creating & deleting a marker document in the user's container. From a design perspective, I think this is a big NO: we should not create & delete items from the SDK in a user's prod container. This could cause a lot of issues, e.g. an RU spike; the marker id could conflict with an existing real item's id (which means we would accidentally delete documents); and an unknown-format document could also crash the customer's application.


dibahlfi commented Apr 3, 2026

@sdkReviewAgent-2

@Azure Azure deleted a comment from dibahlfi Apr 3, 2026

dibahlfi commented Apr 6, 2026

@sdkReviewAgent-2

3 similar comments

dibahlfi commented Apr 6, 2026

@sdkReviewAgent-2


dibahlfi commented Apr 7, 2026

@sdkReviewAgent-2

@xinlian12

@sdkReviewAgent-2

Comment on lines +1437 to +1451

```scala
  return Some(candidateIndices.head -> reconstructionAction.get)
}

// Deterministic tie-breakers for exception-only paths when there is no per-item
// batch response available. This preserves forward progress while keeping strategy
// semantics stable in practice for transactional E2E recovery scenarios.
val selectedIndexOpt = itemWriteStrategy match {
  case ItemWriteStrategy.ItemPatchIfExists =>
    Some(candidateIndices.last)

  case ItemWriteStrategy.ItemAppend |
       ItemWriteStrategy.ItemDelete |
       ItemWriteStrategy.ItemDeleteIfNotModified |
       ItemWriteStrategy.ItemOverwriteIfNotModified =>
    Some(candidateIndices.head)
```
Member

🟡 Fallback Reconstruction Tie-Breaking — deterministic .head may select wrong item

When exception-only responses occur with multiple candidates, getFallbackReconstructionDecision uses .head / .last as tie-breakers (lines 1437, 1451). If the wrong item is selected, it gets reconstructed instead of the actual failing item.

This wastes a retry cycle but doesn't corrupt data (bounded by batch retry count). Consider returning None for ambiguous Remove reconstructions to be more conservative — skip the reconstruction rather than guess.

🤖 This comment was generated by an AI code review agent.

Member Author

fixed


```scala
// Centralizes strategy-specific "original operation" construction so initial-batch creation and
// reconstructed-batch creation stay behaviorally identical for all supported strategies.
private def addOriginalOperationToBatch(
```
Member

🟡 ItemBulkUpdate Configuration Gap

The PR removed the assertion blocking non-ItemOverwrite strategies, but addOriginalOperationToBatch still throws for ItemBulkUpdate (this method doesn't have a case for it). This means ItemBulkUpdate will fail at first batch write rather than at config time.

Consider adding config-time validation to fail fast with a clear error message instead of a runtime MatchError.

🤖 This comment was generated by an AI code review agent.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

```scala
// Marker Cleanup Verification
// =====================================================

"transactional write marker cleanup" should "not leave marker documents after successful write" in {
```
Member

🟡 Stale Marker Test — PR description documents removed feature

~60% of the PR description documents a "batch marker document" mechanism that was removed in commit b2e9a9731d8. This test still exists but passes trivially since no markers are created by the current implementation.

Please update the PR description to match the actual implementation, and either remove this test or update it to reflect current behavior.

🤖 This comment was generated by an AI code review agent.

Member Author

fixed


dibahlfi commented Apr 7, 2026

@sdkReviewAgent-2

@xinlian12

@sdkReviewAgent-2

there are no new comments generated from the review agent


dibahlfi commented Apr 8, 2026

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
