Merged
…into tvaron3/readtimeout
Fixed the timeout logic
Fixed the timeout retry policy
…e-sdk-for-python into users/fabianm/tests
…into users/fabianm/tests
…into users/fabianm/tests
Member
Author
/azp run python - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
jeet1995
reviewed
May 13, 2025
sdk/cosmos/azure-cosmos/azure/cosmos/aio/_retry_utility_async.py
jeet1995
reviewed
May 13, 2025
sdk/cosmos/azure-cosmos/azure/cosmos/_global_partition_endpoint_manager_circuit_breaker.py
jeet1995
reviewed
May 13, 2025
simorenoh
requested changes
May 23, 2025
Member
simorenoh
left a comment
two small comments, great work!!
sdk/cosmos/azure-cosmos/azure/cosmos/_global_partition_endpoint_manager_circuit_breaker.py
…into tvaron3/ppcb # Conflicts: # sdk/cosmos/azure-cosmos/CHANGELOG.md # sdk/cosmos/azure-cosmos/azure/cosmos/_global_endpoint_manager.py
…into tvaron3/ppcb # Conflicts: # sdk/cosmos/azure-cosmos/azure/cosmos/_cosmos_client_connection.py # sdk/cosmos/azure-cosmos/azure/cosmos/_request_object.py # sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py # sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py # sdk/cosmos/azure-cosmos/azure/cosmos/aio/_container.py # sdk/cosmos/azure-cosmos/azure/cosmos/aio/_cosmos_client_connection_async.py # sdk/cosmos/azure-cosmos/tests/test_excluded_locations.py # sdk/cosmos/azure-cosmos/tests/test_excluded_locations_async.py # sdk/cosmos/azure-cosmos/tests/test_location_cache.py
sdk/cosmos/azure-cosmos/azure/cosmos/_container_recreate_retry_policy.py
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py
…into tvaron3/ppcb # Conflicts: # sdk/cosmos/azure-cosmos/CHANGELOG.md # sdk/cosmos/azure-cosmos/azure/cosmos/_utils.py
simorenoh
approved these changes
May 29, 2025
client_timeout = kwargs.get('timeout')
start_time = time.time()
if request_params.healthy_tentative_location:
    read_timeout = connection_policy.RecoveryReadTimeout
Contributor
This timeout can be overridden by connection_policy.DBAReadTimeout in line 99. Would this be okay?
request_options["maxIntegratedCacheStaleness"] = max_integrated_cache_staleness_in_ms
if self.container_link in self.__get_client_container_caches():
    request_options["containerRID"] = self.__get_client_container_caches()[self.container_link]["_rid"]
await self._get_properties_with_options(request_options)
Contributor
Nit: Since the _get_properties_with_options method returns the container properties, which is self.client_connection._container_properties_cache[self.container_link], we can use the return value to get the containerRID in line 365.
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py
Problem
Some issues are hard to diagnose from the client side as either transient or terminal availability issues. These could be network issues, partition upgrades, partition migrations, etc. For these issues, the SDK would retry the requests in another region, but would never mark the region as unavailable unless the failures were also seen in the SDK health check.
Goal
The per-partition circuit breaker is meant to lower the granularity of a failover to the partition level for 408 and 5xx status codes. The SDK should now not only fail over the requests but also mark the partition as unavailable. This should prevent future requests from being tried on the affected partition for a period of time.
Solution
Scope
Per partition circuit breaker is applicable for
New Request Flow
flowchart TD
    A[Operation] --> B[Obtain effective partition key range from partition key or Obtain partition key range id]
    B --> C[Use partition key range cache to determine partition key range]
    C --> G[Check if request can be marked as healthy tentative if necessary time has passed]
    G --> D[Use GlobalPartitionEndpointManagerForCircuitBreaker to extract Unavailable regions for partition key range]
    D --> E[Add them to effective excluded regions for operation]
    E --> F[Let GlobalEndpointManager determine next region]
New State
Partitions will now have 4 health states tracked by a new class, PartitionHealthTracker. The failure rate and the number of consecutive failures will be tracked for each partition. These statistics, including the number of successes and failures, will be tracked for one minute and then reset for a partition. Once the partition reaches one of the thresholds, it will be marked as unavailable. Requests will not be routed to partitions marked as Unhealthy or Unhealthy Tentative for a region. The unavailable regions will be appended to the excluded locations supplied by the user.
Healthy: This status indicates that the partition has seen only successful requests. A partition that is not tracked by PartitionHealthTracker is considered to be in Healthy status.
Unhealthy Tentative: This status indicates that a partition reached one of the thresholds. It will be unavailable for 1 minute until it is confirmed to be Unhealthy or Healthy.
Unhealthy: A partition is put in this state after the SDK tried to recover the partition and failed. Requests will not go to a partition in this state.
Healthy Tentative: A request gets marked healthy tentative to check whether a partition is healthy again. This request will have a request timeout of 6 seconds and will not be retried. Only one request should be marked healthy tentative when it is time to recover.
Service Request Errors are not tracked by the circuit breaker, and their behavior stays the same: there are three in-region retries, and then the region gets marked as unavailable.
Client Timeout Errors are not tracked by the circuit breaker because this error is only raised right before a retry, so the relevant errors are already being tracked.
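The state transitions and the request flow described above can be sketched roughly as follows. This is a minimal illustration, not the SDK's implementation: PartitionHealthSketch, Health, and effective_excluded_regions are hypothetical names, the same 60-second wait is assumed before a probe for both Unhealthy Tentative and Unhealthy, and it is assumed that only the single probing request bypasses the exclusion while a probe is in flight.

```python
import time
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    UNHEALTHY_TENTATIVE = "unhealthy_tentative"
    UNHEALTHY = "unhealthy"
    HEALTHY_TENTATIVE = "healthy_tentative"


class PartitionHealthSketch:
    """Hypothetical tracker keyed by (partition key range id, region)."""

    def __init__(self, recovery_interval_s=60.0, clock=time.monotonic):
        self._clock = clock
        # Assumption: one wait interval is used before every recovery probe.
        self._recovery_interval_s = recovery_interval_s
        self._entries = {}  # (pk_range_id, region) -> [Health, timestamp]

    def state(self, pk_range_id, region):
        # A partition that is not tracked is considered Healthy.
        entry = self._entries.get((pk_range_id, region))
        return entry[0] if entry else Health.HEALTHY

    def mark_threshold_reached(self, pk_range_id, region):
        self._entries[(pk_range_id, region)] = [Health.UNHEALTHY_TENTATIVE, self._clock()]

    def try_start_probe(self, pk_range_id, region):
        """Return True only for the single request allowed to probe."""
        entry = self._entries.get((pk_range_id, region))
        if entry is None or entry[0] is Health.HEALTHY_TENTATIVE:
            return False  # nothing to recover, or a probe is already in flight
        if self._clock() - entry[1] < self._recovery_interval_s:
            return False  # not enough time has passed yet
        entry[0] = Health.HEALTHY_TENTATIVE
        return True

    def record_probe_result(self, pk_range_id, region, succeeded):
        key = (pk_range_id, region)
        entry = self._entries.get(key)
        if entry is None or entry[0] is not Health.HEALTHY_TENTATIVE:
            return
        if succeeded:
            del self._entries[key]  # back to Healthy (untracked)
        else:
            self._entries[key] = [Health.UNHEALTHY, self._clock()]

    def unavailable_regions(self, pk_range_id):
        # Assumption: every tracked entry stays excluded for ordinary requests,
        # including while a probe is in flight.
        return [region for (pk, region) in self._entries if pk == pk_range_id]


def effective_excluded_regions(user_excluded, tracker, pk_range_id):
    """Append the partition's unavailable regions to the user's excluded locations."""
    merged = list(user_excluded)
    for region in tracker.unavailable_regions(pk_range_id):
        if region not in merged:
            merged.append(region)
    return merged
```

A caller would compute effective_excluded_regions for each operation before region selection, and call try_start_probe first to decide whether the current request is the one probe that skips the exclusion.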
New Environment Variables
"AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER": Default will be false.
"AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_READ": Default will be 10 errors.
"AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_WRITE": Default will be 5 errors.
"AZURE_COSMOS_FAILURE_PERCENTAGE_TOLERATED": Default will be 90 percent.
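As a rough sketch of how these settings could be read and applied, the snippet below loads the four variables with the defaults listed above and checks the two thresholds. The variable names and defaults come from this description; CircuitBreakerConfig, load_config, and threshold_reached are hypothetical names, and the boolean parsing and the strict "greater than" comparisons are assumptions rather than the SDK's confirmed behavior.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CircuitBreakerConfig:
    enabled: bool = False
    read_errors_tolerated: int = 10
    write_errors_tolerated: int = 5
    failure_percentage_tolerated: int = 90


def load_config(environ=os.environ):
    """Read the circuit breaker settings, falling back to the listed defaults."""
    # Assumption: only the string "true" (any casing) enables the feature.
    raw_enabled = environ.get("AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER", "false")
    return CircuitBreakerConfig(
        enabled=raw_enabled.strip().lower() == "true",
        read_errors_tolerated=int(
            environ.get("AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_READ", "10")),
        write_errors_tolerated=int(
            environ.get("AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_WRITE", "5")),
        failure_percentage_tolerated=int(
            environ.get("AZURE_COSMOS_FAILURE_PERCENTAGE_TOLERATED", "90")),
    )


def threshold_reached(config, consecutive_failures, failures, successes, is_write):
    """Return True once either threshold is exceeded for a partition."""
    limit = config.write_errors_tolerated if is_write else config.read_errors_tolerated
    # Assumption: "tolerated" means the breaker trips only above the limit.
    if consecutive_failures > limit:
        return True
    total = failures + successes
    if total == 0:
        return False
    # Note: this sketch applies no minimum sample size to the failure rate.
    return failures * 100 / total > config.failure_percentage_tolerated
```

The counts passed to threshold_reached would be the per-partition statistics from the current one-minute window.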
Other potential configs to expose
Other Implementations and Differences
Azure/azure-sdk-for-java#39265
Azure/azure-cosmos-dotnet-v3#5023
Other Changes
Follow up work
Relevant Issue
#39687