Merge Health Model work into ci_feature behind a feature flag Pending perf testing #246
Merged
102 commits
6d7199f
changes
3048d13
changes
7bd5eac
changes
6f62c6c
changes
109ee9d
changes
8d585fd
changes
f6b1e02
changes
c19eb1d
changes
fd3cfb5
changes
8e16e32
changes
9453d70
changes
4514780
chg
fc8a1ed
changes for kubelet health
0e28a4e
changes
627de84
changes to include message
4efb7ac
Merge ci_feature into node-health-perf
r-dilip 14ab446
First iteration of health monitor signals
r-dilip b8dcc52
Fixed Bugs for NotifyInstantly Monitor
r-dilip 1a2d8ce
Health and Input plugins, logs cleaned up
r-dilip e0c431e
Hooking up input and filter plugins to out_oms_api plugin
r-dilip 654c7c9
1. Tag Changes 2. Adding Health Monitor Configuration 3. Added Agent …
r-dilip c5e739f
Merge branch 'ci_feature' into dilipr/node-health-perf
r-dilip b60ee71
Fix Base Container.data, include kube-system containers, fix input pl…
r-dilip 99b7fe1
More fixes to config, process kube-system
r-dilip fd5bbf6
Adding in_kube_health
r-dilip 8b9fd53
Merging after pulling
r-dilip e667406
Send Node_name parameter to reduceSignal for node level monitors
r-dilip 31a3931
Fix Typo in method invocation
r-dilip c380a5e
1. Added pod_status monitor (unused), 2. Removed processing for conta…
r-dilip c61a3e6
Merge branch 'dilipr/kubeHealth' of https://github.com/Microsoft/Dock…
r-dilip b68572f
Fix issue when pods are created since last kube api
r-dilip b80366e
Remove duplicate plugin entry from container.conf
r-dilip bac932d
Merging after fixing conflicts from ci_feature
r-dilip c6d0fee
Updating Agent Version in fluent-bit config
r-dilip 42957df
Updating Agent Version
r-dilip 8ab3ee8
Fix Error when Pods dont have a controller
r-dilip a8837f8
Add Telemetry for plugin start
r-dilip f6e9c0e
Merge from ci_feature
r-dilip 147d688
Change getMonitorInstanceId method signature
r-dilip 80a5d36
Remove references to HealthMonitorRecord struct in code
r-dilip d7a71d5
Rake
a18eb83
Running Ruby tests
r-dilip c7a0a50
Merge branch 'dilipr/rubyTest' into dilipr/healthModelAggregation
r-dilip 5895951
Working Version for Health Model Builder on the agent
r-dilip 8892458
Calculate old and new states for Aggregate and Unit Monitors
r-dilip 81df39f
Remove Controller Name from labels and details, use Deployment/Daemon…
r-dilip deda155
Change label namespaces, remove ClusterName from records sent, send d…
r-dilip a57b0b4
Configuration Split for Monitors
r-dilip 9dbc7a8
working version for 2 pods before naming changes
r-dilip 0f9f5d4
Working Model Builder version after name changes, TODO: test on the a…
r-dilip 2f7be02
E2E working version for health model aggregation TODO: Missing Signal…
r-dilip 0f210f5
Change pod-aggregator to workload-name, remove node monitor hierarchy…
r-dilip b89b107
Refactor signal reduction logic
r-dilip adb8f94
Missing Pod signals/Node Signals send none or unknown based on the in…
r-dilip 876bb3c
serialization and deserialization of state
r-dilip 7c459c4
Working cadvisor_health_node filter
r-dilip 497c26a
working version E2E with state serialization and deserialization
r-dilip f3520fe
adding source, health config to base_container.data
r-dilip bc57eb2
Container conf changes, permissions for log files etc.
r-dilip ded1867
Merge branch 'ci_feature' into dilipr/refactorSignalReduction
r-dilip 88621c7
Reinstate run_interval that was removed accidentally
r-dilip 966a0b1
Remove single sample flip configs, fixed details.to_json bug, pass in…
r-dilip 5edf616
Remove unnecessary logging
r-dilip 09063ba
Fix Aggregation logic for 'percentage' agg algorithm monitors
r-dilip aba0d17
Scale up Scale down bugs fixed, sending none signal on first occurenc…
r-dilip 24b0479
Enable state initialization, fix bug where records are always sent th…
r-dilip d3d267a
Fix percentage agg algorithm state calculation
r-dilip d0f4a7b
Fix the bug where if signal is unknown state, its state is not update…
r-dilip 990f70c
fix compute percentage bug when value is in warning state
r-dilip 275fcf3
Update state_transition_time to current time whenever state change ha…
r-dilip 2901e99
Update missing signal state to be the instance state for correct rollup
r-dilip bd7cf0a
1. Remove some unnecessary logging
r-dilip 23fa7a2
Removing calls to kube api since they are not required as of now. Wil…
r-dilip 1697f40
Send telemetry for cluster level state changes
r-dilip ec65d49
Testing Rake
r-dilip d0a62d3
First Round of Tests
r-dilip 2e50407
added integration tests for aks and aks-engine
r-dilip 8af3554
committing missing renamed file
r-dilip 02bce13
Fix base_Container.data
r-dilip 0d4ae84
Added test_helpers.rb
r-dilip 60384df
Fix ruby 1.9 issue where __dir__is not recognized
r-dilip ce8c748
moving some methods into health_monitor_helpers, so that unit tests c…
r-dilip c70cfe7
Changed references to health_monitor_helpers
r-dilip 603ab25
Fixing ruby incompatibility errors
r-dilip 338b752
Dont load health_monitor_utils
r-dilip dd8dfef
Dumm commit to force pull
r-dilip c161fc1
remove non existent file from base_container.data, update Makefile
r-dilip 142a5a5
Updated tomlparser.rb to handle agent_settings for health_model
r-dilip d415f07
Fixing merge conflicts from ci_feature
r-dilip d9f2e4e
Toggle health plugins based on Feature flag
r-dilip 7b09fcf
Added health_monitor_helpers, and fixed log
r-dilip fa5e31d
Send start telemetry only if health model is enabled
r-dilip 0cf6870
PRfeedback
r-dilip 5d92eee
Renamed offending file name that was causing ruby to fail loading
r-dilip 69e9aac
change name in base_container
r-dilip 6917ea2
Remove non existent file
r-dilip ea46649
Add health_monitor_helpers
r-dilip 28f1ccb
Merge branch 'dilipr/testInfra' into dilipr/mergeHealthToCiFeature
r-dilip 0cd2b80
Use CRD for state persistence (#248)
r-dilip 25ed658
Merge branch 'dilipr/mergeHealthToCiFeature' of https://github.com/Mi…
r-dilip 56bb430
Fixing merge conflict from ci_feature
r-dilip 8d14026
Dummy update
r-dilip
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| require 'rake/testtask' | ||
| | ||
| task default: "test" | ||
| | ||
| Rake::TestTask.new do |task| | ||
| task.libs << "test" | ||
| task.pattern = './test/code/plugin/health/*_spec.rb' | ||
| task.warning = false | ||
| end |
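
The `task.pattern` glob above decides which files the test task loads. The following is a minimal sketch of how that pattern selects files, using `File.fnmatch` with `File::FNM_PATHNAME` as a stand-in for the glob semantics (so `*` does not cross directory separators). The candidate file names are hypothetical and are not taken from this pull request.

```ruby
# Pattern copied from the Rakefile in this diff.
pattern = './test/code/plugin/health/*_spec.rb'

# Hypothetical paths, used only to illustrate what the glob would and would not match.
candidates = [
  './test/code/plugin/health/aggregate_monitor_spec.rb', # matches: ends in _spec.rb
  './test/code/plugin/health/test_helpers.rb',           # no match: not a *_spec.rb file
  './test/code/plugin/health/nested/unit_monitor_spec.rb', # no match: "*" does not cross "/"
  './test/code/plugin/filter_spec.rb'                    # no match: different directory
]

# FNM_PATHNAME keeps "*" from matching "/", mirroring how a file glob behaves.
matched = candidates.select do |path|
  File.fnmatch(pattern, path, File::FNM_PATHNAME)
end

puts matched
```

Under these assumptions only spec files directly inside `test/code/plugin/health` are picked up, so helper files in that directory need to be required explicitly by the specs.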
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,248 @@ | ||
| [ | ||
| { | ||
| "monitor_id": "user_workload_pods_ready", | ||
| "parent_monitor_id": "user_workload", | ||
| "labels": [ | ||
| "container.azm.ms/namespace", | ||
| "container.azm.ms/workload-name", | ||
| "container.azm.ms/workload-kind", | ||
| "container.azm.ms/cluster-region", | ||
| "container.azm.ms/cluster-subscription-id", | ||
| "container.azm.ms/cluster-resource-group", | ||
| "container.azm.ms/cluster-name" | ||
| ] | ||
| }, | ||
| { | ||
| "monitor_id": "user_workload", | ||
| "parent_monitor_id": "namespace", | ||
| "labels": [ | ||
| "container.azm.ms/namespace", | ||
| "container.azm.ms/cluster-region", | ||
| "container.azm.ms/cluster-subscription-id", | ||
| "container.azm.ms/cluster-resource-group", | ||
| "container.azm.ms/cluster-name" | ||
| ] | ||
| }, | ||
| { | ||
| "monitor_id": "system_workload_pods_ready", | ||
| "parent_monitor_id": "system_workload", | ||
| "labels": [ | ||
| "container.azm.ms/namespace", | ||
| "container.azm.ms/workload-name", | ||
| "container.azm.ms/workload-kind", | ||
| "container.azm.ms/cluster-region", | ||
| "container.azm.ms/cluster-subscription-id", | ||
| "container.azm.ms/cluster-resource-group", | ||
| "container.azm.ms/cluster-name" | ||
| ] | ||
| }, | ||
| { | ||
| "monitor_id": "system_workload", | ||
| "parent_monitor_id": "k8s_infrastructure", | ||
| "labels": [ | ||
| "container.azm.ms/cluster-region", | ||
| "container.azm.ms/cluster-subscription-id", | ||
| "container.azm.ms/cluster-resource-group", | ||
| "container.azm.ms/cluster-name" | ||
| ] | ||
| }, | ||
| { | ||
| "monitor_id": "kube_api_status", | ||
| "parent_monitor_id": "k8s_infrastructure", | ||
| "labels": [ | ||
| "container.azm.ms/cluster-region", | ||
| "container.azm.ms/cluster-subscription-id", | ||
| "container.azm.ms/cluster-resource-group", | ||
| "container.azm.ms/cluster-name" | ||
| ] | ||
| }, | ||
| { | ||
| "monitor_id": "namespace", | ||
| "labels": [ | ||
| "container.azm.ms/namespace", | ||
| "container.azm.ms/cluster-region", | ||
| "container.azm.ms/cluster-subscription-id", | ||
| "container.azm.ms/cluster-resource-group", | ||
| "container.azm.ms/cluster-name" | ||
| ], | ||
| "parent_monitor_id": "all_namespaces" | ||
| }, | ||
| { | ||
| "monitor_id": "k8s_infrastructure", | ||
| "parent_monitor_id": "cluster", | ||
| "labels": [ | ||
| "container.azm.ms/cluster-region", | ||
| "container.azm.ms/cluster-subscription-id", | ||
| "container.azm.ms/cluster-resource-group", | ||
| "container.azm.ms/cluster-name" | ||
| ] | ||
| }, | ||
| { | ||
| "monitor_id": "all_namespaces", | ||
| "parent_monitor_id": "all_workloads", | ||
| "labels": [ | ||
| "container.azm.ms/cluster-region", | ||
| "container.azm.ms/cluster-subscription-id", | ||
| "container.azm.ms/cluster-resource-group", | ||
| "container.azm.ms/cluster-name" | ||
| ] | ||
| }, | ||
| { | ||
| "monitor_id": "all_workloads", | ||
| "parent_monitor_id": "cluster", | ||
| "labels": [ | ||
| "container.azm.ms/cluster-region", | ||
| "container.azm.ms/cluster-subscription-id", | ||
| "container.azm.ms/cluster-resource-group", | ||
| "container.azm.ms/cluster-name" | ||
| ] | ||
| }, | ||
| { | ||
| "monitor_id": "node_cpu_utilization", | ||
| "parent_monitor_id": "node", | ||
| "labels": [ | ||
| "kubernetes.io/hostname", | ||
| "agentpool", | ||
| "kubernetes.io/role", | ||
| "container.azm.ms/cluster-region", | ||
| "container.azm.ms/cluster-subscription-id", | ||
| "container.azm.ms/cluster-resource-group", | ||
| "container.azm.ms/cluster-name" | ||
| ] | ||
| }, | ||
| { | ||
| "monitor_id": "node_memory_utilization", | ||
| "parent_monitor_id": "node", | ||
| "labels": [ | ||
| "kubernetes.io/hostname", | ||
| "agentpool", | ||
| "kubernetes.io/role", | ||
| "container.azm.ms/cluster-region", | ||
| "container.azm.ms/cluster-subscription-id", | ||
| "container.azm.ms/cluster-resource-group", | ||
| "container.azm.ms/cluster-name" | ||
| ] | ||
| }, | ||
| { | ||
| "monitor_id": "node_condition", | ||
| "parent_monitor_id": "node", | ||
| "labels": [ | ||
| "kubernetes.io/hostname", | ||
| "agentpool", | ||
| "kubernetes.io/role", | ||
| "container.azm.ms/cluster-region", | ||
| "container.azm.ms/cluster-subscription-id", | ||
| "container.azm.ms/cluster-resource-group", | ||
| "container.azm.ms/cluster-name" | ||
| ] | ||
| }, | ||
| { | ||
| "monitor_id": "node", | ||
| "aggregation_algorithm": "worstOf", | ||
| "labels": [ | ||
| "kubernetes.io/hostname", | ||
| "agentpool", | ||
| "kubernetes.io/role", | ||
| "container.azm.ms/cluster-region", | ||
| "container.azm.ms/cluster-subscription-id", | ||
| "container.azm.ms/cluster-resource-group", | ||
| "container.azm.ms/cluster-name" | ||
| ], | ||
| "parent_monitor_id": [ | ||
| { | ||
| "label": "kubernetes.io/role", | ||
| "operator": "==", | ||
| "value": "master", | ||
| "id": "master_node_pool" | ||
| }, | ||
| { | ||
| "label": "kubernetes.io/role", | ||
| "operator": "==", | ||
| "value": "agent", | ||
| "id": "agent_node_pool" | ||
| } | ||
| ] | ||
| }, | ||
| { | ||
| "monitor_id": "master_node_pool", | ||
| "aggregation_algorithm": "percentage", | ||
| "aggregation_algorithm_params": { | ||
| "critical_threshold": 80.0, | ||
| "warning_threshold": 90.0 | ||
| }, | ||
| "parent_monitor_id": "all_nodes", | ||
| "labels": [ | ||
| "container.azm.ms/cluster-region", | ||
| "container.azm.ms/cluster-subscription-id", | ||
| "container.azm.ms/cluster-resource-group", | ||
| "container.azm.ms/cluster-name" | ||
| ] | ||
| }, | ||
| { | ||
| "monitor_id": "agent_node_pool", | ||
| "aggregation_algorithm": "percentage", | ||
| "aggregation_algorithm_params": { | ||
| "state_threshold": 80.0 | ||
| }, | ||
| "labels": [ | ||
| "agentpool", | ||
| "container.azm.ms/cluster-region", | ||
| "container.azm.ms/cluster-subscription-id", | ||
| "container.azm.ms/cluster-resource-group", | ||
| "container.azm.ms/cluster-name" | ||
| ], | ||
| "parent_monitor_id": "all_nodes" | ||
| }, | ||
| { | ||
| "monitor_id": "all_nodes", | ||
| "aggregation_algorithm": "worstOf", | ||
| "parent_monitor_id": "cluster", | ||
| "labels": [ | ||
| "container.azm.ms/cluster-region", | ||
| "container.azm.ms/cluster-subscription-id", | ||
| "container.azm.ms/cluster-resource-group", | ||
| "container.azm.ms/cluster-name" | ||
| ] | ||
| }, | ||
| { | ||
| "monitor_id": "cluster", | ||
| "aggregation_algorithm": "worstOf", | ||
| "parent_monitor_id": null, | ||
| "labels": [ | ||
| "container.azm.ms/cluster-region", | ||
| "container.azm.ms/cluster-subscription-id", | ||
| "container.azm.ms/cluster-resource-group", | ||
| "container.azm.ms/cluster-name" | ||
| ] | ||
| }, | ||
| { | ||
| "monitor_id": "subscribed_capacity_cpu", | ||
| "parent_monitor_id": "capacity", | ||
| "labels": [ | ||
| "container.azm.ms/cluster-region", | ||
| "container.azm.ms/cluster-subscription-id", | ||
| "container.azm.ms/cluster-resource-group", | ||
| "container.azm.ms/cluster-name" | ||
| ] | ||
| }, | ||
| { | ||
| "monitor_id": "subscribed_capacity_memory", | ||
| "parent_monitor_id": "capacity", | ||
| "labels": [ | ||
| "container.azm.ms/cluster-region", | ||
| "container.azm.ms/cluster-subscription-id", | ||
| "container.azm.ms/cluster-resource-group", | ||
| "container.azm.ms/cluster-name" | ||
| ] | ||
| }, | ||
| { | ||
| "monitor_id": "capacity", | ||
| "parent_monitor_id": "all_workloads", | ||
| "labels": [ | ||
| "container.azm.ms/cluster-region", | ||
| "container.azm.ms/cluster-subscription-id", | ||
| "container.azm.ms/cluster-resource-group", | ||
| "container.azm.ms/cluster-name" | ||
| ] | ||
| } | ||
| ] | ||
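
In the definition above, `parent_monitor_id` takes three shapes: a string, `null` for the root (`cluster`), and, for `node`, an array of label conditions that each name a parent `id`. The sketch below shows one way such a definition could be walked from a monitor up to the root. The helper names `resolve_parent_id` and `path_to_root` are hypothetical and this is not the implementation from this pull request; only the `==` operator, the only one that appears in the definition, is handled.

```ruby
# Hypothetical helper: pick the parent monitor id for one monitor instance.
# `definition` is one entry of the definition array (as a Hash);
# `labels` holds the label values of the instance being rolled up.
def resolve_parent_id(definition, labels)
  parent = definition['parent_monitor_id']
  return parent unless parent.is_a?(Array) # plain String, or nil for the root

  # Conditional parent: the first condition whose label value matches wins.
  match = parent.find do |condition|
    condition['operator'] == '==' && labels[condition['label']] == condition['value']
  end
  match && match['id']
end

# Hypothetical helper: follow parents until the root (nil parent) is reached.
def path_to_root(monitor_id, definitions, labels)
  path = []
  while monitor_id
    path << monitor_id
    monitor_id = resolve_parent_id(definitions.fetch(monitor_id), labels)
  end
  path
end

# Excerpt of the definition above, reduced to the fields this sketch needs.
entries = [
  { 'monitor_id' => 'node',
    'parent_monitor_id' => [
      { 'label' => 'kubernetes.io/role', 'operator' => '==', 'value' => 'master', 'id' => 'master_node_pool' },
      { 'label' => 'kubernetes.io/role', 'operator' => '==', 'value' => 'agent', 'id' => 'agent_node_pool' }
    ] },
  { 'monitor_id' => 'master_node_pool', 'parent_monitor_id' => 'all_nodes' },
  { 'monitor_id' => 'agent_node_pool', 'parent_monitor_id' => 'all_nodes' },
  { 'monitor_id' => 'all_nodes', 'parent_monitor_id' => 'cluster' },
  { 'monitor_id' => 'cluster', 'parent_monitor_id' => nil }
]
definitions = entries.each_with_object({}) { |entry, index| index[entry['monitor_id']] = entry }

agent_path = path_to_root('node', definitions, 'kubernetes.io/role' => 'agent')
puts agent_path.join(' -> ')
```

For a node labelled `kubernetes.io/role=agent`, the walk goes through `agent_node_pool` and `all_nodes` before ending at `cluster`. A node whose role matches neither condition gets no parent in this sketch; how the real plugin treats that case is not shown in the diff.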
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,31 @@ | ||
| { | ||
| "node_cpu_utilization": { | ||
| "WarnThresholdPercentage": 80.0, | ||
| "FailThresholdPercentage": 90.0, | ||
| "ConsecutiveSamplesForStateTransition": 3 | ||
| }, | ||
| "node_memory_utilization": { | ||
| "WarnThresholdPercentage": 80.0, | ||
| "FailThresholdPercentage": 90.0, | ||
| "ConsecutiveSamplesForStateTransition": 3 | ||
| }, | ||
| "container_cpu_utilization": { | ||
| "WarnThresholdPercentage": 80.0, | ||
| "FailThresholdPercentage": 90.0, | ||
| "ConsecutiveSamplesForStateTransition": 3 | ||
| }, | ||
| "container_memory_utilization": { | ||
| "WarnThresholdPercentage": 80.0, | ||
| "FailThresholdPercentage": 90.0, | ||
| "ConsecutiveSamplesForStateTransition": 3 | ||
| }, | ||
| "user_workload_pods_ready": { | ||
| "WarnThresholdPercentage": 0.0, | ||
| "FailThresholdPercentage": 10.0, | ||
| "ConsecutiveSamplesForStateTransition": 2 | ||
| }, | ||
| "system_workload_pods_ready": { | ||
| "FailThresholdPercentage": 0.0, | ||
| "ConsecutiveSamplesForStateTransition": 2 | ||
| } | ||
| } |
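
To make the utilization settings above concrete, here is a minimal sketch of how `WarnThresholdPercentage`, `FailThresholdPercentage` and `ConsecutiveSamplesForStateTransition` might interact. It rests on several assumptions that the diff does not confirm: the state names `pass`, `warn` and `fail`, inclusive (`>=`) comparisons, and the rule that a state only changes after the configured number of consecutive samples agree on the new state. The helpers `classify` and `observe` are hypothetical. The `*_pods_ready` monitors likely use different semantics and are not covered.

```ruby
# Values copied from the node_cpu_utilization entry above.
cfg = {
  'WarnThresholdPercentage' => 80.0,
  'FailThresholdPercentage' => 90.0,
  'ConsecutiveSamplesForStateTransition' => 3
}

# Hypothetical: map a single utilization sample to a state (assumed names).
def classify(percent, cfg)
  return 'fail' if percent >= cfg['FailThresholdPercentage']
  return 'warn' if percent >= cfg['WarnThresholdPercentage']
  'pass'
end

# Hypothetical: fold one sample into a tracker hash and return the new tracker.
# The state flips only after N consecutive samples agree on a different state.
def observe(tracker, percent, cfg)
  sample = classify(percent, cfg)
  # A sample that agrees with the current state resets any pending transition.
  return { state: tracker[:state], candidate: nil, streak: 0 } if sample == tracker[:state]

  streak = sample == tracker[:candidate] ? tracker[:streak] + 1 : 1
  if streak >= cfg['ConsecutiveSamplesForStateTransition']
    { state: sample, candidate: nil, streak: 0 }
  else
    { state: tracker[:state], candidate: sample, streak: streak }
  end
end

tracker = { state: 'pass', candidate: nil, streak: 0 }
[95.0, 95.0, 95.0].each { |value| tracker = observe(tracker, value, cfg) }
puts tracker[:state]
```

With these assumptions, two samples at 95% leave the monitor in `pass` and the third moves it to `fail`, which is the kind of flap damping a consecutive-sample setting is normally meant to provide.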