Merged
102 commits
6d7199f
changes
Feb 9, 2019
3048d13
changes
Feb 9, 2019
7bd5eac
changes
Feb 9, 2019
6f62c6c
changes
Feb 9, 2019
109ee9d
changes
Feb 9, 2019
8d585fd
changes
Feb 9, 2019
f6b1e02
changes
Feb 12, 2019
c19eb1d
changes
Feb 12, 2019
fd3cfb5
changes
Feb 12, 2019
8e16e32
changes
Feb 12, 2019
9453d70
changes
Feb 15, 2019
4514780
chg
Feb 15, 2019
fc8a1ed
changes for kubelet health
Feb 16, 2019
0e28a4e
changes
Feb 16, 2019
627de84
changes to include message
Feb 20, 2019
4efb7ac
Merge ci_feature into node-health-perf
r-dilip Feb 26, 2019
14ab446
First iteration of health monitor signals
r-dilip Mar 8, 2019
b8dcc52
Fixed Bugs for NotifyInstantly Monitor
r-dilip Mar 9, 2019
1a2d8ce
Health and Input plugins, logs cleaned up
r-dilip Mar 9, 2019
e0c431e
Hooking up input and filter plugins to out_oms_api plugin
r-dilip Mar 11, 2019
654c7c9
1. Tag Changes 2. Adding Health Monitor Configuration 3. Added Agent …
r-dilip Mar 12, 2019
c5e739f
Merge branch 'ci_feature' into dilipr/node-health-perf
r-dilip Mar 12, 2019
b60ee71
Fix Base Container.data, include kube-system containers, fix input pl…
r-dilip Mar 12, 2019
99b7fe1
More fixes to config, process kube-system
r-dilip Mar 12, 2019
fd5bbf6
Adding in_kube_health
r-dilip Mar 12, 2019
8b9fd53
Merging after pulling
r-dilip Mar 12, 2019
e667406
Send Node_name parameter to reduceSignal for node level monitors
r-dilip Mar 12, 2019
31a3931
Fix Typo in method invocation
r-dilip Mar 14, 2019
c380a5e
1. Added pod_status monitor (unused), 2. Removed processing for conta…
r-dilip Mar 20, 2019
c61a3e6
Merge branch 'dilipr/kubeHealth' of https://github.com/Microsoft/Dock…
r-dilip Mar 20, 2019
b68572f
Fix issue when pods are created since last kube api
r-dilip Mar 29, 2019
b80366e
Remove duplicate plugin entry from container.conf
r-dilip Apr 11, 2019
bac932d
Merging after fixing conflicts from ci_feature
r-dilip Apr 11, 2019
c6d0fee
Updating Agent Version in fluent-bit config
r-dilip Apr 11, 2019
42957df
Updating Agent Version
r-dilip Apr 12, 2019
8ab3ee8
Fix Error when Pods dont have a controller
r-dilip Apr 17, 2019
a8837f8
Add Telemetry for plugin start
r-dilip Apr 26, 2019
f6e9c0e
Merge from ci_feature
r-dilip Apr 26, 2019
147d688
Change getMonitorInstanceId method signature
r-dilip Apr 30, 2019
80a5d36
Remove references to HealthMonitorRecord struct in code
r-dilip May 1, 2019
d7a71d5
Rake
May 7, 2019
a18eb83
Running Ruby tests
r-dilip May 7, 2019
c7a0a50
Merge branch 'dilipr/rubyTest' into dilipr/healthModelAggregation
r-dilip May 7, 2019
5895951
Working Version for Health Model Builder on the agent
r-dilip May 22, 2019
8892458
Calculate old and new states for Aggregate and Unit Monitors
r-dilip May 22, 2019
81df39f
Remove Controller Name from labels and details, use Deployment/Daemon…
r-dilip May 23, 2019
deda155
Change label namespaces, remove ClusterName from records sent, send d…
r-dilip May 29, 2019
a57b0b4
Configuration Split for Monitors
r-dilip May 29, 2019
9dbc7a8
working version for 2 pods before naming changes
r-dilip May 30, 2019
0f9f5d4
Working Model Builder version after name changes, TODO: test on the a…
r-dilip May 30, 2019
2f7be02
E2E working version for health model aggregation TODO: Missing Signal…
r-dilip May 30, 2019
0f210f5
Change pod-aggregator to workload-name, remove node monitor hierarchy…
r-dilip Jun 5, 2019
b89b107
Refactor signal reduction logic
r-dilip Jun 13, 2019
adb8f94
Missing Pod signals/Node Signals send none or unknown based on the in…
r-dilip Jun 14, 2019
876bb3c
serialization and deserialization of state
r-dilip Jun 14, 2019
7c459c4
Working cadvisor_health_node filter
r-dilip Jun 17, 2019
497c26a
working version E2E with state serialization and deserialization
r-dilip Jun 18, 2019
f3520fe
adding source, health config to base_container.data
r-dilip Jun 18, 2019
bc57eb2
Container conf changes, permissions for log files etc.
r-dilip Jun 18, 2019
ded1867
Merge branch 'ci_feature' into dilipr/refactorSignalReduction
r-dilip Jun 18, 2019
88621c7
Reinstate run_interval that was removed accidentally
r-dilip Jun 18, 2019
966a0b1
Remove single sample flip configs, fixed details.to_json bug, pass in…
r-dilip Jun 19, 2019
5edf616
Remove unnecessary logging
r-dilip Jun 19, 2019
09063ba
Fix Aggregation logic for 'percentage' agg algorithm monitors
r-dilip Jun 20, 2019
aba0d17
Scale up Scale down bugs fixed, sending none signal on first occurenc…
r-dilip Jun 21, 2019
24b0479
Enable state initialization, fix bug where records are always sent th…
r-dilip Jun 21, 2019
d3d267a
Fix percentage agg algorithm state calculation
r-dilip Jun 22, 2019
d0f4a7b
Fix the bug where if signal is unknown state, its state is not update…
r-dilip Jun 22, 2019
990f70c
fix compute percentage bug when value is in warning state
r-dilip Jun 22, 2019
275fcf3
Update state_transition_time to current time whenever state change ha…
r-dilip Jun 22, 2019
2901e99
Update missing signal state to be the instance state for correct rollup
r-dilip Jun 24, 2019
bd7cf0a
1. Remove some unnecessary logging
r-dilip Jun 25, 2019
23fa7a2
Removing calls to kube api since they are not required as of now. Wil…
r-dilip Jun 25, 2019
1697f40
Send telemetry for cluster level state changes
r-dilip Jun 27, 2019
ec65d49
Testing Rake
r-dilip Jul 8, 2019
d0a62d3
First Round of Tests
r-dilip Jul 17, 2019
2e50407
added integration tests for aks and aks-engine
r-dilip Jul 17, 2019
8af3554
committing missing renamed file
r-dilip Jul 17, 2019
02bce13
Fix base_Container.data
r-dilip Jul 17, 2019
0d4ae84
Added test_helpers.rb
r-dilip Jul 17, 2019
60384df
Fix ruby 1.9 issue where __dir__is not recognized
r-dilip Jul 17, 2019
ce8c748
moving some methods into health_monitor_helpers, so that unit tests c…
r-dilip Jul 17, 2019
c70cfe7
Changed references to health_monitor_helpers
r-dilip Jul 17, 2019
603ab25
Fixing ruby incompatibility errors
r-dilip Jul 17, 2019
338b752
Dont load health_monitor_utils
r-dilip Jul 17, 2019
dd8dfef
Dumm commit to force pull
r-dilip Jul 17, 2019
c161fc1
remove non existent file from base_container.data, update Makefile
r-dilip Jul 17, 2019
142a5a5
Updated tomlparser.rb to handle agent_settings for health_model
r-dilip Jul 30, 2019
d415f07
Fixing merge conflicts from ci_feature
r-dilip Jul 30, 2019
d9f2e4e
Toggle health plugins based on Feature flag
r-dilip Jul 30, 2019
7b09fcf
Added health_monitor_helpers, and fixed log
r-dilip Jul 30, 2019
fa5e31d
Send start telemetry only if health model is enabled
r-dilip Aug 1, 2019
0cf6870
PRfeedback
r-dilip Aug 7, 2019
5d92eee
Renamed offending file name that was causing ruby to fail loading
r-dilip Aug 7, 2019
69e9aac
change name in base_container
r-dilip Aug 7, 2019
6917ea2
Remove non existent file
r-dilip Aug 7, 2019
ea46649
Add health_monitor_helpers
r-dilip Aug 7, 2019
28f1ccb
Merge branch 'dilipr/testInfra' into dilipr/mergeHealthToCiFeature
r-dilip Aug 7, 2019
0cd2b80
Use CRD for state persistence (#248)
r-dilip Aug 13, 2019
25ed658
Merge branch 'dilipr/mergeHealthToCiFeature' of https://github.com/Mi…
r-dilip Aug 13, 2019
56bb430
Fixing merge conflict from ci_feature
r-dilip Aug 13, 2019
8d14026
Dummy update
r-dilip Aug 14, 2019
9 changes: 9 additions & 0 deletions Rakefile
@@ -0,0 +1,9 @@
require 'rake/testtask'

task default: "test"

Rake::TestTask.new do |task|
  task.libs << "test"
  task.pattern = './test/code/plugin/health/*_spec.rb'
  task.warning = false
end
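
The `task.pattern` glob above picks up any file ending in `_spec.rb` under `test/code/plugin/health/`. As a rough illustration of the kind of file that glob would match, here is a minimal Minitest sketch. The file name `worst_of_spec.rb`, the `worst_of` helper, and the state names and their ordering are all hypothetical and are not taken from this PR.

```ruby
# test/code/plugin/health/worst_of_spec.rb  (hypothetical file name)
require "minitest/autorun"

# Assumed severity order, least to most severe. The real state names used by
# the health plugins may differ.
STATE_SEVERITY = %w[none pass warn fail].freeze

# Hypothetical helper, defined inline so the sketch is self-contained.
# Returns the most severe state in the list; an empty list yields "none".
def worst_of(states)
  states.max_by { |s| STATE_SEVERITY.index(s) } || "none"
end

class WorstOfSpec < Minitest::Test
  def test_returns_most_severe_state
    assert_equal "fail", worst_of(%w[pass warn fail])
  end

  def test_empty_input_is_none
    assert_equal "none", worst_of([])
  end
end
```

With a file like this in place, `rake test` (or plain `rake`, since `default` points at `test`) should load and run it.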
13 changes: 11 additions & 2 deletions build/Makefile
@@ -91,9 +91,9 @@ CXXFLAGS = $(COMPILE_FLAGS)
# Build targets

ifeq ($(ULINUX),1)
-all : $(OMI_ROOT)/output $(SCXPAL_INTERMEDIATE_DIR) PROVIDER_STATUS $(PROVIDER_LIBRARY) KIT_STATUS kit fluentbitplugin
+all : $(OMI_ROOT)/output $(SCXPAL_INTERMEDIATE_DIR) PROVIDER_STATUS $(PROVIDER_LIBRARY) KIT_STATUS kit fluentbitplugin rubypluginstests
else
-all : $(OMI_ROOT)/output $(SCXPAL_INTERMEDIATE_DIR) PROVIDER_STATUS $(PROVIDER_LIBRARY) fluentbitplugin
+all : $(OMI_ROOT)/output $(SCXPAL_INTERMEDIATE_DIR) PROVIDER_STATUS $(PROVIDER_LIBRARY) fluentbitplugin rubypluginstests
endif

clean :
@@ -143,6 +143,15 @@ fluentbitplugin :
make -C $(GO_SOURCE_DIR) fbplugin
$(COPY) $(GO_SOURCE_DIR)/out_oms.so $(INTERMEDIATE_DIR)

+rubypluginstests :
+	@echo "========================= Installing pre-reqs for running tests"
+	sudo apt-add-repository ppa:brightbox/ruby-ng -y
+	sudo apt-get update
+	sudo apt-get install ruby2.4 rake -y
+	sudo gem install minitest
+	@echo "========================= Running tests..."
+	rake test
+
#--------------------------------------------------------------------------------
# PAL build
#
33 changes: 29 additions & 4 deletions installer/conf/container.conf
@@ -17,16 +17,22 @@

#cadvisor perf
<source>
-type cadvisorperf
-tag oms.api.cadvisorperf
-run_interval 60s
+type cadvisorperf
+tag oms.api.cadvisorperf
+run_interval 60s
+log_level debug
</source>

+<filter oms.api.KubeHealth.DaemonSet.Node**>
+type filter_cadvisor_health_node
+log_level debug
+</filter>
+

#custom_metrics_mdm filter plugin
<filter mdm.cadvisorperf**>
type filter_cadvisor2mdm
-custom_metrics_azure_regions eastus,southcentralus,westcentralus,westus2,southeastasia,northeurope,westEurope
+custom_metrics_azure_regions eastus,southcentralus,westcentralus,westus2,southeastasia,northeurope,westeurope
metrics_to_collect cpuUsageNanoCores,memoryWorkingSetBytes,memoryRssBytes
log_level info
</filter>
@@ -61,6 +67,25 @@
max_retry_wait 9m
</match>


+<match oms.api.KubeHealth.DaemonSet**>
+@type forward
+send_timeout 60s
+recover_wait 10s
+hard_timeout 60s
+heartbeat_type tcp
+
+<server>
+host healthmodel-replicaset-service.kube-system
+port 25227
+</server>
+
+<secondary>
+@type file
+path /var/opt/microsoft/docker-cimprov/log/fluent_forward_failed.log
+</secondary>
+</match>
+
<match mdm.cadvisorperf**>
type out_mdm
log_level debug
248 changes: 248 additions & 0 deletions installer/conf/health_model_definition.json
@@ -0,0 +1,248 @@
[
{
"monitor_id": "user_workload_pods_ready",
"parent_monitor_id": "user_workload",
"labels": [
"container.azm.ms/namespace",
"container.azm.ms/workload-name",
"container.azm.ms/workload-kind",
"container.azm.ms/cluster-region",
"container.azm.ms/cluster-subscription-id",
"container.azm.ms/cluster-resource-group",
"container.azm.ms/cluster-name"
]
},
{
"monitor_id": "user_workload",
"parent_monitor_id": "namespace",
"labels": [
"container.azm.ms/namespace",
"container.azm.ms/cluster-region",
"container.azm.ms/cluster-subscription-id",
"container.azm.ms/cluster-resource-group",
"container.azm.ms/cluster-name"
]
},
{
"monitor_id": "system_workload_pods_ready",
"parent_monitor_id": "system_workload",
"labels": [
"container.azm.ms/namespace",
"container.azm.ms/workload-name",
"container.azm.ms/workload-kind",
"container.azm.ms/cluster-region",
"container.azm.ms/cluster-subscription-id",
"container.azm.ms/cluster-resource-group",
"container.azm.ms/cluster-name"
]
},
{
"monitor_id": "system_workload",
"parent_monitor_id": "k8s_infrastructure",
"labels": [
"container.azm.ms/cluster-region",
"container.azm.ms/cluster-subscription-id",
"container.azm.ms/cluster-resource-group",
"container.azm.ms/cluster-name"
]
},
{
"monitor_id": "kube_api_status",
"parent_monitor_id": "k8s_infrastructure",
"labels": [
"container.azm.ms/cluster-region",
"container.azm.ms/cluster-subscription-id",
"container.azm.ms/cluster-resource-group",
"container.azm.ms/cluster-name"
]
},
{
"monitor_id": "namespace",
"labels": [
"container.azm.ms/namespace",
"container.azm.ms/cluster-region",
"container.azm.ms/cluster-subscription-id",
"container.azm.ms/cluster-resource-group",
"container.azm.ms/cluster-name"
],
"parent_monitor_id": "all_namespaces"
},
{
"monitor_id": "k8s_infrastructure",
"parent_monitor_id": "cluster",
"labels": [
"container.azm.ms/cluster-region",
"container.azm.ms/cluster-subscription-id",
"container.azm.ms/cluster-resource-group",
"container.azm.ms/cluster-name"
]
},
{
"monitor_id": "all_namespaces",
"parent_monitor_id": "all_workloads",
"labels": [
"container.azm.ms/cluster-region",
"container.azm.ms/cluster-subscription-id",
"container.azm.ms/cluster-resource-group",
"container.azm.ms/cluster-name"
]
},
{
"monitor_id": "all_workloads",
"parent_monitor_id": "cluster",
"labels": [
"container.azm.ms/cluster-region",
"container.azm.ms/cluster-subscription-id",
"container.azm.ms/cluster-resource-group",
"container.azm.ms/cluster-name"
]
},
{
"monitor_id": "node_cpu_utilization",
"parent_monitor_id": "node",
"labels": [
"kubernetes.io/hostname",
"agentpool",
"kubernetes.io/role",
"container.azm.ms/cluster-region",
"container.azm.ms/cluster-subscription-id",
"container.azm.ms/cluster-resource-group",
"container.azm.ms/cluster-name"
]
},
{
"monitor_id": "node_memory_utilization",
"parent_monitor_id": "node",
"labels": [
"kubernetes.io/hostname",
"agentpool",
"kubernetes.io/role",
"container.azm.ms/cluster-region",
"container.azm.ms/cluster-subscription-id",
"container.azm.ms/cluster-resource-group",
"container.azm.ms/cluster-name"
]
},
{
"monitor_id": "node_condition",
"parent_monitor_id": "node",
"labels": [
"kubernetes.io/hostname",
"agentpool",
"kubernetes.io/role",
"container.azm.ms/cluster-region",
"container.azm.ms/cluster-subscription-id",
"container.azm.ms/cluster-resource-group",
"container.azm.ms/cluster-name"
]
},
{
"monitor_id": "node",
"aggregation_algorithm": "worstOf",
"labels": [
"kubernetes.io/hostname",
"agentpool",
"kubernetes.io/role",
"container.azm.ms/cluster-region",
"container.azm.ms/cluster-subscription-id",
"container.azm.ms/cluster-resource-group",
"container.azm.ms/cluster-name"
],
"parent_monitor_id": [
{
"label": "kubernetes.io/role",
"operator": "==",
"value": "master",
"id": "master_node_pool"
},
{
"label": "kubernetes.io/role",
"operator": "==",
"value": "agent",
"id": "agent_node_pool"
}
]
},
{
"monitor_id": "master_node_pool",
"aggregation_algorithm": "percentage",
"aggregation_algorithm_params": {
"critical_threshold": 80.0,
"warning_threshold": 90.0
},
"parent_monitor_id": "all_nodes",
"labels": [
"container.azm.ms/cluster-region",
"container.azm.ms/cluster-subscription-id",
"container.azm.ms/cluster-resource-group",
"container.azm.ms/cluster-name"
]
},
{
"monitor_id": "agent_node_pool",
"aggregation_algorithm": "percentage",
"aggregation_algorithm_params": {
"state_threshold": 80.0
},
"labels": [
"agentpool",
"container.azm.ms/cluster-region",
"container.azm.ms/cluster-subscription-id",
"container.azm.ms/cluster-resource-group",
"container.azm.ms/cluster-name"
],
"parent_monitor_id": "all_nodes"
},
{
"monitor_id": "all_nodes",
"aggregation_algorithm": "worstOf",
"parent_monitor_id": "cluster",
"labels": [
"container.azm.ms/cluster-region",
"container.azm.ms/cluster-subscription-id",
"container.azm.ms/cluster-resource-group",
"container.azm.ms/cluster-name"
]
},
{
"monitor_id": "cluster",
"aggregation_algorithm": "worstOf",
"parent_monitor_id": null,
"labels": [
"container.azm.ms/cluster-region",
"container.azm.ms/cluster-subscription-id",
"container.azm.ms/cluster-resource-group",
"container.azm.ms/cluster-name"
]
},
{
"monitor_id": "subscribed_capacity_cpu",
"parent_monitor_id": "capacity",
"labels": [
"container.azm.ms/cluster-region",
"container.azm.ms/cluster-subscription-id",
"container.azm.ms/cluster-resource-group",
"container.azm.ms/cluster-name"
]
},
{
"monitor_id": "subscribed_capacity_memory",
"parent_monitor_id": "capacity",
"labels": [
"container.azm.ms/cluster-region",
"container.azm.ms/cluster-subscription-id",
"container.azm.ms/cluster-resource-group",
"container.azm.ms/cluster-name"
]
},
{
"monitor_id": "capacity",
"parent_monitor_id": "all_workloads",
"labels": [
"container.azm.ms/cluster-region",
"container.azm.ms/cluster-subscription-id",
"container.azm.ms/cluster-resource-group",
"container.azm.ms/cluster-name"
]
}
]
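
The definition file above describes a tree: most monitors name their parent with a plain string, the root (`cluster`) uses `null`, and `node` picks its parent from a list of label conditions. The following Ruby sketch shows one way such a file could be walked from a monitor up to the root. It is only an illustration under stated assumptions: `parent_id` and `path_to_root` are hypothetical names, the inline JSON is a trimmed copy of a few entries, and only the `==` operator is handled because it is the only one that appears in the file. The agent code in this PR may resolve parents differently.

```ruby
require "json"

# Trimmed inline copy of a few entries, in the same shape as the file above.
DEFINITION = JSON.parse(<<~DEFINITION_JSON)
  [
    { "monitor_id": "node",
      "parent_monitor_id": [
        { "label": "kubernetes.io/role", "operator": "==",
          "value": "master", "id": "master_node_pool" },
        { "label": "kubernetes.io/role", "operator": "==",
          "value": "agent", "id": "agent_node_pool" }
      ] },
    { "monitor_id": "master_node_pool", "parent_monitor_id": "all_nodes" },
    { "monitor_id": "agent_node_pool", "parent_monitor_id": "all_nodes" },
    { "monitor_id": "all_nodes", "parent_monitor_id": "cluster" },
    { "monitor_id": "cluster", "parent_monitor_id": null }
  ]
DEFINITION_JSON

# Index the entries by monitor_id for direct lookup.
BY_ID = DEFINITION.each_with_object({}) { |m, h| h[m["monitor_id"]] = m }

# Hypothetical helper: resolve the parent id of a monitor for one instance,
# given that instance's labels. Returns nil for the root or when no
# condition matches.
def parent_id(monitor_id, labels)
  parent = BY_ID.fetch(monitor_id)["parent_monitor_id"]
  return parent unless parent.is_a?(Array) # plain string or nil

  rule = parent.find do |r|
    r["operator"] == "==" && labels[r["label"]] == r["value"]
  end
  rule && rule["id"]
end

# Hypothetical helper: collect monitor ids from the given monitor to the root.
def path_to_root(monitor_id, labels)
  path = [monitor_id]
  while (next_id = parent_id(path.last, labels))
    path << next_id
  end
  path
end

p path_to_root("node", { "kubernetes.io/role" => "agent" })
# prints ["node", "agent_node_pool", "all_nodes", "cluster"]
```

Keeping the conditional parent as data (label, operator, value, id) means a new node pool kind could be routed by editing the JSON alone, which appears to be the intent of the array form.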
31 changes: 31 additions & 0 deletions installer/conf/healthmonitorconfig.json
@@ -0,0 +1,31 @@
{
"node_cpu_utilization": {
"WarnThresholdPercentage": 80.0,
"FailThresholdPercentage": 90.0,
"ConsecutiveSamplesForStateTransition": 3
},
"node_memory_utilization": {
"WarnThresholdPercentage": 80.0,
"FailThresholdPercentage": 90.0,
"ConsecutiveSamplesForStateTransition": 3
},
"container_cpu_utilization": {
"WarnThresholdPercentage": 80.0,
"FailThresholdPercentage": 90.0,
"ConsecutiveSamplesForStateTransition": 3
},
"container_memory_utilization": {
"WarnThresholdPercentage": 80.0,
"FailThresholdPercentage": 90.0,
"ConsecutiveSamplesForStateTransition": 3
},
"user_workload_pods_ready": {
"WarnThresholdPercentage": 0.0,
"FailThresholdPercentage": 10.0,
"ConsecutiveSamplesForStateTransition": 2
},
"system_workload_pods_ready": {
"FailThresholdPercentage": 0.0,
"ConsecutiveSamplesForStateTransition": 2
}
}
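
To make the three settings per monitor concrete, here is a small Ruby sketch of how a utilization monitor such as `node_cpu_utilization` might use them. The semantics are assumptions, not something this diff confirms: a sample at or above `FailThresholdPercentage` is treated as `fail`, at or above `WarnThresholdPercentage` as `warn`, otherwise `pass`, and the monitor only changes state once the last `ConsecutiveSamplesForStateTransition` samples agree. The helper names `sample_state` and `next_state` and the state names are hypothetical. The `*_pods_ready` entries clearly use their thresholds differently and are not covered here.

```ruby
require "json"

# Inline copy of one entry from the file above.
CONFIG = JSON.parse(<<~CONFIG_JSON)
  {
    "node_cpu_utilization": {
      "WarnThresholdPercentage": 80.0,
      "FailThresholdPercentage": 90.0,
      "ConsecutiveSamplesForStateTransition": 3
    }
  }
CONFIG_JSON

# Assumed mapping from a single utilization sample to a state.
def sample_state(percent, cfg)
  return "fail" if percent >= cfg["FailThresholdPercentage"]
  return "warn" if percent >= cfg["WarnThresholdPercentage"]

  "pass"
end

# Assumed transition rule: keep the current state unless the last N samples
# all map to the same state, where N is ConsecutiveSamplesForStateTransition.
def next_state(current_state, recent_percents, cfg)
  n = cfg["ConsecutiveSamplesForStateTransition"]
  return current_state if recent_percents.size < n

  states = recent_percents.last(n).map { |pct| sample_state(pct, cfg) }.uniq
  states.size == 1 ? states.first : current_state
end

cfg = CONFIG["node_cpu_utilization"]
puts next_state("pass", [85.0, 95.0, 92.0], cfg) # prints "pass" (warn, fail, fail disagree)
puts next_state("pass", [95.0, 93.0, 92.0], cfg) # prints "fail" (three fail samples in a row)
```

Requiring several agreeing samples is a common way to avoid flapping on a single spike, which fits the commit "Remove single sample flip configs" in the list above.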