Merged
188 commits
3c5b46d
Updatng release history
vishiy Aug 1, 2018
d31f588
fixing the plugin logs for emit stream
Aug 1, 2018
11fd5f6
updating log message
Aug 5, 2018
87a9cf8
Remove Log Processing from fluentd configuration
r-dilip Aug 16, 2018
308be41
Remove plugin references from base_container.data
r-dilip Aug 16, 2018
5bee0af
Merge pull request #124 from Microsoft/dilipr/fluentdConfigUpdates
r-dilip Aug 30, 2018
bcd1a3f
Dilipr/fluent bit log processing (#126)
r-dilip Sep 14, 2018
b02f2ec
Dilipr/glide updates (#127)
r-dilip Sep 14, 2018
e01c678
containerID="" for pull issues
vishiy Sep 17, 2018
b0ba22d
Using KubeAPI for getting image,name. Adding more logs (#129)
r-dilip Sep 18, 2018
9783419
Dilipr/mark comments (#130)
r-dilip Sep 27, 2018
8e35b73
Rashmi/segfault latest (#132)
rashmichandrashekar Sep 27, 2018
4b63021
Adding a missed null check (#135)
rashmichandrashekar Sep 27, 2018
8b964fd
reusing some variables (#136)
rashmichandrashekar Sep 28, 2018
938c2ed
Rashmi/cjson delete null check (#138)
rashmichandrashekar Sep 28, 2018
fbfdf11
updating log level to debug for some provider workflows (#139)
rashmichandrashekar Oct 3, 2018
d426066
Fixing CPU Utilization and removing Fluent-bit filters (#140)
r-dilip Oct 4, 2018
c2cabab
Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. C…
r-dilip Oct 9, 2018
32567db
* Change FluentBit flush interval to 30 secs (from 5 secs)
vishiy Oct 10, 2018
afc981d
Container Log Telemetry
r-dilip Oct 12, 2018
4b958dd
Fixing an issue with Send Init Event if Telemetry is not initialized …
r-dilip Oct 12, 2018
510ef9f
PR feedback
r-dilip Oct 12, 2018
684c39b
PR feedback
r-dilip Oct 12, 2018
e165275
Sending an event every 5 mins(Heartbeat) (#146)
r-dilip Oct 15, 2018
eecb5db
Merge branch 'ci_feature_prod' into ci_feature
vishiy Oct 16, 2018
cfe1ca9
PR feedback to cleanup removed workflows
vishiy Oct 16, 2018
892b51c
updating agent version for telemetry
vishiy Oct 16, 2018
9c83160
updating agent version
vishiy Oct 17, 2018
f0b5a61
Telemetry Updates (#149)
r-dilip Oct 25, 2018
a58998e
Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159)
r-dilip Oct 30, 2018
4c2da9f
Rashmi/fluentd docker inventory (#160)
rashmichandrashekar Nov 5, 2018
6698fcd
Fix Telemetry Bug -- Initialize Telemetry Client after Initializing a…
r-dilip Nov 8, 2018
ad6bb93
Fix kube events memory leak due to yaml serialization for > 5k events…
vishiy Nov 12, 2018
eff92df
Setting Timeout for HTTP Client in PostDataHelper in outoms go plugi…
r-dilip Nov 14, 2018
9893e36
Vishwa/perftelemetry 2 (#165)
vishiy Nov 16, 2018
4f3c898
environment variable fix (#166)
rashmichandrashekar Nov 27, 2018
5e16467
Fixing a bug where we were crashing due to container statuses not pre…
vishiy Nov 27, 2018
b482b1e
Updating title
vishiy Nov 29, 2018
d75ba89
updating right versions for last release
vishiy Nov 29, 2018
cbd815c
Updating the break condition to look for end of response (#168)
rashmichandrashekar Nov 29, 2018
d0d5bf7
updating AgentVersion for telemetry
vishiy Nov 29, 2018
bfe27e5
Updating readme for latest release changes
vishiy Nov 29, 2018
5677560
Merge branch 'ci_feature_prod' into ci_feature
vishiy Nov 29, 2018
a621f88
Changes - (#173)
vishiy Dec 17, 2018
c9cf4fd
Rashmi/kubenodeinventory (#174)
rashmichandrashekar Dec 17, 2018
df6f122
Get cpuusage from usageseconds (#175)
vishiy Dec 20, 2018
dac9931
Rashmi/kubenodeinventory (#176)
rashmichandrashekar Dec 21, 2018
04cc1a8
Rashmi/kubenodeinventory (#178)
rashmichandrashekar Dec 26, 2018
5883f53
Fixing an issue on the cpurate metric, which happens for the first ti…
vishiy Dec 26, 2018
191f328
Rashmi/kubenodeinventory (#180)
rashmichandrashekar Dec 28, 2018
7e52e8c
Exclude docker containers from container inventory (#181)
rashmichandrashekar Jan 7, 2019
f0591f9
Exclude pauseamd64 containers from container inventory (#182)
rashmichandrashekar Jan 8, 2019
99e8813
Merge branch 'ci_feature_prod' into ci_feature
vishiy Jan 9, 2019
4782435
Update agent version
vishiy Jan 9, 2019
23bcc41
Updating readme for the latest release
vishiy Jan 9, 2019
51d5e93
Fix indentation in kube.conf and update readme (#184)
rashmichandrashekar Jan 11, 2019
decf86a
updating agent tag
rashmichandrashekar Jan 11, 2019
a1b35db
Get Pods for current Node Only (#185)
r-dilip Jan 29, 2019
22649ba
changes for container node inventory fixed type (#186)
rashmichandrashekar Jan 30, 2019
61e2eaf
Fix for mooncake (disable telemetry optionally) (#191)
vishiy Feb 13, 2019
30dff41
CustomMetrics to ci_feature (#193)
r-dilip Feb 15, 2019
f1b0cd2
add ContainerNotRunning column to KubePodInventory
bragi92 Jan 24, 2019
616a803
merge pr feedback: update name to ContainerStatusReason
bragi92 Jan 24, 2019
c33ca34
Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kuber…
r-dilip Feb 19, 2019
2651750
No Retries for non 404 4xx errors (#196)
r-dilip Feb 20, 2019
195bc33
Update agent version for telemetry
vishiy Feb 20, 2019
59d6c61
Update readme for upcoming (ciprod01202019) release
vishiy Feb 20, 2019
0189bc0
fix readme formatting
vishiy Feb 20, 2019
8221d2d
fix formatting for readme
vishiy Feb 20, 2019
30aa305
fix formatting for readme
vishiy Feb 20, 2019
f401116
fix readme
vishiy Feb 20, 2019
a2f45af
fix readme
vishiy Feb 21, 2019
759dbb5
fix agent version for telemetry
vishiy Feb 21, 2019
8bff5f9
Merge branch 'ci_feature_prod' into ci_feature
vishiy Feb 21, 2019
7956f40
fix date in readme
vishiy Feb 21, 2019
ee05656
update readme
vishiy Feb 21, 2019
2abcf67
Restart logs every 10MB instead of weekly (#198)
r-dilip Feb 21, 2019
18c107c
update agent version for telemetry
vishiy Feb 21, 2019
14b2b87
update readme
vishiy Feb 21, 2019
a1b551f
Merge branch 'ci_feature_prod' into ci_feature
vishiy Feb 21, 2019
5479dff
Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path
rashmichandrashekar Feb 22, 2019
cdded2e
Fix AKSEngine Crash (#200)
r-dilip Mar 4, 2019
57be1c4
hotfix
vishiy Mar 13, 2019
940a6eb
fix readme for new version
vishiy Mar 13, 2019
154fe56
Merge branch 'ci_feature_prod' into ci_feature
vishiy Mar 13, 2019
4115824
Fix the pod count in mdm agent plugin (#203)
r-dilip Mar 13, 2019
df2e64c
Update readme
vishiy Mar 13, 2019
cb90658
Merge branch 'ci_feature_prod' into ci_feature
vishiy Mar 13, 2019
19c2bc7
string freeze for out_mdm plugin
vishiy Mar 13, 2019
69935b3
Vishwa/resourcecentric (#208)
vishiy Apr 1, 2019
6953f50
Rashmi/win nodepool - PR (#206)
rashmichandrashekar Apr 1, 2019
ebdd8cc
adding os to container inventory for windows nodes (#210)
rashmichandrashekar Apr 8, 2019
d7b8cff
Fix omsagent crash Error when kube-api returns non-200, send events f…
r-dilip Apr 8, 2019
c9bb623
updating to lowercase compare for units (#212)
rashmichandrashekar Apr 10, 2019
3a88db8
Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214)
vishiy Apr 16, 2019
8cdf724
Fix telemetry error for telegraf err count metric (#215)
vishiy Apr 18, 2019
d2d5f0e
Merge branch 'ci_feature_prod' into ci_feature
vishiy Apr 18, 2019
36c8037
Fix Unscheduled Pod bug, remove excess telemetry (#218)
r-dilip May 31, 2019
803f934
Merge from Vishwa/promstandardmetrics into ci_feature (#220)
vishiy Jun 6, 2019
afc66b7
merge config/settings to ci_feature (#221)
vishiy Jun 6, 2019
727d5bd
Fix Scenario when Controller name is empty (#222)
r-dilip Jun 6, 2019
5e4b0f3
fix ;
vishiy Jun 7, 2019
6fefcac
ContainerLog collection optimizations (#223)
vishiy Jun 8, 2019
f87349e
merge final changes for release from Vishwa/june2019agentrel to ci_f…
vishiy Jun 10, 2019
195f82b
Merge branch 'ci_feature_prod' into ci_feature
vishiy Jun 10, 2019
8a412c1
fix fluent bit tuning for perf run (#226)
vishiy Jun 14, 2019
f613f2a
Merge branch 'ci_feature_prod' into ci_feature
vishiy Jun 14, 2019
e36b5ab
fix merge issue
vishiy Jun 14, 2019
8ba1f86
add release notes for june release in ci_feature branch
rashmichandrashekar Jun 21, 2019
e7e9e6d
fix title
rashmichandrashekar Jun 21, 2019
3903a9d
update
rashmichandrashekar Jun 21, 2019
f5b54fe
fix title
rashmichandrashekar Jun 21, 2019
1d32cec
Trim spaces in AKS_REGION (#233)
r-dilip Jul 5, 2019
5b8c52e
Add Logs Size To Telemetry (#234)
r-dilip Jul 9, 2019
5fc0f1b
Merge Vishwa/promcustommetrics to ci_feature (#237)
rashmichandrashekar Jul 9, 2019
5ab1944
Merge branch 'ci_feature_prod' into ci_feature
rashmichandrashekar Jul 9, 2019
4b8708b
Fix Region space error (#239)
r-dilip Jul 10, 2019
1cd9eee
Removing buffer chunk size and buffer max size from fluentbit conf (…
rashmichandrashekar Jul 10, 2019
e96e20a
Merge branch 'ci_feature_prod' into ci_feature
rashmichandrashekar Jul 10, 2019
788ab8b
changes (#243)
rashmichandrashekar Jul 11, 2019
5ee482b
Collect container last state (#235)
daweim0 Jul 15, 2019
378cc93
Rashmi/fix prom telemetry (#247)
rashmichandrashekar Aug 12, 2019
df60197
Merge Health Model work into ci_feature behind a feature flag Pending…
r-dilip Aug 14, 2019
4adcd8b
Fix Deserialization Bug (#249)
r-dilip Aug 16, 2019
2ee4307
Fix the bug where capacity is not updated and cached value was being …
r-dilip Aug 16, 2019
e86f82f
changes (#250)
rashmichandrashekar Aug 16, 2019
c76ce47
Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253)
r-dilip Aug 16, 2019
10a79c8
Add Missing Handlers (#254)
r-dilip Aug 19, 2019
851ab4e
Return MultiEventStream.new instead of empty array (#256)
r-dilip Aug 21, 2019
f20debb
Added explicit require_relative to avoid loading errors (#258)
r-dilip Aug 23, 2019
a8804df
Gangams/enable ai telemetry in mc (#252)
ganga1980 Aug 28, 2019
8a5ebb0
Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set s…
r-dilip Sep 10, 2019
a939bf7
Changes for creating custom plugins with namespace settings for prome…
rashmichandrashekar Sep 11, 2019
2a07233
Cherry-pick hotfix 09092019 to ci_feature (#265)
r-dilip Sep 12, 2019
2fee9fd
Gangams/add telemetry hybrid (#264)
ganga1980 Sep 23, 2019
5eea104
KubeMonAgentEvents changes to collect configuration events (#267)
rashmichandrashekar Oct 2, 2019
c472b12
Fix the Dupe Perf Data Issue from the DaemonSet (#266)
r-dilip Sep 26, 2019
98e4114
PR for 1. Container Memory CPU monitor 2. Configuration for Node Cond…
r-dilip Oct 3, 2019
382ed02
init containers fix and other bug fixes (#269)
rashmichandrashekar Oct 4, 2019
3079471
Send agg monitor signal on details change (#270)
r-dilip Oct 7, 2019
d16e2b0
resolving conflicts with ci_feature_prod
Oct 7, 2019
de2e1da
bug fixes for error (#274)
rashmichandrashekar Oct 10, 2019
e4b91c5
Fix to use declaration and assignment instead of assignment (#275)
rashmichandrashekar Oct 10, 2019
cf5e85c
1. Added telemetry (#277)
r-dilip Oct 10, 2019
e8529b2
Bug fix to remove unused variable (#281)
rashmichandrashekar Oct 10, 2019
1a41492
Merge branch 'ci_feature_prod' into ci_feature
rashmichandrashekar Oct 10, 2019
8a4147d
Fix the WARN<->WARNING typo (#282)
r-dilip Oct 11, 2019
ceb1a67
Merge branch 'ci_feature_prod' into ci_feature
Oct 11, 2019
4780c3e
Bug Fixes 1. telemetry send throwing exception if records not initia…
r-dilip Oct 14, 2019
a421c97
Merge branch 'ci_feature_prod' into ci_feature
r-dilip Oct 14, 2019
981018c
Fix Require relative revert (#287)
r-dilip Oct 18, 2019
41aca6e
Merge branch 'ci_feature_prod' into ci_feature
Oct 18, 2019
edaa963
Bug Fixes for exceptions in telemetry, remove limit set check (#289)
r-dilip Nov 1, 2019
568b2ed
Merge ci_feature_prod to ci_feature
r-dilip Nov 1, 2019
22bd43d
Fix the bug where if a warning condition appears before fail conditio…
r-dilip Nov 5, 2019
7cd9d76
Merge branch 'ci_feature_prod' into ci_feature
Nov 5, 2019
d1a2fbf
Merge branch 'ci_feature' of https://github.com/Microsoft/Docker-Prov…
Nov 5, 2019
920f101
Merge ci_feature_prod to ci_feature
r-dilip Nov 5, 2019
40f47a9
Fix for Nodes Aspect not showing up in draft cluster (#294)
r-dilip Nov 5, 2019
16055be
Fix the issue where the health tree is inconsistent if a deployment i…
r-dilip Nov 6, 2019
84f4aef
Merge branch 'ci_feature_prod' into ci_feature
Nov 6, 2019
2d861cc
Rashmi/1 16 test (#297)
rashmichandrashekar Nov 12, 2019
844afbd
Fix duplicate records in container memory/cpu samples (#298)
r-dilip Nov 12, 2019
9a8f0f8
Update MDM region list to include francecentral, japaneast and austra…
bragi92 Nov 14, 2019
597b2fb
Update MDM region list to include francecentral, japaneast and austra…
bragi92 Nov 14, 2019
cd1a37b
Send telemetry when there is error in calculation of state in percent…
r-dilip Nov 15, 2019
d6ea189
fix exceptions (#306)
rashmichandrashekar Nov 26, 2019
3df0ab6
Merge Branch morgan into ci_feature (#308)
vishiy Dec 4, 2019
8526802
Update Readme
vishiy Dec 4, 2019
c766d73
add back timeofcommand (#310)
vishiy Dec 4, 2019
81052ed
Merge branch 'ci_feature_prod' into ci_feature
vishiy Dec 4, 2019
8dfa313
update readme for timeofcommand fix (#314)
vishiy Dec 4, 2019
a0984af
Merge from ci_feature_prod into ci_feature (fix put back timeofcomman…
vishiy Dec 4, 2019
53a70cb
Merge branch 'ci_feature_prod' into ci_feature
vishiy Dec 4, 2019
deff7ac
Adding new cpu and memory limits to readme
rashmichandrashekar Dec 7, 2019
bf6b8a4
Merge branch 'ci_feature_prod' into ci_feature
Dec 7, 2019
4b1ef9c
CAdvisor to use 10255/10250 based on env variable (#321)
rashmichandrashekar Jan 7, 2020
6dc93e8
changing font for code change and customer impact
rashmichandrashekar Jan 7, 2020
f7732fb
Merge branch 'ci_feature_prod' into ci_feature
rashmichandrashekar Jan 7, 2020
044f13d
For ARO, stop collecting inventory of master and infra (#323)
ganga1980 Jan 24, 2020
acc1d27
MDM plugin support for large scale clusters (#324)
r-dilip Jan 28, 2020
0ea6c6e
Add Null check for kube api responses in in_kube_health (#325)
r-dilip Jan 29, 2020
843100c
Fix casing bug (#326)
r-dilip Feb 4, 2020
2c32e57
Missed kube.conf update (#327)
r-dilip Feb 7, 2020
b10fee9
changes to use msi if service principal does not exist (#328)
rashmichandrashekar Feb 21, 2020
f820075
Adding caseinsensitive compare (#330)
rashmichandrashekar Feb 24, 2020
03d90de
gpu monitoring (#329)
vishiy Feb 25, 2020
b0fc3ae
Merge branch 'ci_feature_prod' into ci_feature
Feb 25, 2020
16 changes: 15 additions & 1 deletion installer/conf/container.conf
@@ -110,5 +110,19 @@
retry_limit 10
retry_wait 5s
max_retry_wait 5m
- retry_mdm_post_wait_minutes 60
+ retry_mdm_post_wait_minutes 30
</match>

<match oms.api.InsightsMetrics**>
type out_oms
log_level debug
num_threads 5
buffer_type file
buffer_path %STATE_DIR_WS%/out_oms_insightsmetrics*.buffer
buffer_queue_full_action drop_oldest_chunk
buffer_chunk_limit 4m
flush_interval 20s
retry_limit 10
retry_wait 5s
max_retry_wait 5m
</match>
20 changes: 18 additions & 2 deletions installer/conf/kube.conf
@@ -13,6 +13,7 @@
tag oms.containerinsights.KubePodInventory
run_interval 60
log_level debug
custom_metrics_azure_regions eastus,southcentralus,westcentralus,westus2,southeastasia,northeurope,westeurope,southafricanorth,centralus,northcentralus,eastus2,koreacentral,eastasia,centralindia,uksouth,canadacentral,francecentral,japaneast,australiaeast
</source>

#Kubernetes events
@@ -47,7 +48,7 @@
log_level debug
</source>

- <filter mdm.kubepodinventory** mdm.kubenodeinventory**>
+ <filter mdm.kubenodeinventory**>
type filter_inventory2mdm
custom_metrics_azure_regions eastus,southcentralus,westcentralus,westus2,southeastasia,northeurope,westeurope,southafricanorth,centralus,northcentralus,eastus2,koreacentral,eastasia,centralindia,uksouth,canadacentral,francecentral,japaneast,australiaeast
log_level info
@@ -140,7 +141,7 @@
max_retry_wait 5m
</match>

<match oms.api.KubePerf**>
<match oms.api.KubePerf**>
type out_oms
log_level debug
num_threads 5
@@ -215,4 +216,19 @@
retry_limit 10
retry_wait 5s
max_retry_wait 5m
</match>

<match oms.api.InsightsMetrics**>
type out_oms
log_level debug
num_threads 5
buffer_chunk_limit 4m
buffer_type file
buffer_path %STATE_DIR_WS%/out_oms_insightsmetrics*.buffer
buffer_queue_limit 20
buffer_queue_full_action drop_oldest_chunk
flush_interval 20s
retry_limit 10
retry_wait 5s
max_retry_wait 5m
</match>
4 changes: 3 additions & 1 deletion installer/datafiles/base_container.data
@@ -35,8 +35,10 @@ MAINTAINER: 'Microsoft Corporation'
/opt/microsoft/omsagent/plugin/in_win_cadvisor_perf.rb; source/code/plugin/in_win_cadvisor_perf.rb; 644; root; root
/opt/microsoft/omsagent/plugin/in_kube_nodes.rb; source/code/plugin/in_kube_nodes.rb; 644; root; root
/opt/microsoft/omsagent/plugin/filter_inventory2mdm.rb; source/code/plugin/filter_inventory2mdm.rb; 644; root; root
/opt/microsoft/omsagent/plugin/podinventory_to_mdm.rb; source/code/plugin/podinventory_to_mdm.rb; 644; root; root
/opt/microsoft/omsagent/plugin/kubelet_utils.rb; source/code/plugin/kubelet_utils.rb; 644; root; root
/opt/microsoft/omsagent/plugin/CustomMetricsUtils.rb; source/code/plugin/CustomMetricsUtils.rb; 644; root; root

/opt/microsoft/omsagent/plugin/constants.rb; source/code/plugin/constants.rb; 644; root; root

/opt/microsoft/omsagent/plugin/ApplicationInsightsUtility.rb; source/code/plugin/ApplicationInsightsUtility.rb; 644; root; root
/opt/microsoft/omsagent/plugin/ContainerInventoryState.rb; source/code/plugin/ContainerInventoryState.rb; 644; root; root
12 changes: 12 additions & 0 deletions installer/scripts/tomlparser.rb
@@ -16,6 +16,7 @@
@logExclusionRegexPattern = "(^((?!stdout|stderr).)*$)"
@excludePath = "*.csv2" #some invalid path
@enrichContainerLogs = false
@collectAllKubeEvents = false

# Use parser to parse the configmap toml file to a ruby structure
def parseConfigMap
@@ -128,6 +129,16 @@ def populateSettingValuesFromConfigMap(parsedConfig)
rescue => errorStr
ConfigParseErrorLogger.logError("Exception while reading config map settings for cluster level container log enrichment - #{errorStr}, using defaults, please check config map for errors")
end

#Get kube events enrichment setting
begin
if !parsedConfig[:log_collection_settings][:collect_all_kube_events].nil? && !parsedConfig[:log_collection_settings][:collect_all_kube_events][:enabled].nil?
@collectAllKubeEvents = parsedConfig[:log_collection_settings][:collect_all_kube_events][:enabled]
puts "config::Using config map setting for kube event collection"
end
rescue => errorStr
ConfigParseErrorLogger.logError("Exception while reading config map settings for kube event collection - #{errorStr}, using defaults, please check config map for errors")
end
end
end

@@ -168,6 +179,7 @@ def populateSettingValuesFromConfigMap(parsedConfig)
file.write("export AZMON_CLUSTER_COLLECT_ENV_VAR=#{@collectClusterEnvVariables}\n")
file.write("export AZMON_CLUSTER_LOG_TAIL_EXCLUDE_PATH=#{@excludePath}\n")
file.write("export AZMON_CLUSTER_CONTAINER_LOG_ENRICH=#{@enrichContainerLogs}\n")
file.write("export AZMON_CLUSTER_COLLECT_ALL_KUBE_EVENTS=#{@collectAllKubeEvents}\n")
# Close file after writing all environment variables
file.close
puts "Both stdout & stderr log collection are turned off for namespaces: '#{@excludePath}' "
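
Review note: the new kube-event setting is read through a chain of `[]` lookups guarded by `nil?` checks and a `rescue`. As a minimal sketch of the same lookup, assuming the parsed config map is a symbol-keyed nested hash as the diff suggests, `Hash#dig` can collapse the nil checks. The helper names `collect_all_kube_events?` and `export_line` and the sample hash are hypothetical and are not part of this PR.

```ruby
# Hypothetical helper, not part of the PR: reads the kube-event setting from a
# parsed config map and falls back to the default when any level is missing.
def collect_all_kube_events?(parsed_config, default = false)
  value = parsed_config.dig(:log_collection_settings, :collect_all_kube_events, :enabled)
  value.nil? ? default : value
end

# Builds the same export line that tomlparser.rb writes to the env file.
def export_line(enabled)
  "export AZMON_CLUSTER_COLLECT_ALL_KUBE_EVENTS=#{enabled}\n"
end

# Stand-in for the symbol-keyed hash the TOML parser is assumed to return.
sample_config = { log_collection_settings: { collect_all_kube_events: { enabled: true } } }

puts export_line(collect_all_kube_events?(sample_config))
puts export_line(collect_all_kube_events?({}))
```

With `dig`, a config map that omits `log_collection_settings` entirely yields the default without relying on the `rescue` branch.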
237 changes: 176 additions & 61 deletions source/code/plugin/CAdvisorMetricsAPIClient.rb
@@ -13,6 +13,7 @@ class CAdvisorMetricsAPIClient
require_relative "oms_common"
require_relative "KubernetesApiClient"
require_relative "ApplicationInsightsUtility"
require_relative "constants"

@configMapMountPath = "/etc/config/settings/log-data-collection-settings"
@promConfigMountPath = "/etc/config/settings/prometheus-data-collection-settings"
@@ -55,85 +56,58 @@ class CAdvisorMetricsAPIClient
# Keeping track of containers so that can delete the container from the container cpu cache when the container is deleted
# as a part of the cleanup routine
@@winContainerIdCache = []

#cadvisor ports
@@CADVISOR_SECURE_PORT = "10250"
@@CADVISOR_NON_SECURE_PORT = "10255"
def initialize
end

class << self
def getSummaryStatsFromCAdvisor(winNode)
headers = {}
response = nil
@Log.info "Getting CAdvisor Uri"
begin
cAdvisorSecurePort = false
# Check to see if omsagent needs to use 10255(insecure) port or 10250(secure) port
if !@cAdvisorMetricsSecurePort.nil? && @cAdvisorMetricsSecurePort == "true"
cAdvisorSecurePort = true
end

cAdvisorUri = getCAdvisorUri(winNode, cAdvisorSecurePort)
bearerToken = File.read("/var/run/secrets/kubernetes.io/serviceaccount/token")
@Log.info "cAdvisorUri: #{cAdvisorUri}"
relativeUri = "/stats/summary"
return getResponse(winNode, relativeUri)
end

if !cAdvisorUri.nil?
uri = URI.parse(cAdvisorUri)
if !!cAdvisorSecurePort == true
Net::HTTP.start(uri.host, uri.port,
:use_ssl => true, :open_timeout => 20, :read_timeout => 40,
:ca_file => "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt",
:verify_mode => OpenSSL::SSL::VERIFY_NONE) do |http|
cAdvisorApiRequest = Net::HTTP::Get.new(uri.request_uri)
cAdvisorApiRequest["Authorization"] = "Bearer #{bearerToken}"
response = http.request(cAdvisorApiRequest)
@Log.info "Got response code #{response.code} from #{uri.request_uri}"
end
else
Net::HTTP.start(uri.host, uri.port, :use_ssl => false, :open_timeout => 20, :read_timeout => 40) do |http|
cAdvisorApiRequest = Net::HTTP::Get.new(uri.request_uri)
response = http.request(cAdvisorApiRequest)
@Log.info "Got response code #{response.code} from #{uri.request_uri}"
end
end
end
rescue => error
@Log.warn("CAdvisor api request failed: #{error}")
telemetryProps = {}
telemetryProps["Computer"] = winNode["Hostname"]
ApplicationInsightsUtility.sendExceptionTelemetry(error, telemetryProps)
end
return response
def getNodeCapacityFromCAdvisor(winNode: nil)
relativeUri = "/spec/"
return getResponse(winNode, relativeUri)
end

def getCAdvisorUri(winNode, cAdvisorSecurePort)
begin
def getBaseCAdvisorUri(winNode)
cAdvisorSecurePort = isCAdvisorOnSecurePort()

if !!cAdvisorSecurePort == true
defaultHost = "https://localhost:10250"
defaultHost = "https://localhost:#{@@CADVISOR_SECURE_PORT}"
else
defaultHost = "http://localhost:10255"
defaultHost = "http://localhost:#{@@CADVISOR_NON_SECURE_PORT}"
end

relativeUri = "/stats/summary"
if !winNode.nil?
nodeIP = winNode["InternalIP"]
nodeIP = winNode["InternalIP"]
else
nodeIP = ENV["NODE_IP"]
nodeIP = ENV["NODE_IP"]
end

if !nodeIP.nil?
@Log.info("Using #{nodeIP + relativeUri} for CAdvisor Uri")
if !!cAdvisorSecurePort == true
return "https://#{nodeIP}:10250" + relativeUri
else
return "http://#{nodeIP}:10255" + relativeUri
end
@Log.info("Using #{nodeIP} for CAdvisor Host")
if !!cAdvisorSecurePort == true
return "https://#{nodeIP}:#{@@CADVISOR_SECURE_PORT}"
else
return "http://#{nodeIP}:#{@@CADVISOR_NON_SECURE_PORT}"
end
else
@Log.warn ("NODE_IP environment variable not set. Using default as : #{defaultHost + relativeUri} ")
if !winNode.nil?
return nil
else
return defaultHost + relativeUri
end
@Log.warn ("NODE_IP environment variable not set. Using default as : #{defaultHost}")
if !winNode.nil?
return nil
else
return defaultHost
end
end
end
end

def getCAdvisorUri(winNode, relativeUri)
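baseUri = getBaseCAdvisorUri(winNode)
# getBaseCAdvisorUri returns nil for a Windows node without an IP; propagate nil instead of raising on nil + String
return baseUri.nil? ? nil : baseUri + relativeUri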
end

def getMetrics(winNode: nil, metricTime: Time.now.utc.iso8601)
@@ -282,6 +256,101 @@ def getContainerCpuMetricItems(metricJSON, hostName, cpuMetricNameToCollect, met
return metricItems
end

def getInsightsMetrics(winNode: nil, metricTime: Time.now.utc.iso8601)
metricDataItems = []
begin
cAdvisorStats = getSummaryStatsFromCAdvisor(winNode)
if !cAdvisorStats.nil?
metricInfo = JSON.parse(cAdvisorStats.body)
end
if !winNode.nil?
hostName = winNode["Hostname"]
operatingSystem = "Windows"
else
if !metricInfo.nil? && !metricInfo["node"].nil? && !metricInfo["node"]["nodeName"].nil?
hostName = metricInfo["node"]["nodeName"]
else
hostName = (OMS::Common.get_hostname)
end
operatingSystem = "Linux"
end
if !metricInfo.nil?
metricDataItems.concat(getContainerGpuMetricsAsInsightsMetrics(metricInfo, hostName, "memoryTotal", "containerGpumemoryTotalBytes", metricTime))
metricDataItems.concat(getContainerGpuMetricsAsInsightsMetrics(metricInfo, hostName, "memoryUsed","containerGpumemoryUsedBytes", metricTime))
metricDataItems.concat(getContainerGpuMetricsAsInsightsMetrics(metricInfo, hostName, "dutyCycle","containerGpuDutyCycle", metricTime))
else
@Log.warn("Couldn't get Insights metrics information for host: #{hostName} os:#{operatingSystem}")
end
rescue => error
@Log.warn("CAdvisorMetricsAPIClient::getInsightsMetrics failed: #{error}")
return metricDataItems
end
return metricDataItems
end

def getContainerGpuMetricsAsInsightsMetrics(metricJSON, hostName, metricNameToCollect, metricNametoReturn, metricPollTime)
metricItems = []
clusterId = KubernetesApiClient.getClusterId
clusterName = KubernetesApiClient.getClusterName
begin
metricInfo = metricJSON
metricInfo["pods"].each do |pod|
podUid = pod["podRef"]["uid"]
podName = pod["podRef"]["name"]
podNamespace = pod["podRef"]["namespace"]

if (!pod["containers"].nil?)
pod["containers"].each do |container|
#gpu metrics
if (!container["accelerators"].nil?)
container["accelerators"].each do |accelerator|
if (!accelerator[metricNameToCollect].nil?) #empty check is invalid for non-strings
containerName = container["name"]
metricValue = accelerator[metricNameToCollect]


metricItem = {}
metricItem["CollectionTime"] = metricPollTime
metricItem["Computer"] = hostName
metricItem["Name"] = metricNametoReturn
metricItem["Value"] = metricValue
metricItem["Origin"] = Constants::INSIGHTSMETRICS_TAGS_ORIGIN
metricItem["Namespace"] = Constants::INSIGHTSMETRICS_TAGS_GPU_NAMESPACE

metricTags = {}
metricTags[Constants::INSIGHTSMETRICS_TAGS_CLUSTERID ] = clusterId
metricTags[Constants::INSIGHTSMETRICS_TAGS_CLUSTERNAME] = clusterName
metricTags[Constants::INSIGHTSMETRICS_TAGS_CONTAINER_NAME] = podUid + "/" + containerName
#metricTags[Constants::INSIGHTSMETRICS_TAGS_K8SNAMESPACE] = podNameSpace

if (!accelerator["make"].nil? && !accelerator["make"].empty?)
metricTags[Constants::INSIGHTSMETRICS_TAGS_GPU_VENDOR] = accelerator["make"]
end

if (!accelerator["model"].nil? && !accelerator["model"].empty?)
metricTags[Constants::INSIGHTSMETRICS_TAGS_GPU_MODEL] = accelerator["model"]
end

if (!accelerator["id"].nil? && !accelerator["id"].empty?)
metricTags[Constants::INSIGHTSMETRICS_TAGS_GPU_ID] = accelerator["id"]
end

metricItem["Tags"] = metricTags

metricItems.push(metricItem)
end
end
end
end
end
end
rescue => errorStr
@Log.warn("getContainerGpuMetricsAsInsightsMetrics failed: #{errorStr} for metric #{metricNameToCollect}")
return metricItems
end
return metricItems
end

def clearDeletedWinContainersFromCache()
begin
winCpuUsageNanoSecondsKeys = @@winContainerCpuUsageNanoSecondsLast.keys
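
Review note: `getContainerGpuMetricsAsInsightsMetrics` walks pods, then containers, then accelerators, and emits one record per accelerator that reports the requested field. The sketch below shows that traversal in isolation on an in-memory hash. The function name `gpu_metric_items`, the tag keys, and the sample payload are all hypothetical: the real tag names come from `constants.rb`, which this diff does not show, and the real method also adds cluster ID and cluster name tags.

```ruby
# Hypothetical, simplified version of the accelerator walk; tag names and the
# sample payload below are illustrative and not taken from constants.rb.
def gpu_metric_items(summary, host_name, source_field, metric_name, poll_time)
  items = []
  (summary["pods"] || []).each do |pod|
    pod_uid = pod.dig("podRef", "uid")
    (pod["containers"] || []).each do |container|
      (container["accelerators"] || []).each do |accelerator|
        value = accelerator[source_field]
        next if value.nil? # numeric fields cannot be checked with empty?

        tags = { "containerName" => "#{pod_uid}/#{container["name"]}" }
        # Optional descriptive fields are copied only when present and non-empty.
        %w[make model id].each do |key|
          field = accelerator[key]
          tags[key] = field unless field.nil? || field.to_s.empty?
        end

        items << {
          "CollectionTime" => poll_time,
          "Computer" => host_name,
          "Name" => metric_name,
          "Value" => value,
          "Tags" => tags,
        }
      end
    end
  end
  items
end

# Made-up payload shaped like the /stats/summary fields the diff reads.
sample = {
  "pods" => [
    { "podRef" => { "uid" => "pod-1", "name" => "trainer", "namespace" => "default" },
      "containers" => [
        { "name" => "cuda",
          "accelerators" => [
            { "make" => "nvidia", "model" => "example-model", "id" => "GPU-0",
              "memoryTotal" => 1024, "memoryUsed" => 256, "dutyCycle" => 40 },
          ] },
        { "name" => "sidecar" }, # no accelerators, so no records
      ] },
  ],
}

items = gpu_metric_items(sample, "node-a", "memoryUsed", "containerGpumemoryUsedBytes", "2020-02-25T00:00:00Z")
puts items.length
puts items.first["Tags"]["containerName"]
```

Defaulting missing collections to `[]` keeps the nesting shallow compared with the explicit `nil?` checks in the PR; whether that trade-off suits the plugin is a judgment call for the authors.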
@@ -696,5 +765,51 @@ def getContainerStartTimeMetricItems(metricJSON, hostName, metricNametoReturn, m
end
return metricItems
end

def getResponse(winNode, relativeUri)
response = nil
@Log.info "Getting CAdvisor Uri Response"
bearerToken = File.read("/var/run/secrets/kubernetes.io/serviceaccount/token")
begin
cAdvisorUri = getCAdvisorUri(winNode, relativeUri)
@Log.info "cAdvisorUri: #{cAdvisorUri}"

if !cAdvisorUri.nil?
uri = URI.parse(cAdvisorUri)
if isCAdvisorOnSecurePort()
Net::HTTP.start(uri.host, uri.port,
:use_ssl => true, :open_timeout => 20, :read_timeout => 40,
:ca_file => "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt",
:verify_mode => OpenSSL::SSL::VERIFY_NONE) do |http|
cAdvisorApiRequest = Net::HTTP::Get.new(uri.request_uri)
cAdvisorApiRequest["Authorization"] = "Bearer #{bearerToken}"
response = http.request(cAdvisorApiRequest)
@Log.info "Got response code #{response.code} from #{uri.request_uri}"
end
else
Net::HTTP.start(uri.host, uri.port, :use_ssl => false, :open_timeout => 20, :read_timeout => 40) do |http|
cAdvisorApiRequest = Net::HTTP::Get.new(uri.request_uri)
response = http.request(cAdvisorApiRequest)
@Log.info "Got response code #{response.code} from #{uri.request_uri}"
end
end
end
rescue => error
@Log.warn("CAdvisor api request for #{cAdvisorUri} failed: #{error}")
telemetryProps = {}
# winNode is nil on Linux nodes; guard so the rescue block itself cannot raise
telemetryProps["Computer"] = winNode["Hostname"] if !winNode.nil?
ApplicationInsightsUtility.sendExceptionTelemetry(error, telemetryProps)
end
return response
end

def isCAdvisorOnSecurePort
cAdvisorSecurePort = false
# Check to see whether omsagent needs to use 10255(insecure) port or 10250(secure) port
if !@cAdvisorMetricsSecurePort.nil? && @cAdvisorMetricsSecurePort == "true"
cAdvisorSecurePort = true
end
return cAdvisorSecurePort
end
end
end
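
Review note: the refactor splits URI construction into a base URI (scheme, host, port) plus a relative path, so `/stats/summary` and `/spec/` share one code path. The sketch below restates that selection logic as a single pure function so the cases are easy to see. `cadvisor_uri` and its `windows_node:` flag are hypothetical; the PR itself reads the node IP from `winNode["InternalIP"]` or `ENV["NODE_IP"]` and the secure-port flag from an instance variable.

```ruby
SECURE_PORT = "10250"     # kubelet secure port used in the diff
NON_SECURE_PORT = "10255" # kubelet read-only port used in the diff

# Hypothetical pure function mirroring the host/port selection in the PR.
# Returns nil for a Windows node without an IP, as getBaseCAdvisorUri does.
def cadvisor_uri(node_ip, secure, relative_uri, windows_node: false)
  scheme, port = secure ? ["https", SECURE_PORT] : ["http", NON_SECURE_PORT]
  host = node_ip
  if host.nil?
    return nil if windows_node
    host = "localhost" # fallback when NODE_IP is not set on a Linux node
  end
  "#{scheme}://#{host}:#{port}#{relative_uri}"
end

puts cadvisor_uri("10.0.0.4", true, "/stats/summary")
puts cadvisor_uri(nil, false, "/spec/")
```

Keeping the function free of environment and logger access makes each branch (secure or not, IP present or not, Windows or Linux) checkable without a running kubelet.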