Merged
198 commits
3c5b46d
Updatng release history
vishiy Aug 1, 2018
d31f588
fixing the plugin logs for emit stream
Aug 1, 2018
11fd5f6
updating log message
Aug 5, 2018
87a9cf8
Remove Log Processing from fluentd configuration
r-dilip Aug 16, 2018
308be41
Remove plugin references from base_container.data
r-dilip Aug 16, 2018
5bee0af
Merge pull request #124 from Microsoft/dilipr/fluentdConfigUpdates
r-dilip Aug 30, 2018
bcd1a3f
Dilipr/fluent bit log processing (#126)
r-dilip Sep 14, 2018
b02f2ec
Dilipr/glide updates (#127)
r-dilip Sep 14, 2018
e01c678
containerID="" for pull issues
vishiy Sep 17, 2018
b0ba22d
Using KubeAPI for getting image,name. Adding more logs (#129)
r-dilip Sep 18, 2018
9783419
Dilipr/mark comments (#130)
r-dilip Sep 27, 2018
8e35b73
Rashmi/segfault latest (#132)
rashmichandrashekar Sep 27, 2018
4b63021
Adding a missed null check (#135)
rashmichandrashekar Sep 27, 2018
8b964fd
reusing some variables (#136)
rashmichandrashekar Sep 28, 2018
938c2ed
Rashmi/cjson delete null check (#138)
rashmichandrashekar Sep 28, 2018
fbfdf11
updating log level to debug for some provider workflows (#139)
rashmichandrashekar Oct 3, 2018
d426066
Fixing CPU Utilization and removing Fluent-bit filters (#140)
r-dilip Oct 4, 2018
c2cabab
Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. C…
r-dilip Oct 9, 2018
32567db
* Change FluentBit flush interval to 30 secs (from 5 secs)
vishiy Oct 10, 2018
afc981d
Container Log Telemetry
r-dilip Oct 12, 2018
4b958dd
Fixing an issue with Send Init Event if Telemetry is not initialized …
r-dilip Oct 12, 2018
510ef9f
PR feedback
r-dilip Oct 12, 2018
684c39b
PR feedback
r-dilip Oct 12, 2018
e165275
Sending an event every 5 mins(Heartbeat) (#146)
r-dilip Oct 15, 2018
eecb5db
Merge branch 'ci_feature_prod' into ci_feature
vishiy Oct 16, 2018
cfe1ca9
PR feedback to cleanup removed workflows
vishiy Oct 16, 2018
892b51c
updating agent version for telemetry
vishiy Oct 16, 2018
9c83160
updating agent version
vishiy Oct 17, 2018
f0b5a61
Telemetry Updates (#149)
r-dilip Oct 25, 2018
a58998e
Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159)
r-dilip Oct 30, 2018
4c2da9f
Rashmi/fluentd docker inventory (#160)
rashmichandrashekar Nov 5, 2018
6698fcd
Fix Telemetry Bug -- Initialize Telemetry Client after Initializing a…
r-dilip Nov 8, 2018
ad6bb93
Fix kube events memory leak due to yaml serialization for > 5k events…
vishiy Nov 12, 2018
eff92df
Setting Timeout for HTTP Client in PostDataHelper in outoms go plugi…
r-dilip Nov 14, 2018
9893e36
Vishwa/perftelemetry 2 (#165)
vishiy Nov 16, 2018
4f3c898
environment variable fix (#166)
rashmichandrashekar Nov 27, 2018
5e16467
Fixing a bug where we were crashing due to container statuses not pre…
vishiy Nov 27, 2018
b482b1e
Updating title
vishiy Nov 29, 2018
d75ba89
updating right versions for last release
vishiy Nov 29, 2018
cbd815c
Updating the break condition to look for end of response (#168)
rashmichandrashekar Nov 29, 2018
d0d5bf7
updating AgentVersion for telemetry
vishiy Nov 29, 2018
bfe27e5
Updating readme for latest release changes
vishiy Nov 29, 2018
5677560
Merge branch 'ci_feature_prod' into ci_feature
vishiy Nov 29, 2018
a621f88
Changes - (#173)
vishiy Dec 17, 2018
c9cf4fd
Rashmi/kubenodeinventory (#174)
rashmichandrashekar Dec 17, 2018
df6f122
Get cpuusage from usageseconds (#175)
vishiy Dec 20, 2018
dac9931
Rashmi/kubenodeinventory (#176)
rashmichandrashekar Dec 21, 2018
04cc1a8
Rashmi/kubenodeinventory (#178)
rashmichandrashekar Dec 26, 2018
5883f53
Fixing an issue on the cpurate metric, which happens for the first ti…
vishiy Dec 26, 2018
191f328
Rashmi/kubenodeinventory (#180)
rashmichandrashekar Dec 28, 2018
7e52e8c
Exclude docker containers from container inventory (#181)
rashmichandrashekar Jan 7, 2019
f0591f9
Exclude pauseamd64 containers from container inventory (#182)
rashmichandrashekar Jan 8, 2019
99e8813
Merge branch 'ci_feature_prod' into ci_feature
vishiy Jan 9, 2019
4782435
Update agent version
vishiy Jan 9, 2019
23bcc41
Updating readme for the latest release
vishiy Jan 9, 2019
51d5e93
Fix indentation in kube.conf and update readme (#184)
rashmichandrashekar Jan 11, 2019
decf86a
updating agent tag
rashmichandrashekar Jan 11, 2019
a1b35db
Get Pods for current Node Only (#185)
r-dilip Jan 29, 2019
22649ba
changes for container node inventory fixed type (#186)
rashmichandrashekar Jan 30, 2019
61e2eaf
Fix for mooncake (disable telemetry optionally) (#191)
vishiy Feb 13, 2019
30dff41
CustomMetrics to ci_feature (#193)
r-dilip Feb 15, 2019
f1b0cd2
add ContainerNotRunning column to KubePodInventory
bragi92 Jan 24, 2019
616a803
merge pr feedback: update name to ContainerStatusReason
bragi92 Jan 24, 2019
c33ca34
Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kuber…
r-dilip Feb 19, 2019
2651750
No Retries for non 404 4xx errors (#196)
r-dilip Feb 20, 2019
195bc33
Update agent version for telemetry
vishiy Feb 20, 2019
59d6c61
Update readme for upcoming (ciprod01202019) release
vishiy Feb 20, 2019
0189bc0
fix readme formatting
vishiy Feb 20, 2019
8221d2d
fix formatting for readme
vishiy Feb 20, 2019
30aa305
fix formatting for readme
vishiy Feb 20, 2019
f401116
fix readme
vishiy Feb 20, 2019
a2f45af
fix readme
vishiy Feb 21, 2019
759dbb5
fix agent version for telemetry
vishiy Feb 21, 2019
8bff5f9
Merge branch 'ci_feature_prod' into ci_feature
vishiy Feb 21, 2019
7956f40
fix date in readme
vishiy Feb 21, 2019
ee05656
update readme
vishiy Feb 21, 2019
2abcf67
Restart logs every 10MB instead of weekly (#198)
r-dilip Feb 21, 2019
18c107c
update agent version for telemetry
vishiy Feb 21, 2019
14b2b87
update readme
vishiy Feb 21, 2019
a1b551f
Merge branch 'ci_feature_prod' into ci_feature
vishiy Feb 21, 2019
5479dff
Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path
rashmichandrashekar Feb 22, 2019
cdded2e
Fix AKSEngine Crash (#200)
r-dilip Mar 4, 2019
57be1c4
hotfix
vishiy Mar 13, 2019
940a6eb
fix readme for new version
vishiy Mar 13, 2019
154fe56
Merge branch 'ci_feature_prod' into ci_feature
vishiy Mar 13, 2019
4115824
Fix the pod count in mdm agent plugin (#203)
r-dilip Mar 13, 2019
df2e64c
Update readme
vishiy Mar 13, 2019
cb90658
Merge branch 'ci_feature_prod' into ci_feature
vishiy Mar 13, 2019
19c2bc7
string freeze for out_mdm plugin
vishiy Mar 13, 2019
69935b3
Vishwa/resourcecentric (#208)
vishiy Apr 1, 2019
6953f50
Rashmi/win nodepool - PR (#206)
rashmichandrashekar Apr 1, 2019
ebdd8cc
adding os to container inventory for windows nodes (#210)
rashmichandrashekar Apr 8, 2019
d7b8cff
Fix omsagent crash Error when kube-api returns non-200, send events f…
r-dilip Apr 8, 2019
c9bb623
updating to lowercase compare for units (#212)
rashmichandrashekar Apr 10, 2019
3a88db8
Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214)
vishiy Apr 16, 2019
8cdf724
Fix telemetry error for telegraf err count metric (#215)
vishiy Apr 18, 2019
d2d5f0e
Merge branch 'ci_feature_prod' into ci_feature
vishiy Apr 18, 2019
36c8037
Fix Unscheduled Pod bug, remove excess telemetry (#218)
r-dilip May 31, 2019
803f934
Merge from Vishwa/promstandardmetrics into ci_feature (#220)
vishiy Jun 6, 2019
afc66b7
merge config/settings to ci_feature (#221)
vishiy Jun 6, 2019
727d5bd
Fix Scenario when Controller name is empty (#222)
r-dilip Jun 6, 2019
5e4b0f3
fix ;
vishiy Jun 7, 2019
6fefcac
ContainerLog collection optimizations (#223)
vishiy Jun 8, 2019
f87349e
merge final changes for release from Vishwa/june2019agentrel to ci_f…
vishiy Jun 10, 2019
195f82b
Merge branch 'ci_feature_prod' into ci_feature
vishiy Jun 10, 2019
8a412c1
fix fluent bit tuning for perf run (#226)
vishiy Jun 14, 2019
f613f2a
Merge branch 'ci_feature_prod' into ci_feature
vishiy Jun 14, 2019
e36b5ab
fix merge issue
vishiy Jun 14, 2019
8ba1f86
add release notes for june release in ci_feature branch
rashmichandrashekar Jun 21, 2019
e7e9e6d
fix title
rashmichandrashekar Jun 21, 2019
3903a9d
update
rashmichandrashekar Jun 21, 2019
f5b54fe
fix title
rashmichandrashekar Jun 21, 2019
1d32cec
Trim spaces in AKS_REGION (#233)
r-dilip Jul 5, 2019
5b8c52e
Add Logs Size To Telemetry (#234)
r-dilip Jul 9, 2019
5fc0f1b
Merge Vishwa/promcustommetrics to ci_feature (#237)
rashmichandrashekar Jul 9, 2019
5ab1944
Merge branch 'ci_feature_prod' into ci_feature
rashmichandrashekar Jul 9, 2019
4b8708b
Fix Region space error (#239)
r-dilip Jul 10, 2019
1cd9eee
Removing buffer chunk size and buffer max size from fluentbit conf (…
rashmichandrashekar Jul 10, 2019
e96e20a
Merge branch 'ci_feature_prod' into ci_feature
rashmichandrashekar Jul 10, 2019
788ab8b
changes (#243)
rashmichandrashekar Jul 11, 2019
5ee482b
Collect container last state (#235)
daweim0 Jul 15, 2019
378cc93
Rashmi/fix prom telemetry (#247)
rashmichandrashekar Aug 12, 2019
df60197
Merge Health Model work into ci_feature behind a feature flag Pending…
r-dilip Aug 14, 2019
4adcd8b
Fix Deserialization Bug (#249)
r-dilip Aug 16, 2019
2ee4307
Fix the bug where capacity is not updated and cached value was being …
r-dilip Aug 16, 2019
e86f82f
changes (#250)
rashmichandrashekar Aug 16, 2019
c76ce47
Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253)
r-dilip Aug 16, 2019
10a79c8
Add Missing Handlers (#254)
r-dilip Aug 19, 2019
851ab4e
Return MultiEventStream.new instead of empty array (#256)
r-dilip Aug 21, 2019
f20debb
Added explicit require_relative to avoid loading errors (#258)
r-dilip Aug 23, 2019
a8804df
Gangams/enable ai telemetry in mc (#252)
ganga1980 Aug 28, 2019
8a5ebb0
Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set s…
r-dilip Sep 10, 2019
a939bf7
Changes for creating custom plugins with namespace settings for prome…
rashmichandrashekar Sep 11, 2019
2a07233
Cherry-pick hotfix 09092019 to ci_feature (#265)
r-dilip Sep 12, 2019
2fee9fd
Gangams/add telemetry hybrid (#264)
ganga1980 Sep 23, 2019
5eea104
KubeMonAgentEvents changes to collect configuration events (#267)
rashmichandrashekar Oct 2, 2019
c472b12
Fix the Dupe Perf Data Issue from the DaemonSet (#266)
r-dilip Sep 26, 2019
98e4114
PR for 1. Container Memory CPU monitor 2. Configuration for Node Cond…
r-dilip Oct 3, 2019
382ed02
init containers fix and other bug fixes (#269)
rashmichandrashekar Oct 4, 2019
3079471
Send agg monitor signal on details change (#270)
r-dilip Oct 7, 2019
d16e2b0
resolving conflicts with ci_feature_prod
Oct 7, 2019
de2e1da
bug fixes for error (#274)
rashmichandrashekar Oct 10, 2019
e4b91c5
Fix to use declaration and assignment instead of assignment (#275)
rashmichandrashekar Oct 10, 2019
cf5e85c
1. Added telemetry (#277)
r-dilip Oct 10, 2019
e8529b2
Bug fix to remove unused variable (#281)
rashmichandrashekar Oct 10, 2019
1a41492
Merge branch 'ci_feature_prod' into ci_feature
rashmichandrashekar Oct 10, 2019
8a4147d
Fix the WARN<->WARNING typo (#282)
r-dilip Oct 11, 2019
ceb1a67
Merge branch 'ci_feature_prod' into ci_feature
Oct 11, 2019
4780c3e
Bug Fixes 1. telemetry send throwing exception if records not initia…
r-dilip Oct 14, 2019
a421c97
Merge branch 'ci_feature_prod' into ci_feature
r-dilip Oct 14, 2019
981018c
Fix Require relative revert (#287)
r-dilip Oct 18, 2019
41aca6e
Merge branch 'ci_feature_prod' into ci_feature
Oct 18, 2019
edaa963
Bug Fixes for exceptions in telemetry, remove limit set check (#289)
r-dilip Nov 1, 2019
568b2ed
Merge ci_feature_prod to ci_feature
r-dilip Nov 1, 2019
22bd43d
Fix the bug where if a warning condition appears before fail conditio…
r-dilip Nov 5, 2019
7cd9d76
Merge branch 'ci_feature_prod' into ci_feature
Nov 5, 2019
d1a2fbf
Merge branch 'ci_feature' of https://github.com/Microsoft/Docker-Prov…
Nov 5, 2019
920f101
Merge ci_feature_prod to ci_feature
r-dilip Nov 5, 2019
40f47a9
Fix for Nodes Aspect not showing up in draft cluster (#294)
r-dilip Nov 5, 2019
16055be
Fix the issue where the health tree is inconsistent if a deployment i…
r-dilip Nov 6, 2019
84f4aef
Merge branch 'ci_feature_prod' into ci_feature
Nov 6, 2019
2d861cc
Rashmi/1 16 test (#297)
rashmichandrashekar Nov 12, 2019
844afbd
Fix duplicate records in container memory/cpu samples (#298)
r-dilip Nov 12, 2019
9a8f0f8
Update MDM region list to include francecentral, japaneast and austra…
bragi92 Nov 14, 2019
597b2fb
Update MDM region list to include francecentral, japaneast and austra…
bragi92 Nov 14, 2019
cd1a37b
Send telemetry when there is error in calculation of state in percent…
r-dilip Nov 15, 2019
d6ea189
fix exceptions (#306)
rashmichandrashekar Nov 26, 2019
3df0ab6
Merge Branch morgan into ci_feature (#308)
vishiy Dec 4, 2019
8526802
Update Readme
vishiy Dec 4, 2019
c766d73
add back timeofcommand (#310)
vishiy Dec 4, 2019
81052ed
Merge branch 'ci_feature_prod' into ci_feature
vishiy Dec 4, 2019
8dfa313
update readme for timeofcommand fix (#314)
vishiy Dec 4, 2019
a0984af
Merge from ci_feature_prod into ci_feature (fix put back timeofcomman…
vishiy Dec 4, 2019
53a70cb
Merge branch 'ci_feature_prod' into ci_feature
vishiy Dec 4, 2019
deff7ac
Adding new cpu and memory limits to readme
rashmichandrashekar Dec 7, 2019
bf6b8a4
Merge branch 'ci_feature_prod' into ci_feature
Dec 7, 2019
4b1ef9c
CAdvisor to use 10255/10250 based on env variable (#321)
rashmichandrashekar Jan 7, 2020
6dc93e8
changing font for code change and customer impact
rashmichandrashekar Jan 7, 2020
f7732fb
Merge branch 'ci_feature_prod' into ci_feature
rashmichandrashekar Jan 7, 2020
044f13d
For ARO, stop collecting inventory of master and infra (#323)
ganga1980 Jan 24, 2020
acc1d27
MDM plugin support for large scale clusters (#324)
r-dilip Jan 28, 2020
0ea6c6e
Add Null check for kube api responses in in_kube_health (#325)
r-dilip Jan 29, 2020
843100c
Fix casing bug (#326)
r-dilip Feb 4, 2020
2c32e57
Missed kube.conf update (#327)
r-dilip Feb 7, 2020
b10fee9
changes to use msi if service principal does not exist (#328)
rashmichandrashekar Feb 21, 2020
f820075
Adding caseinsensitive compare (#330)
rashmichandrashekar Feb 24, 2020
03d90de
gpu monitoring (#329)
vishiy Feb 25, 2020
b0fc3ae
Merge branch 'ci_feature_prod' into ci_feature
Feb 25, 2020
7942706
Update release notes
rashmichandrashekar Feb 26, 2020
3bcc81e
Merge branch 'ci_feature_prod' into ci_feature
Feb 26, 2020
d6398bc
MDM batch Bug (#336)
r-dilip Feb 29, 2020
13c6dfc
kube evnts bug fix (#335)
rashmichandrashekar Mar 2, 2020
11f4968
Update readme.md
rashmichandrashekar Mar 2, 2020
2ddd16c
Rate Limiting changes (#338)
vishiy Mar 19, 2020
f9087d4
Gangams/add support cri runtime docker env (#337)
ganga1980 Mar 24, 2020
dc42bfa
Gangams/fix cri exceptions (#339)
ganga1980 Apr 8, 2020
0c265a4
fix log exception (#340)
ganga1980 Apr 8, 2020
b0e652e
Updating release notes, mdm bug fix (#333) (#343)
ganga1980 Apr 17, 2020
19 changes: 19 additions & 0 deletions README.md
@@ -11,6 +11,25 @@ additional questions or comments.

Note : The agent version(s) below has dates (ciprod<mmddyyyy>), which indicate the agent build dates (not release dates)

+### 03/02/2020 -
+##### Version microsoft/oms:ciprod03022020 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod03022020
+##### Code change log
+- Collection of GPU metrics as InsightsMetrics
+- Enable config map settings to enable collection of 'Normal' kube events
+- Fix kubehealth exceptions to handle empty/nil kube api responses
+- Get resource limits for health and MDM from kubelet instead of kube api
+- Bug fix for windows node image collection where image name contains multiple slashes
+- Exclude ARO master node for data collection
+- Telemetry for kube events flushed
+- Changes to support msi for mdm if service principal doesn't exist
+- Changes for AKS telemetry to ping ods endpoint first and then network check
+- KubeEvents bug fix for KubeEvent type
+
+##### Customer Impact
+- Providing capability for customers to collect 'Normal' kube events using config map
+- Metrics for GPU are collected and ingested to customers' workspace if they have GPU enabled nodes
+- Bug fix for windows container image collection allows customers to get the right data in the ContainerInventory table for windows containers.
+
### 01/07/2020 -
##### Version microsoft/oms:ciprod01072020 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod01072020
##### Code change log
19 changes: 19 additions & 0 deletions installer/conf/azm-containers-parser.conf
@@ -0,0 +1,19 @@
+[PARSER]
+Name docker
+Format json
+Time_Key time
+Time_Format %Y-%m-%dT%H:%M:%S.%L
+Time_Keep On
+# Command | Decoder | Field | Optional Action
+# =============|==================|=================
+Decode_Field_As escaped log
+
+[PARSER]
+# http://rubular.com/r/tjUt3Awgg4
+Name cri
+Format regex
+Regex ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<log>.*)$
+Time_Key time
+Time_Format %Y-%m-%dT%H:%M:%S.%L
+Time_Keep On
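
A rough illustration of what the `cri` parser regex above pulls out of a CRI-format log line, written in Go to match the plugin code in this PR. This is only a sketch under stated assumptions: `parseCRI` and the sample line are made up for illustration, and because Go's `regexp` package spells named groups `(?P<name>...)`, the pattern is adapted from the Ruby-style original rather than copied verbatim.

```go
package main

import (
	"fmt"
	"regexp"
)

// criLine mirrors the regex of the cri parser in azm-containers-parser.conf.
// Only the named-group syntax is changed to Go's (?P<name>...) form.
var criLine = regexp.MustCompile(`^(?P<time>[^ ]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) (?P<log>.*)$`)

// parseCRI (hypothetical helper) returns the named fields of one log line,
// or false when the line does not have the expected shape.
func parseCRI(line string) (map[string]string, bool) {
	m := criLine.FindStringSubmatch(line)
	if m == nil {
		return nil, false
	}
	fields := make(map[string]string)
	for i, name := range criLine.SubexpNames() {
		if name != "" { // index 0 is the whole match and has no name
			fields[name] = m[i]
		}
	}
	return fields, true
}

func main() {
	// Sample line invented for this sketch: timestamp, stream, tag, message.
	fields, ok := parseCRI("2020-03-02T10:15:30.123456789Z stdout F hello world")
	fmt.Println(ok, fields["time"], fields["stream"], fields["logtag"], fields["log"])
}
```

The `time` group is what `Time_Key time` points at, and `log` is the field the output plugin later reads as the log entry.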

2 changes: 1 addition & 1 deletion installer/conf/td-agent-bit-rs.conf
@@ -1,7 +1,7 @@
[SERVICE]
Flush 30
Log_Level info
-Parsers_File /etc/td-agent-bit/parsers.conf
+Parsers_File /etc/opt/microsoft/docker-cimprov/azm-containers-parser.conf
Log_File /var/opt/microsoft/docker-cimprov/log/fluent-bit.log

[INPUT]
2 changes: 1 addition & 1 deletion installer/conf/td-agent-bit.conf
@@ -2,7 +2,7 @@
#Default service flush interval is 15 seconds
${SERVICE_FLUSH_INTERVAL}
Log_Level info
-Parsers_File /etc/td-agent-bit/parsers.conf
+Parsers_File /etc/opt/microsoft/docker-cimprov/azm-containers-parser.conf
Log_File /var/opt/microsoft/docker-cimprov/log/fluent-bit.log

[INPUT]
2 changes: 1 addition & 1 deletion installer/conf/telegraf-rs.conf
@@ -56,7 +56,7 @@

## Default flushing interval for all outputs. You shouldn't set this below
## interval. Maximum flush_interval will be flush_interval + flush_jitter
-flush_interval = "60s"
+flush_interval = "15s"
## Jitter the flush interval by a random amount. This is primarily to avoid
## large write spikes for users running a large number of telegraf instances.
## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
11 changes: 6 additions & 5 deletions installer/conf/telegraf.conf
@@ -56,7 +56,7 @@

## Default flushing interval for all outputs. You shouldn't set this below
## interval. Maximum flush_interval will be flush_interval + flush_jitter
-flush_interval = "60s"
+flush_interval = "15s"
## Jitter the flush interval by a random amount. This is primarily to avoid
## large write spikes for users running a large number of telegraf instances.
## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
@@ -386,7 +386,7 @@
# report_active = true
# fieldpass = ["usage_active","cluster","node","host","device"]
# taginclude = ["cluster","cpu","node"]



# Read metrics about disk usage by mount point
@@ -395,7 +395,7 @@
## By default stats will be gathered for all mount points.
## Set mount_points will restrict the stats to only the specified mount points.
# mount_points = ["/"]

## Ignore mount points by filesystem type.
ignore_fs = ["tmpfs", "devtmpfs", "devfs", "overlay", "aufs", "squashfs"]
fieldpass = ["free", "used", "used_percent"]
@@ -532,7 +532,8 @@
name_prefix="container.azm.ms/"
## An array of urls to scrape metrics from.
urls = ["$CADVISOR_METRICS_URL"]
-fieldpass = ["kubelet_docker_operations", "kubelet_docker_operations_errors"]
+## Include "$KUBELET_RUNTIME_OPERATIONS_TOTAL_METRIC", "$KUBELET_RUNTIME_OPERATIONS_ERRORS_TOTAL_METRIC" when we add support for 1.18
+fieldpass = ["$KUBELET_RUNTIME_OPERATIONS_METRIC", "$KUBELET_RUNTIME_OPERATIONS_ERRORS_METRIC"]

metric_version = 2
url_tag = "scrapeUrl"
@@ -578,7 +579,7 @@
urls = $AZMON_DS_PROM_URLS

fieldpass = $AZMON_DS_PROM_FIELDPASS

fielddrop = $AZMON_DS_PROM_FIELDDROP

metric_version = 2
3 changes: 3 additions & 0 deletions installer/datafiles/base_container.data
@@ -45,6 +45,8 @@ MAINTAINER: 'Microsoft Corporation'
/opt/microsoft/omsagent/plugin/DockerApiClient.rb; source/code/plugin/DockerApiClient.rb; 644; root; root
/opt/microsoft/omsagent/plugin/DockerApiRestHelper.rb; source/code/plugin/DockerApiRestHelper.rb; 644; root; root
/opt/microsoft/omsagent/plugin/in_containerinventory.rb; source/code/plugin/in_containerinventory.rb; 644; root; root
+/opt/microsoft/omsagent/plugin/kubernetes_container_inventory.rb; source/code/plugin/kubernetes_container_inventory.rb; 644; root; root
+

/opt/microsoft/omsagent/plugin/out_mdm.rb; source/code/plugin/out_mdm.rb; 644; root; root
/opt/microsoft/omsagent/plugin/filter_cadvisor2mdm.rb; source/code/plugin/filter_cadvisor2mdm.rb; 644; root; root
@@ -106,6 +108,7 @@ MAINTAINER: 'Microsoft Corporation'
/opt/td-agent-bit/bin/out_oms.so; intermediate/${{BUILD_CONFIGURATION}}/out_oms.so; 755; root; root
/etc/opt/microsoft/docker-cimprov/td-agent-bit.conf; installer/conf/td-agent-bit.conf; 644; root; root
/etc/opt/microsoft/docker-cimprov/td-agent-bit-rs.conf; installer/conf/td-agent-bit-rs.conf; 644; root; root
+/etc/opt/microsoft/docker-cimprov/azm-containers-parser.conf; installer/conf/azm-containers-parser.conf; 644; root; root
/etc/opt/microsoft/docker-cimprov/out_oms.conf; installer/conf/out_oms.conf; 644; root; root
/etc/opt/microsoft/docker-cimprov/telegraf.conf; installer/conf/telegraf.conf; 644; root; root
/etc/opt/microsoft/docker-cimprov/telegraf-rs.conf; installer/conf/telegraf-rs.conf; 644; root; root
11 changes: 9 additions & 2 deletions source/code/go/src/plugins/glide.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 2 additions & 0 deletions source/code/go/src/plugins/glide.yaml
@@ -3,6 +3,8 @@ import:
- package: github.com/fluent/fluent-bit-go
subpackages:
- output
+- package: github.com/google/uuid
+version: ^1.1.0
- package: gopkg.in/natefinch/lumberjack.v2
version: ^2.1.0
- package: k8s.io/apimachinery
63 changes: 52 additions & 11 deletions source/code/go/src/plugins/oms.go
@@ -14,6 +14,7 @@ import (
"time"

"github.com/fluent/fluent-bit-go/output"
+"github.com/google/uuid"

lumberjack "gopkg.in/natefinch/lumberjack.v2"

@@ -37,6 +38,9 @@ const ResourceIdEnv = "AKS_RESOURCE_ID"
//env variable which has ResourceName for NON-AKS
const ResourceNameEnv = "ACS_RESOURCE_NAME"

+//env variable which has container run time name
+const ContainerRuntimeEnv = "CONTAINER_RUNTIME"
+
// Origin prefix for telegraf Metrics (used as prefix for origin field & prefix for azure monitor specific tags and also for custom-metrics telemetry )
const TelegrafMetricOriginPrefix = "container.azm.ms"

@@ -93,7 +97,9 @@ var (
//KubeMonAgentEvents skip first flush
skipKubeMonEventsFlush bool
// enrich container logs (when true this will add the fields - timeofcommand, containername & containerimage)
-enrichContainerLogs bool
+enrichContainerLogs bool
+// container runtime engine configured on the kubelet
+containerRuntime string
)

var (
@@ -133,6 +139,12 @@ var (
Log = FLBLogger.Printf
)

+var (
+dockerCimprovVersion = "9.0.0.0"
+agentName = "ContainerAgent"
+userAgent = ""
+)
+
// DataItem represents the object corresponding to the json that is sent by fluentbit tail plugin
type DataItem struct {
LogEntry string `json:"LogEntry"`
@@ -513,6 +525,9 @@ func flushKubeMonAgentEventRecords() {
} else {
req, _ := http.NewRequest("POST", OMSEndpoint, bytes.NewBuffer(marshalled))
req.Header.Set("Content-Type", "application/json")
+req.Header.Set("User-Agent", userAgent )
+reqId := uuid.New().String()
+req.Header.Set("X-Request-ID", reqId)
//expensive to do string len for every request, so use a flag
if ResourceCentric == true {
req.Header.Set("x-ms-AzureResourceId", ResourceID)
@@ -527,7 +542,7 @@ func flushKubeMonAgentEventRecords() {
Log("Failed to flush %d records after %s", len(laKubeMonAgentEventsRecords), elapsed)
} else if resp == nil || resp.StatusCode != 200 {
if resp != nil {
-Log("Status %s Status Code %d", resp.Status, resp.StatusCode)
+Log(" RequestId %s Status %s Status Code %d", reqId, resp.Status, resp.StatusCode)
}
Log("Failed to flush %d records after %s", len(laKubeMonAgentEventsRecords), elapsed)
} else {
@@ -656,6 +671,9 @@ func PostTelegrafMetricsToLA(telegrafRecords []map[interface{}]interface{}) int

//set headers
req.Header.Set("x-ms-date", time.Now().Format(time.RFC3339))
+req.Header.Set("User-Agent", userAgent )
+reqId := uuid.New().String()
+req.Header.Set("X-Request-ID", reqId)

//expensive to do string len for every request, so use a flag
if ResourceCentric == true {
@@ -669,31 +687,34 @@ func PostTelegrafMetricsToLA(telegrafRecords []map[interface{}]interface{}) int
if err != nil {
message := fmt.Sprintf("PostTelegrafMetricsToLA::Error:(retriable) when sending %v metrics. duration:%v err:%q \n", len(laMetrics), elapsed, err.Error())
Log(message)
-UpdateNumTelegrafMetricsSentTelemetry(0, 1)
+UpdateNumTelegrafMetricsSentTelemetry(0, 1, 0)
return output.FLB_RETRY
}

if resp == nil || resp.StatusCode != 200 {
if resp != nil {
-Log("PostTelegrafMetricsToLA::Error:(retriable) Response Status %v Status Code %v", resp.Status, resp.StatusCode)
+Log("PostTelegrafMetricsToLA::Error:(retriable) RequestID %s Response Status %v Status Code %v", reqId, resp.Status, resp.StatusCode)
}
+if resp != nil && resp.StatusCode == 429 {
+UpdateNumTelegrafMetricsSentTelemetry(0, 1, 1)
+}
-UpdateNumTelegrafMetricsSentTelemetry(0, 1)
return output.FLB_RETRY
}

defer resp.Body.Close()

numMetrics := len(laMetrics)
-UpdateNumTelegrafMetricsSentTelemetry(numMetrics, 0)
+UpdateNumTelegrafMetricsSentTelemetry(numMetrics, 0, 0)
Log("PostTelegrafMetricsToLA::Info:Successfully flushed %v records in %v", numMetrics, elapsed)

return output.FLB_OK
}

-func UpdateNumTelegrafMetricsSentTelemetry(numMetricsSent int, numSendErrors int) {
+func UpdateNumTelegrafMetricsSentTelemetry(numMetricsSent int, numSendErrors int, numSend429Errors int) {
ContainerLogTelemetryMutex.Lock()
TelegrafMetricsSentCount += float64(numMetricsSent)
TelegrafMetricsSendErrorCount += float64(numSendErrors)
+TelegrafMetricsSend429ErrorCount += float64(numSend429Errors)
ContainerLogTelemetryMutex.Unlock()
}

@@ -734,9 +755,11 @@ func PostDataHelper(tailPluginRecords []map[interface{}]interface{}) int {

stringMap := make(map[string]string)

-stringMap["LogEntry"] = ToString(record["log"])
+logEntry := ToString(record["log"])
+logEntryTimeStamp := ToString(record["time"])
+stringMap["LogEntry"] = logEntry
stringMap["LogEntrySource"] = logEntrySource
-stringMap["LogEntryTimeStamp"] = ToString(record["time"])
+stringMap["LogEntryTimeStamp"] = logEntryTimeStamp
stringMap["SourceSystem"] = "Containers"
stringMap["Id"] = containerID

@@ -781,7 +804,7 @@ func PostDataHelper(tailPluginRecords []map[interface{}]interface{}) int {
if dataItem.LogEntryTimeStamp != "" {
loggedTime, e := time.Parse(time.RFC3339, dataItem.LogEntryTimeStamp)
if e != nil {
-message := fmt.Sprintf("Error while converting LogEntryTimeStamp for telemetry purposes: %s", e.Error())
+message := fmt.Sprintf("containerId: %s Error while converting LogEntryTimeStamp for telemetry purposes: %s", dataItem.ID, e.Error())
Log(message)
SendException(message)
} else {
@@ -810,6 +833,9 @@ func PostDataHelper(tailPluginRecords []map[interface{}]interface{}) int {

req, _ := http.NewRequest("POST", OMSEndpoint, bytes.NewBuffer(marshalled))
req.Header.Set("Content-Type", "application/json")
+req.Header.Set("User-Agent", userAgent )
+reqId := uuid.New().String()
+req.Header.Set("X-Request-ID", reqId)
//expensive to do string len for every request, so use a flag
if ResourceCentric == true {
req.Header.Set("x-ms-AzureResourceId", ResourceID)
@@ -830,7 +856,7 @@ func PostDataHelper(tailPluginRecords []map[interface{}]interface{}) int {

if resp == nil || resp.StatusCode != 200 {
if resp != nil {
-Log("Status %s Status Code %d", resp.Status, resp.StatusCode)
+Log("RequestId %s Status %s Status Code %d", reqId, resp.Status, resp.StatusCode)
}
return output.FLB_RETRY
}
@@ -958,6 +984,21 @@ func InitializePlugin(pluginConfPath string, agentVersion string) {
Log("ResourceID=%s", ResourceID)
Log("ResourceName=%s", ResourceName)
}

+// log runtime info for debug purpose
+containerRuntime = os.Getenv(ContainerRuntimeEnv)
+Log("Container Runtime engine %s", containerRuntime)
+
+
+//set useragent to be used by ingestion
+docker_cimprov_version := strings.TrimSpace(os.Getenv("DOCKER_CIMPROV_VERSION"))
+if len(docker_cimprov_version) > 0 {
+dockerCimprovVersion = docker_cimprov_version
+}
+
+userAgent = fmt.Sprintf("%s/%s", agentName, dockerCimprovVersion)
+
+Log("User-Agent = %s \n", userAgent)
+
// Initialize image,name map refresh ticker
containerInventoryRefreshInterval, err := strconv.Atoi(pluginConfig["container_inventory_refresh_interval"])
12 changes: 11 additions & 1 deletion source/code/go/src/plugins/telemetry.go
@@ -34,6 +34,8 @@ var (
TelegrafMetricsSentCount float64
//Tracks the number of send errors between telemetry ticker periods (uses ContainerLogTelemetryTicker)
TelegrafMetricsSendErrorCount float64
+//Tracks the number of 429 (throttle) errors between telemetry ticker periods (uses ContainerLogTelemetryTicker)
+TelegrafMetricsSend429ErrorCount float64
)

const (
@@ -49,6 +51,7 @@ const (
metricNameAgentLogProcessingMaxLatencyMs = "ContainerLogsAgentSideLatencyMs"
metricNameNumberofTelegrafMetricsSentSuccessfully = "TelegrafMetricsSentCount"
metricNameNumberofSendErrorsTelegrafMetrics = "TelegrafMetricsSendErrorCount"
+metricNameNumberofSend429ErrorsTelegrafMetrics = "TelegrafMetricsSend429ErrorCount"

defaultTelemetryPushIntervalSeconds = 300

@@ -78,8 +81,10 @@ func SendContainerLogPluginMetrics(telemetryPushIntervalProperty string) {
logSizeRate := FlushedRecordsSize / float64(elapsed/time.Second)
telegrafMetricsSentCount := TelegrafMetricsSentCount
telegrafMetricsSendErrorCount := TelegrafMetricsSendErrorCount
+telegrafMetricsSend429ErrorCount := TelegrafMetricsSend429ErrorCount
TelegrafMetricsSentCount = 0.0
TelegrafMetricsSendErrorCount = 0.0
+TelegrafMetricsSend429ErrorCount = 0.0
FlushedRecordsCount = 0.0
FlushedRecordsSize = 0.0
FlushedRecordsTimeTaken = 0.0
@@ -103,7 +108,12 @@ func SendContainerLogPluginMetrics(telemetryPushIntervalProperty string) {
TelemetryClient.Track(logLatencyMetric)
}
TelemetryClient.Track(appinsights.NewMetricTelemetry(metricNameNumberofTelegrafMetricsSentSuccessfully, telegrafMetricsSentCount))
-TelemetryClient.Track(appinsights.NewMetricTelemetry(metricNameNumberofSendErrorsTelegrafMetrics, telegrafMetricsSendErrorCount))
+if telegrafMetricsSendErrorCount > 0.0 {
+TelemetryClient.Track(appinsights.NewMetricTelemetry(metricNameNumberofSendErrorsTelegrafMetrics, telegrafMetricsSendErrorCount))
+}
+if telegrafMetricsSend429ErrorCount > 0.0 {
+TelemetryClient.Track(appinsights.NewMetricTelemetry(metricNameNumberofSend429ErrorsTelegrafMetrics, telegrafMetricsSend429ErrorCount))
+}
start = time.Now()
}
}
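
The telemetry change boils down to three counters that are incremented under a mutex, snapshotted and zeroed once per ticker period, and reported only when non-zero in the case of the two error counters. The sketch below shows that flow in isolation. `sendCounters` and `metricsToReport` are illustrative names, not part of the plugin, and it assumes the snapshot is taken under the same lock as the updates, which the hunks above do not show.

```go
package main

import (
	"fmt"
	"sync"
)

// sendCounters (hypothetical) groups the three per-period counters.
type sendCounters struct {
	mu        sync.Mutex
	sent      float64
	errors    float64
	throttled float64 // HTTP 429 responses
}

// update adds to all three counters, like the three-argument
// UpdateNumTelegrafMetricsSentTelemetry in the diff.
func (c *sendCounters) update(sent, errs, throttled int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.sent += float64(sent)
	c.errors += float64(errs)
	c.throttled += float64(throttled)
}

// snapshotAndReset returns the current values and zeroes them for the next period.
func (c *sendCounters) snapshotAndReset() (sent, errs, throttled float64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	sent, errs, throttled = c.sent, c.errors, c.throttled
	c.sent, c.errors, c.throttled = 0, 0, 0
	return
}

// metricsToReport always includes the sent count; the error counters are
// included only when they are greater than zero.
func metricsToReport(sent, errs, throttled float64) map[string]float64 {
	out := map[string]float64{"TelegrafMetricsSentCount": sent}
	if errs > 0 {
		out["TelegrafMetricsSendErrorCount"] = errs
	}
	if throttled > 0 {
		out["TelegrafMetricsSend429ErrorCount"] = throttled
	}
	return out
}

func main() {
	var c sendCounters
	c.update(10, 0, 0) // one successful flush of 10 metrics
	c.update(0, 1, 1)  // one throttled attempt counts as an error and a 429
	fmt.Println(metricsToReport(c.snapshotAndReset()))
}
```

Skipping zero-valued error metrics keeps healthy agents from emitting two extra telemetry items every period.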