Merged
116 commits
3c5b46d
Updatng release history
vishiy Aug 1, 2018
d31f588
fixing the plugin logs for emit stream
Aug 1, 2018
11fd5f6
updating log message
Aug 5, 2018
87a9cf8
Remove Log Processing from fluentd configuration
r-dilip Aug 16, 2018
308be41
Remove plugin references from base_container.data
r-dilip Aug 16, 2018
5bee0af
Merge pull request #124 from Microsoft/dilipr/fluentdConfigUpdates
r-dilip Aug 30, 2018
bcd1a3f
Dilipr/fluent bit log processing (#126)
r-dilip Sep 14, 2018
b02f2ec
Dilipr/glide updates (#127)
r-dilip Sep 14, 2018
e01c678
containerID="" for pull issues
vishiy Sep 17, 2018
b0ba22d
Using KubeAPI for getting image,name. Adding more logs (#129)
r-dilip Sep 18, 2018
9783419
Dilipr/mark comments (#130)
r-dilip Sep 27, 2018
8e35b73
Rashmi/segfault latest (#132)
rashmichandrashekar Sep 27, 2018
4b63021
Adding a missed null check (#135)
rashmichandrashekar Sep 27, 2018
8b964fd
reusing some variables (#136)
rashmichandrashekar Sep 28, 2018
938c2ed
Rashmi/cjson delete null check (#138)
rashmichandrashekar Sep 28, 2018
fbfdf11
updating log level to debug for some provider workflows (#139)
rashmichandrashekar Oct 3, 2018
d426066
Fixing CPU Utilization and removing Fluent-bit filters (#140)
r-dilip Oct 4, 2018
c2cabab
Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. C…
r-dilip Oct 9, 2018
32567db
* Change FluentBit flush interval to 30 secs (from 5 secs)
vishiy Oct 10, 2018
afc981d
Container Log Telemetry
r-dilip Oct 12, 2018
4b958dd
Fixing an issue with Send Init Event if Telemetry is not initialized …
r-dilip Oct 12, 2018
510ef9f
PR feedback
r-dilip Oct 12, 2018
684c39b
PR feedback
r-dilip Oct 12, 2018
e165275
Sending an event every 5 mins(Heartbeat) (#146)
r-dilip Oct 15, 2018
eecb5db
Merge branch 'ci_feature_prod' into ci_feature
vishiy Oct 16, 2018
cfe1ca9
PR feedback to cleanup removed workflows
vishiy Oct 16, 2018
892b51c
updating agent version for telemetry
vishiy Oct 16, 2018
9c83160
updating agent version
vishiy Oct 17, 2018
f0b5a61
Telemetry Updates (#149)
r-dilip Oct 25, 2018
a58998e
Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159)
r-dilip Oct 30, 2018
4c2da9f
Rashmi/fluentd docker inventory (#160)
rashmichandrashekar Nov 5, 2018
6698fcd
Fix Telemetry Bug -- Initialize Telemetry Client after Initializing a…
r-dilip Nov 8, 2018
ad6bb93
Fix kube events memory leak due to yaml serialization for > 5k events…
vishiy Nov 12, 2018
eff92df
Setting Timeout for HTTP Client in PostDataHelper in outoms go plugi…
r-dilip Nov 14, 2018
9893e36
Vishwa/perftelemetry 2 (#165)
vishiy Nov 16, 2018
4f3c898
environment variable fix (#166)
rashmichandrashekar Nov 27, 2018
5e16467
Fixing a bug where we were crashing due to container statuses not pre…
vishiy Nov 27, 2018
b482b1e
Updating title
vishiy Nov 29, 2018
d75ba89
updating right versions for last release
vishiy Nov 29, 2018
cbd815c
Updating the break condition to look for end of response (#168)
rashmichandrashekar Nov 29, 2018
d0d5bf7
updating AgentVersion for telemetry
vishiy Nov 29, 2018
bfe27e5
Updating readme for latest release changes
vishiy Nov 29, 2018
5677560
Merge branch 'ci_feature_prod' into ci_feature
vishiy Nov 29, 2018
a621f88
Changes - (#173)
vishiy Dec 17, 2018
c9cf4fd
Rashmi/kubenodeinventory (#174)
rashmichandrashekar Dec 17, 2018
df6f122
Get cpuusage from usageseconds (#175)
vishiy Dec 20, 2018
dac9931
Rashmi/kubenodeinventory (#176)
rashmichandrashekar Dec 21, 2018
04cc1a8
Rashmi/kubenodeinventory (#178)
rashmichandrashekar Dec 26, 2018
5883f53
Fixing an issue on the cpurate metric, which happens for the first ti…
vishiy Dec 26, 2018
191f328
Rashmi/kubenodeinventory (#180)
rashmichandrashekar Dec 28, 2018
7e52e8c
Exclude docker containers from container inventory (#181)
rashmichandrashekar Jan 7, 2019
f0591f9
Exclude pauseamd64 containers from container inventory (#182)
rashmichandrashekar Jan 8, 2019
99e8813
Merge branch 'ci_feature_prod' into ci_feature
vishiy Jan 9, 2019
4782435
Update agent version
vishiy Jan 9, 2019
23bcc41
Updating readme for the latest release
vishiy Jan 9, 2019
51d5e93
Fix indentation in kube.conf and update readme (#184)
rashmichandrashekar Jan 11, 2019
decf86a
updating agent tag
rashmichandrashekar Jan 11, 2019
a1b35db
Get Pods for current Node Only (#185)
r-dilip Jan 29, 2019
22649ba
changes for container node inventory fixed type (#186)
rashmichandrashekar Jan 30, 2019
61e2eaf
Fix for mooncake (disable telemetry optionally) (#191)
vishiy Feb 13, 2019
30dff41
CustomMetrics to ci_feature (#193)
r-dilip Feb 15, 2019
f1b0cd2
add ContainerNotRunning column to KubePodInventory
bragi92 Jan 24, 2019
616a803
merge pr feedback: update name to ContainerStatusReason
bragi92 Jan 24, 2019
c33ca34
Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kuber…
r-dilip Feb 19, 2019
2651750
No Retries for non 404 4xx errors (#196)
r-dilip Feb 20, 2019
195bc33
Update agent version for telemetry
vishiy Feb 20, 2019
59d6c61
Update readme for upcoming (ciprod01202019) release
vishiy Feb 20, 2019
0189bc0
fix readme formatting
vishiy Feb 20, 2019
8221d2d
fix formatting for readme
vishiy Feb 20, 2019
30aa305
fix formatting for readme
vishiy Feb 20, 2019
f401116
fix readme
vishiy Feb 20, 2019
a2f45af
fix readme
vishiy Feb 21, 2019
759dbb5
fix agent version for telemetry
vishiy Feb 21, 2019
8bff5f9
Merge branch 'ci_feature_prod' into ci_feature
vishiy Feb 21, 2019
7956f40
fix date in readme
vishiy Feb 21, 2019
ee05656
update readme
vishiy Feb 21, 2019
2abcf67
Restart logs every 10MB instead of weekly (#198)
r-dilip Feb 21, 2019
18c107c
update agent version for telemetry
vishiy Feb 21, 2019
14b2b87
update readme
vishiy Feb 21, 2019
a1b551f
Merge branch 'ci_feature_prod' into ci_feature
vishiy Feb 21, 2019
5479dff
Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path
rashmichandrashekar Feb 22, 2019
cdded2e
Fix AKSEngine Crash (#200)
r-dilip Mar 4, 2019
57be1c4
hotfix
vishiy Mar 13, 2019
940a6eb
fix readme for new version
vishiy Mar 13, 2019
154fe56
Merge branch 'ci_feature_prod' into ci_feature
vishiy Mar 13, 2019
4115824
Fix the pod count in mdm agent plugin (#203)
r-dilip Mar 13, 2019
df2e64c
Update readme
vishiy Mar 13, 2019
cb90658
Merge branch 'ci_feature_prod' into ci_feature
vishiy Mar 13, 2019
19c2bc7
string freeze for out_mdm plugin
vishiy Mar 13, 2019
69935b3
Vishwa/resourcecentric (#208)
vishiy Apr 1, 2019
6953f50
Rashmi/win nodepool - PR (#206)
rashmichandrashekar Apr 1, 2019
ebdd8cc
adding os to container inventory for windows nodes (#210)
rashmichandrashekar Apr 8, 2019
d7b8cff
Fix omsagent crash Error when kube-api returns non-200, send events f…
r-dilip Apr 8, 2019
c9bb623
updating to lowercase compare for units (#212)
rashmichandrashekar Apr 10, 2019
3a88db8
Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214)
vishiy Apr 16, 2019
8cdf724
Fix telemetry error for telegraf err count metric (#215)
vishiy Apr 18, 2019
d2d5f0e
Merge branch 'ci_feature_prod' into ci_feature
vishiy Apr 18, 2019
36c8037
Fix Unscheduled Pod bug, remove excess telemetry (#218)
r-dilip May 31, 2019
803f934
Merge from Vishwa/promstandardmetrics into ci_feature (#220)
vishiy Jun 6, 2019
afc66b7
merge config/settings to ci_feature (#221)
vishiy Jun 6, 2019
727d5bd
Fix Scenario when Controller name is empty (#222)
r-dilip Jun 6, 2019
5e4b0f3
fix ;
vishiy Jun 7, 2019
6fefcac
ContainerLog collection optimizations (#223)
vishiy Jun 8, 2019
f87349e
merge final changes for release from Vishwa/june2019agentrel to ci_f…
vishiy Jun 10, 2019
195f82b
Merge branch 'ci_feature_prod' into ci_feature
vishiy Jun 10, 2019
8a412c1
fix fluent bit tuning for perf run (#226)
vishiy Jun 14, 2019
f613f2a
Merge branch 'ci_feature_prod' into ci_feature
vishiy Jun 14, 2019
e36b5ab
fix merge issue
vishiy Jun 14, 2019
8ba1f86
add release notes for june release in ci_feature branch
rashmichandrashekar Jun 21, 2019
e7e9e6d
fix title
rashmichandrashekar Jun 21, 2019
3903a9d
update
rashmichandrashekar Jun 21, 2019
f5b54fe
fix title
rashmichandrashekar Jun 21, 2019
1d32cec
Trim spaces in AKS_REGION (#233)
r-dilip Jul 5, 2019
5b8c52e
Add Logs Size To Telemetry (#234)
r-dilip Jul 9, 2019
5fc0f1b
Merge Vishwa/promcustommetrics to ci_feature (#237)
rashmichandrashekar Jul 9, 2019
5ab1944
Merge branch 'ci_feature_prod' into ci_feature
rashmichandrashekar Jul 9, 2019
2 changes: 1 addition & 1 deletion README.md
@@ -31,7 +31,7 @@ Note : The agent version(s) below has dates (ciprod<mmddyyyy>), which indicate t
* Replica set memory request by 75M (100M to 175M)
* Daemonset CPU request by 25m (50m to 75m)
- Will be pushing image only to MCR ( no more Docker) starting this release. AKS-engine will also start to pull our agent image from MCR

### 04/23/2019 -
##### Version microsoft/oms:ciprod043232019 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod04232019
- Windows node monitoring (metrics & inventory)
10 changes: 0 additions & 10 deletions installer/conf/td-agent-bit-rs.conf
@@ -4,16 +4,6 @@
Parsers_File /etc/td-agent-bit/parsers.conf
Log_File /var/opt/microsoft/docker-cimprov/log/fluent-bit.log

[INPUT]
Name tail
Tag oms.container.log.telegraf.err.*
Path /var/opt/microsoft/docker-cimprov/log/telegraf.log
DB /var/opt/microsoft/docker-cimprov/state/telegraf-log-state.db
Mem_Buf_Limit 2m
Path_Key filepath
Skip_Long_Lines On
Ignore_Older 5m

[INPUT]
Name tcp
Tag oms.container.perf.telegraf.*
22 changes: 9 additions & 13 deletions installer/conf/td-agent-bit.conf
@@ -6,7 +6,7 @@

[INPUT]
Name tail
Tag oms.container.log.*
Tag oms.container.log.la.*
Path ${AZMON_LOG_TAIL_PATH}
DB /var/log/omsagent-fblogs.db
DB.Sync Off
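Fluent Bit decides which filters and outputs see a record by comparing the record's tag with each section's `Match` pattern, where `*` is a wildcard; for `tail` inputs, the `*` in `Tag` is expanded from the monitored file's path. Renaming the container-log tag to `oms.container.log.la.*` (and narrowing the grep filter's `Match` the same way further down) appears to keep the user-configurable exclusion filter from also applying to other tags under `oms.container.log.`, such as `oms.container.log.flbplugin.*`. A minimal sketch, using Python's `fnmatch.fnmatchcase` as a stand-in for Fluent Bit's wildcard matching; the helper name and sample tags are hypothetical:

```python
from fnmatch import fnmatchcase


def filter_matches(match_pattern: str, tag: str) -> bool:
    """Approximate Fluent Bit's `Match` check: '*' matches any run of characters.

    fnmatchcase also treats '?' and '[...]' specially, which does not matter
    for the dotted tags used here.
    """
    return fnmatchcase(tag, match_pattern)


OLD_MATCH = "oms.container.log.*"
NEW_MATCH = "oms.container.log.la.*"

# Hypothetical tags: a tailed container log under the new tag, and a plugin log.
container_tag = "oms.container.log.la.var.log.containers.app.log"
plugin_tag = "oms.container.log.flbplugin.sample"

for tag in (container_tag, plugin_tag):
    print(tag, "old:", filter_matches(OLD_MATCH, tag), "new:", filter_matches(NEW_MATCH, tag))
```

Under the old pattern both tags reach the exclusion filter; under the new one only the container log does.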
@@ -32,17 +32,6 @@
Skip_Long_Lines On
Ignore_Older 2m

[INPUT]
Name tail
Tag oms.container.log.telegraf.err.*
Path /var/opt/microsoft/docker-cimprov/log/telegraf.log
DB /var/opt/microsoft/docker-cimprov/state/telegraf-log-state.db
DB.Sync Off
Mem_Buf_Limit 1m
Path_Key filepath
Skip_Long_Lines On
Ignore_Older 2m

[INPUT]
Name tcp
Tag oms.container.perf.telegraf.*
@@ -53,9 +42,16 @@

[FILTER]
Name grep
Match oms.container.log.*
Match oms.container.log.la.*
Exclude stream ${AZMON_LOG_EXCLUSION_REGEX_PATTERN}

# Exclude prometheus plugin errors that may be caused by an invalid config (logs that contain "E! [inputs.prometheus]").
# These logs are not sent to AI, because an invalid config can produce a high volume of telemetry data.
[FILTER]
Name grep
Match oms.container.log.flbplugin.*
Exclude log E! [\[]inputs.prometheus[\]]

[OUTPUT]
Name oms
EnableTelemetry true
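The new grep filter drops any `oms.container.log.flbplugin.*` record whose `log` field matches `E! [\[]inputs.prometheus[\]]`. In that pattern, `[\[]` and `[\]]` are one-character classes matching a literal `[` and `]`, and the unescaped `.` matches any character. A rough Python model of the `Exclude` rule, assuming an unanchored search; the function name and sample log lines are made up for illustration:

```python
import re

# Pattern copied from the grep filter. "[\[]" and "[\]]" match a literal
# "[" and "]"; the unescaped "." matches any single character.
PROM_PLUGIN_ERROR = re.compile(r"E! [\[]inputs.prometheus[\]]")


def is_excluded(record: dict) -> bool:
    """Model of `Exclude log <regex>`: drop the record if its `log` value
    matches the pattern anywhere in the string."""
    return PROM_PLUGIN_ERROR.search(record.get("log", "")) is not None


# Hypothetical plugin log records.
records = [
    {"log": "2019-07-09T10:00:00Z E! [inputs.prometheus]: sample error text"},
    {"log": "2019-07-09T10:00:00Z E! [inputs.cpu]: sample error text"},
    {"log": "2019-07-09T10:00:00Z I! sample info text"},
]
kept = [r for r in records if not is_excluded(r)]
print(len(kept))  # 2
```

Only the prometheus plugin error is dropped; other errors from the same log still reach telemetry.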
95 changes: 69 additions & 26 deletions installer/conf/telegraf-rs.conf
@@ -77,7 +77,7 @@
## Run telegraf in quiet mode (error log messages only).
quiet = true
## Specify the log file name. The empty string means to log to stderr.
logfile = "/var/opt/microsoft/docker-cimprov/log/telegraf.log"
logfile = ""

## Override default hostname, if empty use os.Hostname()
#hostname = "placeholder_hostname"
@@ -536,32 +536,75 @@
#tagexclude = ["AgentVersion","AKS_RESOURCE_ID","ACS_RESOURCE_NAME", "Region", "ClusterName", "ClusterType", "Computer", "ControllerType"]
# [inputs.prometheus.tagpass]

[[inputs.exec]]
## Commands array
interval = "15m"
commands = [
"/opt/microsoft/docker-cimprov/bin/TelegrafTCPErrorTelemetry.sh"
]
#Prometheus Custom Metrics
[[inputs.prometheus]]
interval = "$AZMON_RS_PROM_INTERVAL"

## Timeout for each command to complete.
timeout = "15s"
## An array of urls to scrape metrics from.
urls = $AZMON_RS_PROM_URLS

## An array of Kubernetes services to scrape metrics from.
kubernetes_services = $AZMON_RS_PROM_K8S_SERVICES

## Scrape Kubernetes pods for the following prometheus annotations:
## - prometheus.io/scrape: Enable scraping for this pod
## - prometheus.io/scheme: If the metrics endpoint is secured then you will need to
## set this to `https` & most likely set the tls config.
## - prometheus.io/path: If the metrics path is not /metrics, define it with this annotation.
## - prometheus.io/port: If port is not 9102 use this annotation
monitor_kubernetes_pods = $AZMON_RS_PROM_MONITOR_PODS

## measurement name suffix (for separating different commands)
name_suffix = "_telemetry"
fieldpass = $AZMON_RS_PROM_FIELDPASS
fielddrop = $AZMON_RS_PROM_FIELDDROP

## Data format to consume.
## Each data format has its own unique set of configuration options, read
## more about them here:
## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
data_format = "influx"
#tagexclude = ["hostName"]
[inputs.exec.tags]
AgentVersion = "$AGENT_VERSION"
AKS_RESOURCE_ID = "$TELEMETRY_AKS_RESOURCE_ID"
ACS_RESOURCE_NAME = "$TELEMETRY_ACS_RESOURCE_NAME"
Region = "$TELEMETRY_AKS_REGION"
ClusterName = "$TELEMETRY_CLUSTER_NAME"
ClusterType = "$TELEMETRY_CLUSTER_TYPE"
Computer = "placeholder_hostname"
ControllerType = "$CONTROLLER_TYPE"
metric_version = 2
url_tag = "scrapeUrl"

## Kubernetes config file to create client from.
# kube_config = "/path/to/kubernetes.config"

## Use bearer token for authorization. ('bearer_token' takes priority)
bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"
## OR
# bearer_token_string = "abc_123"

## Specify timeout duration for slower prometheus clients (default is 3s)
response_timeout = "15s"

## Optional TLS Config
tls_ca = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
#tls_cert = /path/to/certfile
# tls_key = /path/to/keyfile
## Use TLS but skip chain & host verification
insecure_skip_verify = true
#tagexclude = ["AgentVersion","AKS_RESOURCE_ID","ACS_RESOURCE_NAME", "Region", "ClusterName", "ClusterType", "Computer", "ControllerType"]

# [[inputs.exec]]
# ## Commands array
# interval = "15m"
# commands = [
# "/opt/microsoft/docker-cimprov/bin/TelegrafTCPErrorTelemetry.sh"
# ]

# ## Timeout for each command to complete.
# timeout = "15s"

# ## measurement name suffix (for separating different commands)
# name_suffix = "_telemetry"

# ## Data format to consume.
# ## Each data format has its own unique set of configuration options, read
# ## more about them here:
# ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
# data_format = "influx"
# #tagexclude = ["hostName"]
# [inputs.exec.tags]
# AgentVersion = "$AGENT_VERSION"
# AKS_RESOURCE_ID = "$TELEMETRY_AKS_RESOURCE_ID"
# ACS_RESOURCE_NAME = "$TELEMETRY_ACS_RESOURCE_NAME"
# Region = "$TELEMETRY_AKS_REGION"
# ClusterName = "$TELEMETRY_CLUSTER_NAME"
# ClusterType = "$TELEMETRY_CLUSTER_TYPE"
# Computer = "placeholder_hostname"
# ControllerType = "$CONTROLLER_TYPE"

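With `monitor_kubernetes_pods` enabled, the comments in `telegraf-rs.conf` above describe how pod annotations choose the scrape endpoint. The sketch below turns those rules into a URL using the documented defaults (scheme `http`, path `/metrics`, port `9102`). It illustrates the comments, not Telegraf's implementation; the helper name, pod IP, and annotation values are hypothetical, and treating only the exact string `"true"` as enabling `prometheus.io/scrape` is an assumption:

```python
from typing import Mapping, Optional

# Defaults taken from the annotation comments in the config.
DEFAULT_SCHEME = "http"
DEFAULT_PATH = "/metrics"
DEFAULT_PORT = "9102"


def scrape_url(pod_ip: str, annotations: Mapping[str, str]) -> Optional[str]:
    """Return the URL a pod would be scraped at, or None if it has not opted in."""
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    scheme = annotations.get("prometheus.io/scheme", DEFAULT_SCHEME)
    path = annotations.get("prometheus.io/path", DEFAULT_PATH)
    port = annotations.get("prometheus.io/port", DEFAULT_PORT)
    return f"{scheme}://{pod_ip}:{port}{path}"


print(scrape_url("10.244.1.5", {"prometheus.io/scrape": "true", "prometheus.io/port": "8080"}))
# http://10.244.1.5:8080/metrics
```

A pod secured with TLS would also set `prometheus.io/scheme: https`, which is why the config supplies `tls_ca` and `bearer_token`.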
88 changes: 61 additions & 27 deletions installer/conf/telegraf.conf
@@ -77,8 +77,7 @@
## Run telegraf in quiet mode (error log messages only).
quiet = true
## Specify the log file name. The empty string means to log to stderr.
logfile = "/var/opt/microsoft/docker-cimprov/log/telegraf.log"

logfile = ""
## Override default hostname, if empty use os.Hostname()
#hostname = "placeholder_hostname"
## If set to true, do not set the "host" tag in the telegraf agent.
@@ -568,31 +567,66 @@
insecure_skip_verify = true
#tagexclude = ["AgentVersion","AKS_RESOURCE_ID","ACS_RESOURCE_NAME", "Region", "ClusterName", "ClusterType", "Computer", "ControllerType"]

[[inputs.exec]]
## Commands array
interval = "15m"
commands = [
"/opt/microsoft/docker-cimprov/bin/TelegrafTCPErrorTelemetry.sh"
]

## Timeout for each command to complete.
timeout = "15s"
## prometheus custom metrics
[[inputs.prometheus]]

## measurement name suffix (for separating different commands)
name_suffix = "_telemetry"
interval = "$AZMON_DS_PROM_INTERVAL"

## Data format to consume.
## Each data format has its own unique set of configuration options, read
## more about them here:
## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
data_format = "influx"
tagexclude = ["hostName"]
[inputs.exec.tags]
AgentVersion = "$AGENT_VERSION"
AKS_RESOURCE_ID = "$TELEMETRY_AKS_RESOURCE_ID"
ACS_RESOURCE_NAME = "$TELEMETRY_ACS_RESOURCE_NAME"
Region = "$TELEMETRY_AKS_REGION"
ClusterName = "$TELEMETRY_CLUSTER_NAME"
ClusterType = "$TELEMETRY_CLUSTER_TYPE"
Computer = "placeholder_hostname"
ControllerType = "$CONTROLLER_TYPE"
## An array of urls to scrape metrics from.
urls = $AZMON_DS_PROM_URLS

fieldpass = $AZMON_DS_PROM_FIELDPASS

fielddrop = $AZMON_DS_PROM_FIELDDROP

metric_version = 2
url_tag = "scrapeUrl"

## Kubernetes config file to create client from.
# kube_config = "/path/to/kubernetes.config"

## Use bearer token for authorization. ('bearer_token' takes priority)
bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"
## OR
# bearer_token_string = "abc_123"

## Specify timeout duration for slower prometheus clients (default is 3s)
response_timeout = "15s"

## Optional TLS Config
tls_ca = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
#tls_cert = /path/to/certfile
# tls_key = /path/to/keyfile
## Use TLS but skip chain & host verification
insecure_skip_verify = true
#tagexclude = ["AgentVersion","AKS_RESOURCE_ID","ACS_RESOURCE_NAME", "Region", "ClusterName", "ClusterType", "Computer", "ControllerType"]

# [[inputs.exec]]
# ## Commands array
# interval = "15m"
# commands = [
# "/opt/microsoft/docker-cimprov/bin/TelegrafTCPErrorTelemetry.sh"
# ]

# ## Timeout for each command to complete.
# timeout = "15s"

# ## measurement name suffix (for separating different commands)
# name_suffix = "_telemetry"

# ## Data format to consume.
# ## Each data format has its own unique set of configuration options, read
# ## more about them here:
# ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
# data_format = "influx"
# tagexclude = ["hostName"]
# [inputs.exec.tags]
# AgentVersion = "$AGENT_VERSION"
# AKS_RESOURCE_ID = "$TELEMETRY_AKS_RESOURCE_ID"
# ACS_RESOURCE_NAME = "$TELEMETRY_ACS_RESOURCE_NAME"
# Region = "$TELEMETRY_AKS_REGION"
# ClusterName = "$TELEMETRY_CLUSTER_NAME"
# ClusterType = "$TELEMETRY_CLUSTER_TYPE"
# Computer = "placeholder_hostname"
# ControllerType = "$CONTROLLER_TYPE"
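Telegraf substitutes `$VAR`-style environment variables into the config text before it parses the TOML. Because `urls`, `fieldpass`, and `fielddrop` are unquoted, the corresponding `AZMON_DS_PROM_*` variables (presumably produced by the `tomlparser-prom-customconfig.rb` script this PR adds to the package) must already hold TOML array literals, while `interval` is quoted and only needs a bare duration. A sketch of that textual substitution using Python's `string.Template`; the variable values are hypothetical, and raising on an unset variable is a choice of this sketch, not necessarily Telegraf's behavior:

```python
from string import Template

# Fragment of the DaemonSet [[inputs.prometheus]] section in telegraf.conf.
FRAGMENT = """\
interval = "$AZMON_DS_PROM_INTERVAL"
urls = $AZMON_DS_PROM_URLS
fieldpass = $AZMON_DS_PROM_FIELDPASS
fielddrop = $AZMON_DS_PROM_FIELDDROP
"""

# Hypothetical environment values. The unquoted placeholders need TOML array
# literals; the quoted interval only needs the bare duration text.
ENV = {
    "AZMON_DS_PROM_INTERVAL": "1m",
    "AZMON_DS_PROM_URLS": '["http://localhost:9100/metrics"]',
    "AZMON_DS_PROM_FIELDPASS": "[]",
    "AZMON_DS_PROM_FIELDDROP": '["go_gc_duration_seconds"]',
}


def render(fragment: str, env: dict) -> str:
    """Plain textual $VAR substitution; raises KeyError if a variable is unset."""
    return Template(fragment).substitute(env)


rendered = render(FRAGMENT, ENV)
print(rendered)
```

If a script exported a bare URL such as `http://localhost:9100/metrics` for `AZMON_DS_PROM_URLS`, the rendered line would not be valid TOML.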
3 changes: 2 additions & 1 deletion installer/datafiles/base_container.data
@@ -110,9 +110,10 @@ MAINTAINER: 'Microsoft Corporation'
/etc/opt/microsoft/docker-cimprov/out_oms.conf; installer/conf/out_oms.conf; 644; root; root
/etc/opt/microsoft/docker-cimprov/telegraf.conf; installer/conf/telegraf.conf; 644; root; root
/etc/opt/microsoft/docker-cimprov/telegraf-rs.conf; installer/conf/telegraf-rs.conf; 644; root; root
/opt/microsoft/docker-cimprov/bin/TelegrafTCPErrorTelemetry.sh; installer/scripts/TelegrafTCPErrorTelemetry.sh; 755; root; root
/opt/microsoft/docker-cimprov/bin/TelegrafTCPErrorTelemetry.sh; installer/scripts/TelegrafTCPErrorTelemetry.sh; 755; root; root
/opt/livenessprobe.sh; installer/scripts/livenessprobe.sh; 755; root; root
/opt/tomlparser.rb; installer/scripts/tomlparser.rb; 755; root; root
/opt/tomlparser-prom-customconfig.rb; installer/scripts/tomlparser-prom-customconfig.rb; 755; root; root

%Links
/opt/omi/lib/libcontainer.${{SHLIB_EXT}}; /opt/microsoft/docker-cimprov/lib/libcontainer.${{SHLIB_EXT}}; 644; root; root
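Each line under `%Files` in `base_container.data` appears to follow the layout `<installed path>; <source path in repo>; <mode>; <owner>; <group>`, judging from the entries above. A small parser for that layout, assuming exactly five `;`-separated fields and an octal mode; `FileEntry` and `parse_entry` are hypothetical names, not part of the installer tooling:

```python
from typing import NamedTuple


class FileEntry(NamedTuple):
    destination: str  # install path on the target system
    source: str       # path of the file in the repository
    mode: int         # permission bits, written in octal (e.g. 755)
    owner: str
    group: str


def parse_entry(line: str) -> FileEntry:
    """Split one datafile line into its five ';'-separated fields."""
    fields = [f.strip() for f in line.split(";")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 ';'-separated fields, got {len(fields)}: {line!r}")
    dest, src, mode, owner, group = fields
    return FileEntry(dest, src, int(mode, 8), owner, group)


entry = parse_entry(
    "/opt/tomlparser-prom-customconfig.rb; installer/scripts/tomlparser-prom-customconfig.rb; 755; root; root"
)
print(oct(entry.mode))  # 0o755
```

Parsing the mode as octal keeps `755` and `644` meaning what they mean to `chmod`.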