Merge ci_feature to ci_feature_prod #271
Merged
rashmichandrashekar merged 141 commits into ci_feature_prod on Oct 7, 2019
Dilipr/fluentd config updates
* Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id
* Updating glide.* files to include lumberjack
* Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths
* Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format
* adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson
* adding null check for cjson-delete * null chk * removing null check
Removing fluent-bit filters, CPU optimizations
…ontinue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY
* Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset
…properly, tab to whitespace in conf file
* Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. Added code to send Exceptions/errors * PR Feedback
* Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback
* updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging
* fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes
… perf testing (#246) Merge Health to ci_feature
…used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation
Added new regions, added handler for MDM plugin start
* Added Missing Handlers
* Adding explicit require_relative
* enable ai telemetry to configure different ikey and endpoint per cloud
…ervice name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model
…theus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes
* add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file
* changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes
* Dupe Perf Record Fix
…itions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268)
* init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes
send when an agg monitor details change, but state did not change
vishiy (Member) approved these changes on Oct 7, 2019 and left a comment:
approving to be able to resolve conflicts
r-dilip approved these changes on Oct 7, 2019
vishiy reviewed on Oct 7, 2019
#tagexclude = ["AgentVersion","AKS_RESOURCE_ID","ACS_RESOURCE_NAME", "Region", "ClusterName", "ClusterType", "Computer", "ControllerType"]

[inputs.prometheus.tagpass]
operation_type = ["create_container", "remove_container", "pull_image"]
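For readers unfamiliar with Telegraf's `tagpass`, the table in the snippet above keeps a metric only when one of the listed tag keys is present with a value matching one of its patterns. The following is a minimal Python sketch of that rule, not Telegraf's own code; `TAGPASS` and `passes_tagpass` are hypothetical names introduced here for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical stand-in for the [inputs.prometheus.tagpass] table above.
TAGPASS = {
    "operation_type": ["create_container", "remove_container", "pull_image"],
}


def passes_tagpass(tags, tagpass):
    """Return True if the metric's tags satisfy the tagpass table.

    Simplified model: an empty table passes everything; otherwise the
    metric passes as soon as any listed tag key is present with a value
    that matches one of that key's glob patterns.
    """
    if not tagpass:
        return True
    for key, patterns in tagpass.items():
        value = tags.get(key)
        if value is None:
            continue  # metric does not carry this tag at all
        if any(fnmatchcase(value, pattern) for pattern in patterns):
            return True
    return False
```

Under this model a metric tagged `operation_type=pull_image` is kept, while one tagged `operation_type=list_containers`, or one with no `operation_type` tag, is dropped.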
vishiy approved these changes on Oct 7, 2019
ayusheesingh-zz pushed a commit that referenced this pull request on Jun 27, 2020
* Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. 
Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * 
Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes 
for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory (#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for 
fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent 
version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge 
from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. * fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use 
LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor 
comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change
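One item in the commit message above, "No Retries for non 404 4xx errors (#196)", describes a retry rule that is easy to misread. The sketch below is an illustration based only on that commit title, not the plugin's actual code; `should_retry` is a hypothetical name, and treating 5xx responses as retryable is an assumption the title does not confirm.

```python
def should_retry(status_code):
    """Decide whether a failed HTTP POST is worth retrying (assumed policy)."""
    if 200 <= status_code < 300:
        return False  # success, nothing to retry
    if status_code == 404:
        return True  # the one 4xx code the commit title exempts
    if 400 <= status_code < 500:
        return False  # other client errors: retrying will not help
    return True  # assumption: 5xx and anything else is treated as transient
```

For example, `should_retry(403)` is false while `should_retry(404)` and `should_retry(503)` are true under these assumptions.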
jatakiajanvi12 pushed a commit that referenced this pull request on Dec 2, 2022
version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge 
from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. * fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use 
LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor 
comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change
No description provided.