Merged
46 commits
82d91c2
optimize kpi
ganga1980 Nov 29, 2020
bede6ef
optimize kube node inventory
ganga1980 Nov 30, 2020
9f7759e
add flags for events, deployments and hpa
ganga1980 Nov 30, 2020
6073fed
have separate function parseNodeLimits
ganga1980 Nov 30, 2020
97f55f7
refactor code
ganga1980 Nov 30, 2020
abc28c2
fix crash
ganga1980 Nov 30, 2020
259a95c
fix bug with service name
ganga1980 Nov 30, 2020
b37529b
fix bugs related to get service name
ganga1980 Nov 30, 2020
7375e33
update oom fix test agent
ganga1980 Nov 30, 2020
ed0857b
debug logs
ganga1980 Nov 30, 2020
b69f032
fix service label issue
ganga1980 Nov 30, 2020
2eeaed4
update to latest agent and enable ephemeral annotation
ganga1980 Nov 30, 2020
10e4b71
change stream size to 200 from 250
ganga1980 Nov 30, 2020
d003daa
update yaml
ganga1980 Nov 30, 2020
0ba0610
adjust chunksizes
ganga1980 Dec 1, 2020
43975d9
add ruby gc env
ganga1980 Dec 1, 2020
2b8660b
yaml changes for cioomtest11282020-3
ganga1980 Dec 1, 2020
8e378fa
telemetry to track pods latency
ganga1980 Dec 1, 2020
fb56ab0
service count telemetry
ganga1980 Dec 1, 2020
e9541ea
rename variables
ganga1980 Dec 1, 2020
023a7cb
wip
ganga1980 Dec 1, 2020
26f0772
nodes inventory telemetry
ganga1980 Dec 1, 2020
79f40f1
configmap changes
ganga1980 Dec 2, 2020
3545773
add emit streams in configmap
ganga1980 Dec 2, 2020
9b7587d
yaml updates
ganga1980 Dec 2, 2020
9b857b4
fix copy and paste bug
ganga1980 Dec 2, 2020
5597360
add todo comments
ganga1980 Dec 2, 2020
8880e91
fix node latency telemetry bug
ganga1980 Dec 4, 2020
87f52d6
update yaml with latest test image
ganga1980 Dec 4, 2020
c4651c9
fix bug
ganga1980 Dec 4, 2020
95144a6
upping rs memory change
ganga1980 Dec 4, 2020
ae2cf42
fix mdm bug with final emit stream
ganga1980 Dec 9, 2020
cf8da5c
update to latest image
ganga1980 Dec 9, 2020
11eda7c
fix pr feedback
ganga1980 Dec 9, 2020
2f3574d
fix pr feedback
ganga1980 Dec 10, 2020
6b589a9
rename health config to agent config
ganga1980 Dec 13, 2020
53972c2
fix max allowed hpa chunk size
ganga1980 Dec 13, 2020
f8702ff
update to use 1k pod chunk since validated on 1.18+
ganga1980 Dec 14, 2020
531f768
remove debug logs
ganga1980 Dec 14, 2020
cff2ee4
minor updates
ganga1980 Dec 15, 2020
60d6391
move defaults to common place
ganga1980 Dec 15, 2020
acb2f8f
Merge branch 'ci_dev' into gangams/fix-rs-ooming
ganga1980 Dec 15, 2020
f88ae92
chart updates
ganga1980 Dec 15, 2020
0392e28
final oomfix agent
ganga1980 Dec 15, 2020
6be2e13
update to use prod image so that can be validated with build pipeline
ganga1980 Dec 15, 2020
1c25829
fix typo in comment
ganga1980 Dec 15, 2020
2 changes: 1 addition & 1 deletion build/linux/installer/datafiles/base_container.data
@@ -123,7 +123,7 @@ MAINTAINER: 'Microsoft Corporation'
/opt/tomlparser-mdm-metrics-config.rb; build/linux/installer/scripts/tomlparser-mdm-metrics-config.rb; 755; root; root
/opt/tomlparser-metric-collection-config.rb; build/linux/installer/scripts/tomlparser-metric-collection-config.rb; 755; root; root

-/opt/tomlparser-health-config.rb; build/linux/installer/scripts/tomlparser-health-config.rb; 755; root; root
+/opt/tomlparser-agent-config.rb; build/linux/installer/scripts/tomlparser-agent-config.rb; 755; root; root
/opt/tomlparser.rb; build/common/installer/scripts/tomlparser.rb; 755; root; root
/opt/td-agent-bit-conf-customizer.rb; build/common/installer/scripts/td-agent-bit-conf-customizer.rb; 755; root; root
/opt/ConfigParseErrorLogger.rb; build/common/installer/scripts/ConfigParseErrorLogger.rb; 755; root; root
172 changes: 172 additions & 0 deletions build/linux/installer/scripts/tomlparser-agent-config.rb
@@ -0,0 +1,172 @@
#!/usr/local/bin/ruby

#this should be require relative in Linux and require in windows, since it is a gem install on windows
@os_type = ENV["OS_TYPE"]
if !@os_type.nil? && !@os_type.empty? && @os_type.strip.casecmp("windows") == 0
require "tomlrb"
else
require_relative "tomlrb"
end

require_relative "ConfigParseErrorLogger"

@configMapMountPath = "/etc/config/settings/agent-settings"
@configSchemaVersion = ""
@enable_health_model = false

# 250 Node items (15KB per node) account to approximately 4MB
@nodesChunkSize = 250
# 1000 pods (10KB per pod) account to approximately 10MB
@podsChunkSize = 1000
# 4000 events (1KB per event) account to approximately 4MB
@eventsChunkSize = 4000
# roughly each deployment is 8k
# 500 deployments account to approximately 4MB
@deploymentsChunkSize = 500
# roughly each HPA is 3k
# 2000 HPAs account to approximately 6-7MB
@hpaChunkSize = 2000
# emit stream batch sizes, kept small to avoid large file writes
# values that are too low consume more disk iops
@podsEmitStreamBatchSize = 200
@nodesEmitStreamBatchSize = 100

# a larger chunk size raises rs pod memory consumption but lowers api latency
# a smaller chunk size lowers memory consumption but incurs additional round-trip latency
# these need to be tuned based on the workload
# nodes
@nodesChunkSizeMin = 100
@nodesChunkSizeMax = 400
# pods
@podsChunkSizeMin = 250
@podsChunkSizeMax = 1500
# events
@eventsChunkSizeMin = 2000
@eventsChunkSizeMax = 10000
# deployments
@deploymentsChunkSizeMin = 500
@deploymentsChunkSizeMax = 1000
# hpa
@hpaChunkSizeMin = 500
@hpaChunkSizeMax = 2000

# minimum emit stream batch sizes, to reject low values that cost extra disk i/o
# the maximum is the corresponding chunk size
@podsEmitStreamBatchSizeMin = 50
@nodesEmitStreamBatchSizeMin = 50

def is_number?(value)
true if Integer(value) rescue false
end

# Use parser to parse the configmap toml file to a ruby structure
def parseConfigMap
begin
# Check to see if config map is created
if (File.file?(@configMapMountPath))
puts "config::configmap container-azm-ms-agentconfig for agent settings mounted, parsing values"
parsedConfig = Tomlrb.load_file(@configMapMountPath, symbolize_keys: true)
puts "config::Successfully parsed mounted config map"
return parsedConfig
else
puts "config::configmap container-azm-ms-agentconfig for agent settings not mounted, using defaults"
return nil
end
rescue => errorStr
ConfigParseErrorLogger.logError("Exception while parsing config map for agent settings : #{errorStr}, using defaults, please check config map for errors")
return nil
end
end

# Use the ruby structure created after config parsing to set the right values to be used as environment variables
def populateSettingValuesFromConfigMap(parsedConfig)
begin
if !parsedConfig.nil? && !parsedConfig[:agent_settings].nil?
if !parsedConfig[:agent_settings][:health_model].nil? && !parsedConfig[:agent_settings][:health_model][:enabled].nil?
@enable_health_model = parsedConfig[:agent_settings][:health_model][:enabled]
puts "enable_health_model = #{@enable_health_model}"
end
chunk_config = parsedConfig[:agent_settings][:chunk_config]
if !chunk_config.nil?
nodesChunkSize = chunk_config[:NODES_CHUNK_SIZE]
if !nodesChunkSize.nil? && is_number?(nodesChunkSize) && (@nodesChunkSizeMin..@nodesChunkSizeMax) === nodesChunkSize.to_i
@nodesChunkSize = nodesChunkSize.to_i
puts "Using config map value: NODES_CHUNK_SIZE = #{@nodesChunkSize}"
end

podsChunkSize = chunk_config[:PODS_CHUNK_SIZE]
if !podsChunkSize.nil? && is_number?(podsChunkSize) && (@podsChunkSizeMin..@podsChunkSizeMax) === podsChunkSize.to_i
@podsChunkSize = podsChunkSize.to_i
puts "Using config map value: PODS_CHUNK_SIZE = #{@podsChunkSize}"
end

eventsChunkSize = chunk_config[:EVENTS_CHUNK_SIZE]
if !eventsChunkSize.nil? && is_number?(eventsChunkSize) && (@eventsChunkSizeMin..@eventsChunkSizeMax) === eventsChunkSize.to_i
@eventsChunkSize = eventsChunkSize.to_i
puts "Using config map value: EVENTS_CHUNK_SIZE = #{@eventsChunkSize}"
end

deploymentsChunkSize = chunk_config[:DEPLOYMENTS_CHUNK_SIZE]
if !deploymentsChunkSize.nil? && is_number?(deploymentsChunkSize) && (@deploymentsChunkSizeMin..@deploymentsChunkSizeMax) === deploymentsChunkSize.to_i
@deploymentsChunkSize = deploymentsChunkSize.to_i
puts "Using config map value: DEPLOYMENTS_CHUNK_SIZE = #{@deploymentsChunkSize}"
end

hpaChunkSize = chunk_config[:HPA_CHUNK_SIZE]
if !hpaChunkSize.nil? && is_number?(hpaChunkSize) && (@hpaChunkSizeMin..@hpaChunkSizeMax) === hpaChunkSize.to_i
@hpaChunkSize = hpaChunkSize.to_i
puts "Using config map value: HPA_CHUNK_SIZE = #{@hpaChunkSize}"
end

podsEmitStreamBatchSize = chunk_config[:PODS_EMIT_STREAM_BATCH_SIZE]
if !podsEmitStreamBatchSize.nil? && is_number?(podsEmitStreamBatchSize) &&
podsEmitStreamBatchSize.to_i <= @podsChunkSize && podsEmitStreamBatchSize.to_i >= @podsEmitStreamBatchSizeMin
@podsEmitStreamBatchSize = podsEmitStreamBatchSize.to_i
puts "Using config map value: PODS_EMIT_STREAM_BATCH_SIZE = #{@podsEmitStreamBatchSize}"
end
nodesEmitStreamBatchSize = chunk_config[:NODES_EMIT_STREAM_BATCH_SIZE]
if !nodesEmitStreamBatchSize.nil? && is_number?(nodesEmitStreamBatchSize) &&
nodesEmitStreamBatchSize.to_i <= @nodesChunkSize && nodesEmitStreamBatchSize.to_i >= @nodesEmitStreamBatchSizeMin
@nodesEmitStreamBatchSize = nodesEmitStreamBatchSize.to_i
puts "Using config map value: NODES_EMIT_STREAM_BATCH_SIZE = #{@nodesEmitStreamBatchSize}"
end
end
end
rescue => errorStr
puts "config::error:Exception while reading config settings for agent configuration setting - #{errorStr}, using defaults"
@enable_health_model = false
end
end

@configSchemaVersion = ENV["AZMON_AGENT_CFG_SCHEMA_VERSION"]
puts "****************Start Config Processing********************"
if !@configSchemaVersion.nil? && !@configSchemaVersion.empty? && @configSchemaVersion.strip.casecmp("v1") == 0 #note v1 is the only supported schema version , so hardcoding it
configMapSettings = parseConfigMap
if !configMapSettings.nil?
populateSettingValuesFromConfigMap(configMapSettings)
end
else
if (File.file?(@configMapMountPath))
ConfigParseErrorLogger.logError("config::unsupported/missing config schema version - '#{@configSchemaVersion}' , using defaults, please use supported schema version")
end
@enable_health_model = false
end

# Write the settings to file, so that they can be set as environment variables
file = File.open("agent_config_env_var", "w")

if !file.nil?
file.write("export AZMON_CLUSTER_ENABLE_HEALTH_MODEL=#{@enable_health_model}\n")
file.write("export NODES_CHUNK_SIZE=#{@nodesChunkSize}\n")
file.write("export PODS_CHUNK_SIZE=#{@podsChunkSize}\n")
file.write("export EVENTS_CHUNK_SIZE=#{@eventsChunkSize}\n")
file.write("export DEPLOYMENTS_CHUNK_SIZE=#{@deploymentsChunkSize}\n")
file.write("export HPA_CHUNK_SIZE=#{@hpaChunkSize}\n")
file.write("export PODS_EMIT_STREAM_BATCH_SIZE=#{@podsEmitStreamBatchSize}\n")
file.write("export NODES_EMIT_STREAM_BATCH_SIZE=#{@nodesEmitStreamBatchSize}\n")
# Close file after writing all environment variables
file.close
else
puts "Exception while opening file for writing config environment variables"
end
puts "****************End Config Processing********************"
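The parser above repeats the same check for every setting: the value must be present, numeric, and inside a min/max range, otherwise the default is kept. A minimal sketch of that rule follows. The helper name `bounded_int` and the sample hash are hypothetical and not part of this PR; the hash only mimics the shape that `Tomlrb.load_file(..., symbolize_keys: true)` is assumed to return for a `chunk_config` table. The ranges and defaults are the ones from the file.

```ruby
# Hypothetical helper, not part of the PR: returns the configured value as an
# Integer when it parses cleanly and falls inside [min, max]; otherwise the default.
def bounded_int(value, min, max, default)
  parsed = Integer(value)
  (min..max).cover?(parsed) ? parsed : default
rescue ArgumentError, TypeError
  # Integer() raises ArgumentError for "abc" and TypeError for nil.
  default
end

# Assumed shape of the parsed [agent_settings.chunk_config] table.
chunk_config = {
  NODES_CHUNK_SIZE: 300,     # inside 100..400, accepted
  PODS_CHUNK_SIZE: "5000",   # above 1500, rejected
  EVENTS_CHUNK_SIZE: "abc",  # not a number, rejected
}

nodes  = bounded_int(chunk_config[:NODES_CHUNK_SIZE], 100, 400, 250)
pods   = bounded_int(chunk_config[:PODS_CHUNK_SIZE], 250, 1500, 1000)
events = bounded_int(chunk_config[:EVENTS_CHUNK_SIZE], 2000, 10000, 4000)
hpa    = bounded_int(chunk_config[:HPA_CHUNK_SIZE], 500, 2000, 2000) # key absent

puts "nodes=#{nodes} pods=#{pods} events=#{events} hpa=#{hpa}"
# prints: nodes=300 pods=1000 events=4000 hpa=2000
```

Out-of-range values fall back to the default rather than being clamped to the nearest bound, which matches how the range checks in the file behave.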
73 changes: 0 additions & 73 deletions build/linux/installer/scripts/tomlparser-health-config.rb

This file was deleted.

@@ -95,7 +95,7 @@ data:
<match oms.containerinsights.KubePodInventory**>
type out_oms
log_level debug
-num_threads 5
+num_threads 2
buffer_chunk_limit 4m
buffer_type file
buffer_path %STATE_DIR_WS%/out_oms_kubepods*.buffer
@@ -108,24 +108,24 @@ data:
</match>

<match oms.containerinsights.KubePVInventory**>
-type out_oms
-log_level debug
-num_threads 5
-buffer_chunk_limit 4m
-buffer_type file
-buffer_path %STATE_DIR_WS%/state/out_oms_kubepv*.buffer
-buffer_queue_limit 20
-buffer_queue_full_action drop_oldest_chunk
-flush_interval 20s
-retry_limit 10
-retry_wait 5s
-max_retry_wait 5m
+type out_oms
+log_level debug
+num_threads 5
+buffer_chunk_limit 4m
+buffer_type file
+buffer_path %STATE_DIR_WS%/state/out_oms_kubepv*.buffer
+buffer_queue_limit 20
+buffer_queue_full_action drop_oldest_chunk
+flush_interval 20s
+retry_limit 10
+retry_wait 5s
+max_retry_wait 5m
</match>

<match oms.containerinsights.KubeEvents**>
type out_oms
log_level debug
-num_threads 5
+num_threads 2
buffer_chunk_limit 4m
buffer_type file
buffer_path %STATE_DIR_WS%/out_oms_kubeevents*.buffer
@@ -155,7 +155,7 @@ data:
<match oms.containerinsights.KubeNodeInventory**>
type out_oms
log_level debug
-num_threads 5
+num_threads 2
buffer_chunk_limit 4m
buffer_type file
buffer_path %STATE_DIR_WS%/state/out_oms_kubenodes*.buffer
@@ -184,7 +184,7 @@ data:
<match oms.api.KubePerf**>
type out_oms
log_level debug
-num_threads 5
+num_threads 2
buffer_chunk_limit 4m
buffer_type file
buffer_path %STATE_DIR_WS%/out_oms_kubeperf*.buffer
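For a sense of scale, the buffer settings in these match blocks put an upper bound on how much data one output can queue. The sketch below works that out for the `buffer_chunk_limit 4m` and `buffer_queue_limit 20` values shown for the KubePVInventory output. It assumes Fluentd's size suffixes are powers of 1024 and that the queue limit counts chunks; the helper `size_to_bytes` is hypothetical.

```ruby
# Parse a Fluentd-style size string such as "4m" into bytes.
# Assumption: k/m/g are powers of 1024.
def size_to_bytes(text)
  match = /\A(\d+)([kmg]?)\z/i.match(text.strip)
  raise ArgumentError, "unrecognized size: #{text}" unless match
  units = { "" => 1, "k" => 1024, "m" => 1024**2, "g" => 1024**3 }
  match[1].to_i * units[match[2].downcase]
end

chunk_limit = size_to_bytes("4m")  # buffer_chunk_limit from the diff
queue_limit = 20                   # buffer_queue_limit from the KubePVInventory block
max_buffer_bytes = chunk_limit * queue_limit

puts "worst-case queued data per output: #{max_buffer_bytes / 1024**2} MiB"
# prints: worst-case queued data per output: 80 MiB
```

Lowering `num_threads` from 5 to 2 does not change this bound. It only reduces how many chunks are flushed concurrently, which plausibly lowers peak memory at the cost of slower draining.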
9 changes: 9 additions & 0 deletions charts/azuremonitor-containers/values.yaml
@@ -81,6 +81,15 @@ omsagent:
deployment:
affinity:
nodeAffinity:
+# affinity to schedule on to ephemeral os node if its available
+preferredDuringSchedulingIgnoredDuringExecution:
+- weight: 1
+preference:
+matchExpressions:
+- key: storageprofile
+operator: NotIn
+values:
+- managed
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- labelSelector:
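The preferred term added above can be read as "give weight 1 to any node whose `storageprofile` label is not `managed`". The sketch below illustrates that reading only; it is not how the scheduler is implemented. It assumes `NotIn` follows Kubernetes label-selector semantics, where a node without the key also matches. The node names and the `ephemeral` label value are hypothetical.

```ruby
# Illustrative only: evaluates a single NotIn expression against node labels.
def not_in?(labels, key, values)
  # Assumption: NotIn also matches nodes that do not carry the label at all.
  !labels.key?(key) || !values.include?(labels[key])
end

# Score contributed by the preferred term from values.yaml (weight 1).
def preference_score(labels, weight: 1)
  not_in?(labels, "storageprofile", ["managed"]) ? weight : 0
end

# Hypothetical nodes and label values.
nodes = {
  "ephemeral-node" => { "storageprofile" => "ephemeral" },
  "managed-node"   => { "storageprofile" => "managed" },
  "unlabeled-node" => {},
}

nodes.each { |name, labels| puts "#{name}: #{preference_score(labels)}" }
```

Because the term is under `preferredDuringSchedulingIgnoredDuringExecution`, a score of 0 does not exclude a node; the pod can still land on a managed-disk node when nothing better is available.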
1 change: 1 addition & 0 deletions kubernetes/linux/Dockerfile
@@ -15,6 +15,7 @@ ENV HOST_VAR /hostfs/var
ENV AZMON_COLLECT_ENV False
ENV KUBE_CLIENT_BACKOFF_BASE 1
ENV KUBE_CLIENT_BACKOFF_DURATION 0
+ENV RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR 0.9
RUN /usr/bin/apt-get update && /usr/bin/apt-get install -y libc-bin wget openssl curl sudo python-ctypes init-system-helpers net-tools rsyslog cron vim dmidecode apt-transport-https gnupg && rm -rf /var/lib/apt/lists/*
COPY setup.sh main.sh defaultpromenvvariables defaultpromenvvariables-rs mdsd.xml envmdsd $tmpdir/
WORKDIR ${tmpdir}
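`RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR` is read by the Ruby interpreter at startup. As Ruby documents it, a major GC is triggered once the number of old objects exceeds this factor times the count left after the previous major GC, with a default of 2.0. A lower value such as 0.9 should therefore make major collections more frequent. A quick way to observe the effect inside the container is to print the related `GC.stat` counters. This is a sketch for inspection only, assuming a Ruby version (2.2 or later) that exposes these keys.

```ruby
# Reads the GC tuning variable added in the Dockerfile and a few GC counters.
factor = ENV.fetch("RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR", "(unset, Ruby default applies)")
stats = GC.stat

puts "RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=#{factor}"
puts "old_objects=#{stats[:old_objects]}"              # objects promoted to the old generation
puts "old_objects_limit=#{stats[:old_objects_limit]}"  # threshold that triggers a major GC
puts "major_gc_count=#{stats[:major_gc_count]}"        # major collections so far
```

Comparing `major_gc_count` over time with and without the variable set is one way to confirm the setting is being picked up.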
16 changes: 8 additions & 8 deletions kubernetes/linux/main.sh
@@ -171,14 +171,14 @@ done
source config_env_var


-#Parse the configmap to set the right environment variables for health feature.
-/opt/microsoft/omsagent/ruby/bin/ruby tomlparser-health-config.rb
+#Parse the configmap to set the right environment variables for agent config.
+/opt/microsoft/omsagent/ruby/bin/ruby tomlparser-agent-config.rb

-cat health_config_env_var | while read line; do
+cat agent_config_env_var | while read line; do
#echo $line
echo $line >> ~/.bashrc
done
-source health_config_env_var
+source agent_config_env_var

#Parse the configmap to set the right environment variables for network policy manager (npm) integration.
/opt/microsoft/omsagent/ruby/bin/ruby tomlparser-npm-config.rb
@@ -429,7 +429,7 @@ echo "export DOCKER_CIMPROV_VERSION=$DOCKER_CIMPROV_VERSION" >> ~/.bashrc

#region check to auto-activate oneagent, to route container logs,
#Intent is to activate one agent routing for all managed clusters with region in the regionlist, unless overridden by configmap
-# AZMON_CONTAINER_LOGS_ROUTE will have route (if any) specified in the config map
+# AZMON_CONTAINER_LOGS_ROUTE will have route (if any) specified in the config map
# AZMON_CONTAINER_LOGS_EFFECTIVE_ROUTE will have the final route that we compute & set, based on our region list logic
echo "************start oneagent log routing checks************"
# by default, use configmap route for safer side
@@ -462,9 +462,9 @@ else
echo "current region is not in oneagent regions..."
fi

-if [ "$isoneagentregion" = true ]; then
+if [ "$isoneagentregion" = true ]; then
#if configmap has a routing for logs, but current region is in the oneagent region list, take the configmap route
-if [ ! -z $AZMON_CONTAINER_LOGS_ROUTE ]; then
+if [ ! -z $AZMON_CONTAINER_LOGS_ROUTE ]; then
AZMON_CONTAINER_LOGS_EFFECTIVE_ROUTE=$AZMON_CONTAINER_LOGS_ROUTE
echo "oneagent region is true for current region:$currentregion and config map logs route is not empty. so using config map logs route as effective route:$AZMON_CONTAINER_LOGS_EFFECTIVE_ROUTE"
else #there is no configmap route, so route thru oneagent
@@ -511,7 +511,7 @@ if [ ! -e "/etc/config/kube.conf" ]; then

echo "starting mdsd ..."
mdsd -l -e ${MDSD_LOG}/mdsd.err -w ${MDSD_LOG}/mdsd.warn -o ${MDSD_LOG}/mdsd.info -q ${MDSD_LOG}/mdsd.qos &

touch /opt/AZMON_CONTAINER_LOGS_EFFECTIVE_ROUTE_V2
fi
fi
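For reference, main.sh consumes `agent_config_env_var` as a plain list of `export KEY=VALUE` lines: each line is appended to `~/.bashrc` and the file is then sourced. The sketch below parses the same format in Ruby to show what a reader of that file can expect. The sample content is an excerpt using the default values from tomlparser-agent-config.rb; the parsing code itself is hypothetical and not part of this PR.

```ruby
# Sample of what tomlparser-agent-config.rb writes with its default values (excerpt).
env_file = <<~EXPORTS
  export AZMON_CLUSTER_ENABLE_HEALTH_MODEL=false
  export NODES_CHUNK_SIZE=250
  export PODS_CHUNK_SIZE=1000
EXPORTS

# Turn each `export KEY=VALUE` line into a hash entry, skipping anything else.
settings = env_file.each_line.each_with_object({}) do |line, acc|
  match = /\Aexport\s+([A-Z_][A-Z0-9_]*)=(.*)\z/.match(line.strip)
  acc[match[1]] = match[2] if match
end

# Values arrive as strings, so numeric settings need an explicit conversion.
pods_chunk_size = Integer(settings.fetch("PODS_CHUNK_SIZE", "1000"))

puts settings
puts pods_chunk_size
```

The same caveat applies to anything that reads these variables from the environment: they are strings, so a consumer has to convert and validate them before using them as sizes.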