Merged
55 commits
3c5b46d
Updatng release history
vishiy Aug 1, 2018
d31f588
fixing the plugin logs for emit stream
Aug 1, 2018
11fd5f6
updating log message
Aug 5, 2018
87a9cf8
Remove Log Processing from fluentd configuration
r-dilip Aug 16, 2018
308be41
Remove plugin references from base_container.data
r-dilip Aug 16, 2018
5bee0af
Merge pull request #124 from Microsoft/dilipr/fluentdConfigUpdates
r-dilip Aug 30, 2018
bcd1a3f
Dilipr/fluent bit log processing (#126)
r-dilip Sep 14, 2018
b02f2ec
Dilipr/glide updates (#127)
r-dilip Sep 14, 2018
e01c678
containerID="" for pull issues
vishiy Sep 17, 2018
b0ba22d
Using KubeAPI for getting image,name. Adding more logs (#129)
r-dilip Sep 18, 2018
9783419
Dilipr/mark comments (#130)
r-dilip Sep 27, 2018
8e35b73
Rashmi/segfault latest (#132)
rashmichandrashekar Sep 27, 2018
4b63021
Adding a missed null check (#135)
rashmichandrashekar Sep 27, 2018
8b964fd
reusing some variables (#136)
rashmichandrashekar Sep 28, 2018
938c2ed
Rashmi/cjson delete null check (#138)
rashmichandrashekar Sep 28, 2018
fbfdf11
updating log level to debug for some provider workflows (#139)
rashmichandrashekar Oct 3, 2018
d426066
Fixing CPU Utilization and removing Fluent-bit filters (#140)
r-dilip Oct 4, 2018
c2cabab
Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. C…
r-dilip Oct 9, 2018
32567db
* Change FluentBit flush interval to 30 secs (from 5 secs)
vishiy Oct 10, 2018
afc981d
Container Log Telemetry
r-dilip Oct 12, 2018
4b958dd
Fixing an issue with Send Init Event if Telemetry is not initialized …
r-dilip Oct 12, 2018
510ef9f
PR feedback
r-dilip Oct 12, 2018
684c39b
PR feedback
r-dilip Oct 12, 2018
e165275
Sending an event every 5 mins(Heartbeat) (#146)
r-dilip Oct 15, 2018
eecb5db
Merge branch 'ci_feature_prod' into ci_feature
vishiy Oct 16, 2018
cfe1ca9
PR feedback to cleanup removed workflows
vishiy Oct 16, 2018
892b51c
updating agent version for telemetry
vishiy Oct 16, 2018
9c83160
updating agent version
vishiy Oct 17, 2018
f0b5a61
Telemetry Updates (#149)
r-dilip Oct 25, 2018
a58998e
Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159)
r-dilip Oct 30, 2018
4c2da9f
Rashmi/fluentd docker inventory (#160)
rashmichandrashekar Nov 5, 2018
6698fcd
Fix Telemetry Bug -- Initialize Telemetry Client after Initializing a…
r-dilip Nov 8, 2018
ad6bb93
Fix kube events memory leak due to yaml serialization for > 5k events…
vishiy Nov 12, 2018
eff92df
Setting Timeout for HTTP Client in PostDataHelper in outoms go plugi…
r-dilip Nov 14, 2018
9893e36
Vishwa/perftelemetry 2 (#165)
vishiy Nov 16, 2018
4f3c898
environment variable fix (#166)
rashmichandrashekar Nov 27, 2018
5e16467
Fixing a bug where we were crashing due to container statuses not pre…
vishiy Nov 27, 2018
b482b1e
Updating title
vishiy Nov 29, 2018
d75ba89
updating right versions for last release
vishiy Nov 29, 2018
cbd815c
Updating the break condition to look for end of response (#168)
rashmichandrashekar Nov 29, 2018
d0d5bf7
updating AgentVersion for telemetry
vishiy Nov 29, 2018
bfe27e5
Updating readme for latest release changes
vishiy Nov 29, 2018
5677560
Merge branch 'ci_feature_prod' into ci_feature
vishiy Nov 29, 2018
a621f88
Changes - (#173)
vishiy Dec 17, 2018
c9cf4fd
Rashmi/kubenodeinventory (#174)
rashmichandrashekar Dec 17, 2018
df6f122
Get cpuusage from usageseconds (#175)
vishiy Dec 20, 2018
dac9931
Rashmi/kubenodeinventory (#176)
rashmichandrashekar Dec 21, 2018
04cc1a8
Rashmi/kubenodeinventory (#178)
rashmichandrashekar Dec 26, 2018
5883f53
Fixing an issue on the cpurate metric, which happens for the first ti…
vishiy Dec 26, 2018
191f328
Rashmi/kubenodeinventory (#180)
rashmichandrashekar Dec 28, 2018
7e52e8c
Exclude docker containers from container inventory (#181)
rashmichandrashekar Jan 7, 2019
f0591f9
Exclude pauseamd64 containers from container inventory (#182)
rashmichandrashekar Jan 8, 2019
99e8813
Merge branch 'ci_feature_prod' into ci_feature
vishiy Jan 9, 2019
4782435
Update agent version
vishiy Jan 9, 2019
23bcc41
Updating readme for the latest release
vishiy Jan 9, 2019
24 changes: 24 additions & 0 deletions README.md
@@ -8,7 +8,31 @@ information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeo
additional questions or comments.

## Release History

+ Note: The agent version(s) below contain dates (ciprod<mmddyyyy>) that indicate the agent build date, not the release date.
+
+ ### 10/09/2018 - Version microsoft/oms:ciprod01092019
+ - Omsagent - 1.8.1.256 (Nov 2018 release)
+ - Persist fluentbit state between container restarts
+ - Populate 'TimeOfCommand' for agent ingest time for container logs
+ - Get node cpu usage from cpuusagenanoseconds (and convert to cpuusagenanocores)
+ - Container Node Inventory - move to fluentD from OMI
+ - Mount docker.sock (Daemon set) as /var/run/host
+ - Liveness probe (Daemon set) - check for omsagent user permissions in docker.sock and update as necessary (required when docker daemon gets restarted)
+ - Move to fixed type for kubeevents & kubeservices
+ - Disable collecting ENV for our oms agent container (daemonset & replicaset)
+ - Disable container inventory collection for 'sandbox' containers & non kubernetes managed containers
+ - Agent telemetry - ContainerLogsAgentSideLatencyMs
+ - Agent telemetry - PodCount
+ - Agent telemetry - ControllerCount
+ - Agent telemetry - K8S Version
+ - Agent telemetry - NodeCoreCapacity
+ - Agent telemetry - NodeMemoryCapacity
+ - Agent telemetry - KubeEvents (exceptions)
+ - Agent telemetry - Kubenodes (exceptions)
+ - Agent telemetry - kubepods (exceptions)
+ - Agent telemetry - kubeservices (exceptions)
+ - Agent telemetry - Daemonset, Replicaset as dimensions (bug fix)

### 11/29/2018 - Version microsoft/oms:ciprod11292018
- Disable Container Image inventory workflow
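
The release note about deriving node CPU usage from cpuusagenanoseconds describes a counter-to-rate conversion. The plugin code that performs it is not among the hunks shown on this page, so the Go sketch below only illustrates the arithmetic. `cpuSample` and `nanoCores` are hypothetical names, and the input is assumed to be a cumulative CPU counter in nanoseconds read at two points in time.

```go
package main

import "fmt"

// cpuSample is a hypothetical pair of readings: a cumulative CPU usage
// counter in nanoseconds and the wall-clock time of the reading, also
// expressed in nanoseconds.
type cpuSample struct {
	usageNanoseconds uint64
	timestampNanos   int64
}

// nanoCores converts two cumulative samples into an average rate.
// It returns ok=false when no rate can be computed: no time has elapsed,
// or the counter went backwards (for example after a restart).
func nanoCores(prev, curr cpuSample) (rate float64, ok bool) {
	elapsed := curr.timestampNanos - prev.timestampNanos
	if elapsed <= 0 || curr.usageNanoseconds < prev.usageNanoseconds {
		return 0, false
	}
	usedNs := float64(curr.usageNanoseconds - prev.usageNanoseconds)
	// CPU-nanoseconds consumed per wall-clock second equals nanocores.
	return usedNs / (float64(elapsed) / 1e9), true
}

func main() {
	// 3 CPU-seconds consumed over a 60 second window.
	prev := cpuSample{usageNanoseconds: 1000000000, timestampNanos: 0}
	curr := cpuSample{usageNanoseconds: 4000000000, timestampNanos: 60000000000}
	if rate, ok := nanoCores(prev, curr); ok {
		fmt.Printf("%.0f nanocores\n", rate) // prints "50000000 nanocores"
	}
}
```

The `ok` flag matters for the very first reading, which has no predecessor; a caller following this sketch would skip emitting a rate in that case rather than report a misleading value.
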
23 changes: 0 additions & 23 deletions installer/conf/container.conf
@@ -15,16 +15,6 @@
log_level debug
</source>

- # Container host inventory
- <source>
- type omi
- run_interval 60s
- tag oms.api.ContainerNodeInventory
- items [
- ["root/cimv2","Container_HostInventory"]
- ]
- </source>
-
#cadvisor perf
<source>
type cadvisorperf
@@ -33,19 +23,6 @@
log_level debug
</source>

- <match oms.api.ContainerNodeInventory**>
- type out_oms_api
- log_level debug
- buffer_chunk_limit 20m
- buffer_type file
- buffer_path %STATE_DIR_WS%/out_oms_containernodeinventory*.buffer
- buffer_queue_limit 20
- flush_interval 20s
- retry_limit 10
- retry_wait 15s
- max_retry_wait 9m
- </match>
-
<match oms.containerinsights.containerinventory**>
type out_oms
log_level debug
32 changes: 23 additions & 9 deletions installer/conf/kube.conf
@@ -11,7 +11,7 @@
#Kubernetes events
<source>
type kubeevents
- tag oms.api.KubeEvents.CollectionTime
+ tag oms.containerinsights.KubeEvents
run_interval 60s
log_level debug
</source>
@@ -26,7 +26,7 @@
#Kubernetes services
<source>
type kubeservices
- tag oms.api.KubeServices.CollectionTime
+ tag oms.containerinsights.KubeServices
run_interval 60s
log_level debug
</source>
@@ -62,18 +62,19 @@
max_retry_wait 9m
</match>

- <match oms.api.KubeEvents**>
- type out_oms_api
+ <match oms.containerinsights.KubeEvents**>
+ type out_oms
log_level debug
- num_threads 5
+ num_threads 5
buffer_chunk_limit 5m
buffer_type file
- buffer_path %STATE_DIR_WS%/out_oms_api_kubeevents*.buffer
+ buffer_path %STATE_DIR_WS%/out_oms_kubeevents*.buffer
buffer_queue_limit 10
- buffer_queue_full_action drop_oldest_chunk
+ buffer_queue_full_action drop_oldest_chunk
flush_interval 20s
retry_limit 10
retry_wait 30s
+ max_retry_wait 9m
</match>

<match oms.api.KubeLogs**>
@@ -88,8 +89,8 @@
retry_wait 30s
</match>

- <match oms.api.KubeServices**>
- type out_oms_api
+ <match oms.containerinsights.KubeServices**>
+ type out_oms
log_level debug
num_threads 5
buffer_chunk_limit 20m
@@ -118,6 +119,19 @@
max_retry_wait 9m
</match>

+ <match oms.api.ContainerNodeInventory**>
+ type out_oms_api
+ log_level debug
+ buffer_chunk_limit 20m
+ buffer_type file
+ buffer_path %STATE_DIR_WS%/out_oms_containernodeinventory*.buffer
+ buffer_queue_limit 20
+ flush_interval 20s
+ retry_limit 10
+ retry_wait 15s
+ max_retry_wait 9m
+ </match>
+
<match oms.api.KubePerf**>
type out_oms
log_level debug
4 changes: 2 additions & 2 deletions installer/conf/td-agent-bit.conf
@@ -8,7 +8,7 @@
Name tail
Tag oms.container.log.*
Path /var/log/containers/*.log
- DB /var/opt/microsoft/docker-cimprov/state/fblogs.db
+ DB /var/log/omsagent-fblogs.db
Parser docker
Mem_Buf_Limit 30m
Path_Key filepath
@@ -28,5 +28,5 @@
EnableTelemetry true
TelemetryPushIntervalSeconds 300
Match oms.container.log.*
- AgentVersion ciprod11292018
+ AgentVersion ciprod01092019

47 changes: 36 additions & 11 deletions source/code/go/src/plugins/oms.go
@@ -77,9 +77,10 @@ var (

// DataItem represents the object corresponding to the json that is sent by fluentbit tail plugin
type DataItem struct {
- LogEntry string `json:"LogEntry"`
- LogEntrySource string `json:"LogEntrySource"`
- LogEntryTimeStamp string `json:"LogEntryTimeStamp"`
+ LogEntry string `json:"LogEntry"`
+ LogEntrySource string `json:"LogEntrySource"`
+ LogEntryTimeStamp string `json:"LogEntryTimeStamp"`
+ LogEntryTimeOfCommand string `json:"TimeOfCommand"`
ID string `json:"Id"`
Image string `json:"Image"`
Name string `json:"Name"`
@@ -204,6 +205,10 @@ func PostDataHelper(tailPluginRecords []map[interface{}]interface{}) int {

start := time.Now()
var dataItems []DataItem
+
+ var maxLatency float64
+ var maxLatencyContainer string
+
ignoreIDSet := make(map[string]bool)
imageIDMap := make(map[string]string)
nameIDMap := make(map[string]string)
@@ -248,18 +253,32 @@ func PostDataHelper(tailPluginRecords []map[interface{}]interface{}) int {
Log("ContainerId %s not present in Map ", containerID)
}

+
dataItem := DataItem{
- ID: stringMap["Id"],
- LogEntry: stringMap["LogEntry"],
- LogEntrySource: stringMap["LogEntrySource"],
- LogEntryTimeStamp: stringMap["LogEntryTimeStamp"],
- SourceSystem: stringMap["SourceSystem"],
- Computer: Computer,
- Image: stringMap["Image"],
- Name: stringMap["Name"],
+ ID: stringMap["Id"],
+ LogEntry: stringMap["LogEntry"],
+ LogEntrySource: stringMap["LogEntrySource"],
+ LogEntryTimeStamp: stringMap["LogEntryTimeStamp"],
+ LogEntryTimeOfCommand: start.Format(time.RFC3339),
+ SourceSystem: stringMap["SourceSystem"],
+ Computer: Computer,
+ Image: stringMap["Image"],
+ Name: stringMap["Name"],
}

dataItems = append(dataItems, dataItem)
+ loggedTime, e := time.Parse(time.RFC3339, dataItem.LogEntryTimeStamp)
+ if e != nil {
+ message := fmt.Sprintf("Error while converting LogEntryTimeStamp for telemetry purposes: %s", e.Error())
+ Log(message)
+ SendException(message)
+ } else {
+ ltncy := float64(start.Sub(loggedTime) / time.Millisecond)
+ if ltncy >= maxLatency {
+ maxLatency = ltncy
+ maxLatencyContainer = dataItem.Name + "=" + dataItem.ID
+ }
+ }
}

if len(dataItems) > 0 {
@@ -302,6 +321,12 @@ func PostDataHelper(tailPluginRecords []map[interface{}]interface{}) int {
ContainerLogTelemetryMutex.Lock()
FlushedRecordsCount += float64(numRecords)
FlushedRecordsTimeTaken += float64(elapsed / time.Millisecond)
+
+ if maxLatency >= AgentLogProcessingMaxLatencyMs {
+ AgentLogProcessingMaxLatencyMs = maxLatency
+ AgentLogProcessingMaxLatencyMsContainer = maxLatencyContainer
+ }
+
ContainerLogTelemetryMutex.Unlock()
}
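
The latency bookkeeping added to PostDataHelper can be read in isolation. The Go sketch below approximates it and is not the plugin's code: `logRecord` and `maxLatencyMs` are hypothetical names, timestamps are assumed to be RFC 3339 strings, and unparsable timestamps are counted here instead of being reported through SendException.

```go
package main

import (
	"fmt"
	"time"
)

// logRecord is a hypothetical, trimmed-down stand-in for DataItem.
type logRecord struct {
	Name      string
	ID        string
	TimeStamp string // RFC 3339, as in LogEntryTimeStamp
}

// maxLatencyMs returns the largest (now - TimeStamp) across records in
// milliseconds, a "name=id" label for the record that produced it, and
// the number of records skipped because their timestamp did not parse.
func maxLatencyMs(now string, records []logRecord) (float64, string, int, error) {
	start, err := time.Parse(time.RFC3339, now)
	if err != nil {
		return 0, "", 0, err
	}
	var maxMs float64
	var label string
	skipped := 0
	for _, r := range records {
		logged, err := time.Parse(time.RFC3339, r.TimeStamp)
		if err != nil {
			skipped++
			continue
		}
		// Convert both sides to float64 first so sub-millisecond parts survive.
		ms := float64(start.Sub(logged)) / float64(time.Millisecond)
		if ms >= maxMs {
			maxMs = ms
			label = r.Name + "=" + r.ID
		}
	}
	return maxMs, label, skipped, nil
}

func main() {
	records := []logRecord{
		{Name: "web", ID: "id-a", TimeStamp: "2019-01-09T10:00:04Z"},
		{Name: "db", ID: "id-b", TimeStamp: "2019-01-09T10:00:02.500Z"},
		{Name: "bad", ID: "id-c", TimeStamp: "not-a-time"},
	}
	ms, label, skipped, err := maxLatencyMs("2019-01-09T10:00:05Z", records)
	if err != nil {
		fmt.Println("bad reference time:", err)
		return
	}
	fmt.Println(ms, label, skipped)
}
```

One difference from the diff is deliberate: the diff divides one `time.Duration` by another before converting to `float64`, which truncates to whole milliseconds, whereas the sketch keeps the fractional part. Because the running maximum starts at zero, timestamps that lie in the future (negative latency) never become the maximum in either version.
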

13 changes: 13 additions & 0 deletions source/code/go/src/plugins/telemetry.go
@@ -17,6 +17,10 @@ var (
FlushedRecordsCount float64
// FlushedRecordsTimeTaken indicates the cumulative time taken to flush the records for the current period
FlushedRecordsTimeTaken float64
+ // AgentLogProcessingMaxLatencyMs indicates the maximum latency, in milliseconds, of the logs processed during the current period
+ AgentLogProcessingMaxLatencyMs float64
+ // AgentLogProcessingMaxLatencyMsContainer indicates the container whose logs had that maximum latency during the current period
+ AgentLogProcessingMaxLatencyMsContainer string
// CommonProperties indicates the dimensions that are sent with every event/metric
CommonProperties map[string]string
// TelemetryClient is the client used to send the telemetry
@@ -35,6 +39,8 @@ const (
envAppInsightsAuth = "APPLICATIONINSIGHTS_AUTH"
metricNameAvgFlushRate = "ContainerLogAvgRecordsFlushedPerSec"
metricNameAvgLogGenerationRate = "ContainerLogsGeneratedPerSec"
+ metricNameAgentLogProcessingMaxLatencyMs = "ContainerLogsAgentSideLatencyMs"
+
defaultTelemetryPushIntervalSeconds = 300

eventNameContainerLogInit = "ContainerLogPluginInitialized"
@@ -62,12 +68,19 @@ func SendContainerLogPluginMetrics(telemetryPushIntervalProperty string) {
logRate := FlushedRecordsCount / float64(elapsed/time.Second)
FlushedRecordsCount = 0.0
FlushedRecordsTimeTaken = 0.0
+ logLatencyMs := AgentLogProcessingMaxLatencyMs
+ logLatencyMsContainer := AgentLogProcessingMaxLatencyMsContainer
+ AgentLogProcessingMaxLatencyMs = 0
+ AgentLogProcessingMaxLatencyMsContainer = ""
ContainerLogTelemetryMutex.Unlock()

flushRateMetric := appinsights.NewMetricTelemetry(metricNameAvgFlushRate, flushRate)
TelemetryClient.Track(flushRateMetric)
logRateMetric := appinsights.NewMetricTelemetry(metricNameAvgLogGenerationRate, logRate)
TelemetryClient.Track(logRateMetric)
+ logLatencyMetric := appinsights.NewMetricTelemetry(metricNameAgentLogProcessingMaxLatencyMs, logLatencyMs)
+ logLatencyMetric.Properties["Container"] = logLatencyMsContainer
+ TelemetryClient.Track(logLatencyMetric)
start = time.Now()
}
}
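
The telemetry loop above copies the per-period maximum while holding ContainerLogTelemetryMutex, resets it, and only then sends the metric outside the lock. A minimal Go sketch of that snapshot-and-reset pattern follows. `latencyTracker` is a hypothetical type that bundles the two package-level variables with their mutex; the plugin itself keeps them as separate globals.

```go
package main

import (
	"fmt"
	"sync"
)

// latencyTracker is a hypothetical wrapper around the max-latency value
// and its container label, guarded by a single mutex.
type latencyTracker struct {
	mu        sync.Mutex
	maxMs     float64
	container string
}

// Observe records a latency if it is the largest seen in the current period.
func (t *latencyTracker) Observe(ms float64, container string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if ms >= t.maxMs {
		t.maxMs = ms
		t.container = container
	}
}

// Snapshot returns the current maximum and resets the tracker, so the
// next reporting period starts from zero.
func (t *latencyTracker) Snapshot() (float64, string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	ms, c := t.maxMs, t.container
	t.maxMs, t.container = 0, ""
	return ms, c
}

func main() {
	var t latencyTracker
	t.Observe(120, "web=abc")
	t.Observe(80, "db=def")

	ms, c := t.Snapshot()
	fmt.Println(ms, c) // first period: the larger of the two observations

	ms, c = t.Snapshot()
	fmt.Println(ms, c) // second period: nothing observed since the reset
}
```

Taking the copy and the reset under one lock acquisition means a flush that lands between them cannot be lost, and the comparatively slow network call to send the metric never blocks the flush path.
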
36 changes: 25 additions & 11 deletions source/code/plugin/ApplicationInsightsUtility.rb
@@ -13,12 +13,13 @@ class ApplicationInsightsUtility
@@Exception = 'ExceptionEvent'
@@AcsClusterType = 'ACS'
@@AksClusterType = 'AKS'
- @@DaemonsetControllerType = 'DaemonSet'
@OmsAdminFilePath = '/etc/opt/microsoft/omsagent/conf/omsadmin.conf'
@@EnvAcsResourceName = 'ACS_RESOURCE_NAME'
@@EnvAksRegion = 'AKS_REGION'
@@EnvAgentVersion = 'AGENT_VERSION'
@@EnvApplicationInsightsKey = 'APPLICATIONINSIGHTS_AUTH'
+ @@EnvControllerType = 'CONTROLLER_TYPE'
+
@@CustomProperties = {}
@@Tc = nil
@@hostName = (OMS::Common.get_hostname)
@@ -54,12 +55,11 @@ def initializeUtility()
@@CustomProperties["ClusterName"] = clusterName
@@CustomProperties["Region"] = ENV[@@EnvAksRegion]
end
- @@CustomProperties['ControllerType'] = @@DaemonsetControllerType
- dockerInfo = DockerApiClient.dockerInfo
- @@CustomProperties['DockerVersion'] = dockerInfo['Version']
- @@CustomProperties['DockerApiVersion'] = dockerInfo['ApiVersion']
+
+ getDockerInfo()
@@CustomProperties['WorkspaceID'] = getWorkspaceId
@@CustomProperties['AgentVersion'] = ENV[@@EnvAgentVersion]
+ @@CustomProperties['ControllerType'] = ENV[@@EnvControllerType]
encodedAppInsightsKey = ENV[@@EnvApplicationInsightsKey]
if !encodedAppInsightsKey.nil?
decodedAppInsightsKey = Base64.decode64(encodedAppInsightsKey)
@@ -70,6 +70,14 @@ def initializeUtility()
end
end

+ def getDockerInfo()
+ dockerInfo = DockerApiClient.dockerInfo
+ if (!dockerInfo.nil? && !dockerInfo.empty?)
+ @@CustomProperties['DockerVersion'] = dockerInfo['Version']
+ @@CustomProperties['DockerApiVersion'] = dockerInfo['ApiVersion']
+ end
+ end
+
def sendHeartBeatEvent(pluginName)
begin
eventName = pluginName + @@HeartBeat
@@ -83,7 +91,7 @@ def sendHeartBeatEvent(pluginName)
end
end

- def sendCustomEvent(pluginName, properties)
+ def sendCustomMetric(pluginName, properties)
begin
if !(@@Tc.nil?)
@@Tc.track_metric 'LastProcessedContainerInventoryCount', properties['ContainerCount'],
@@ -93,14 +101,16 @@ def sendCustomEvent(pluginName, properties)
$log.info("AppInsights Container Count Telemetry sent successfully")
end
rescue => errorStr
- $log.warn("Exception in AppInsightsUtility: sendCustomEvent - error: #{errorStr}")
+ $log.warn("Exception in AppInsightsUtility: sendCustomMetric - error: #{errorStr}")
end
end

def sendExceptionTelemetry(errorStr)
begin
if @@CustomProperties.empty? || @@CustomProperties.nil?
- initializeUtility
+ initializeUtility()
+ elsif @@CustomProperties['DockerVersion'].nil?
+ getDockerInfo()
end
if !(@@Tc.nil?)
@@Tc.track_exception errorStr , :properties => @@CustomProperties
@@ -116,11 +126,13 @@ def sendExceptionTelemetry(errorStr)
def sendTelemetry(pluginName, properties)
begin
if @@CustomProperties.empty? || @@CustomProperties.nil?
- initializeUtility
+ initializeUtility()
+ elsif @@CustomProperties['DockerVersion'].nil?
+ getDockerInfo()
end
@@CustomProperties['Computer'] = properties['Computer']
sendHeartBeatEvent(pluginName)
- sendCustomEvent(pluginName, properties)
+ sendCustomMetric(pluginName, properties)
rescue => errorStr
$log.warn("Exception in AppInsightsUtility: sendTelemetry - error: #{errorStr}")
end
@@ -134,7 +146,9 @@ def sendMetricTelemetry(metricName, metricValue, properties)
return
end
if @@CustomProperties.empty? || @@CustomProperties.nil?
- initializeUtility
+ initializeUtility()
+ elsif @@CustomProperties['DockerVersion'].nil?
+ getDockerInfo()
end
telemetryProps = {}
telemetryProps["Computer"] = @@hostName
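
The Ruby changes in this file guard against a missing Docker info response and retry the lookup on later telemetry calls until DockerVersion is populated. The same idea is sketched below in Go to keep the examples on this page in one language. `ensureDockerProps` and the `fetch` callback are hypothetical, and the version strings are made-up sample values, not output from a real Docker daemon.

```go
package main

import "fmt"

// ensureDockerProps fills in the Docker-related properties if they are
// still missing. It returns true once the properties are present.
// fetch is a hypothetical stand-in for a Docker info lookup that may
// return nil or an empty map while the daemon is unavailable.
func ensureDockerProps(props map[string]string, fetch func() map[string]string) bool {
	if _, ok := props["DockerVersion"]; ok {
		return true // already populated, no need to query again
	}
	info := fetch()
	if len(info) == 0 { // covers both nil and empty results
		return false
	}
	props["DockerVersion"] = info["Version"]
	props["DockerApiVersion"] = info["ApiVersion"]
	return true
}

func main() {
	props := map[string]string{}
	calls := 0
	// Simulated lookup: fails on the first call, succeeds afterwards.
	fetch := func() map[string]string {
		calls++
		if calls == 1 {
			return nil
		}
		return map[string]string{"Version": "18.09", "ApiVersion": "1.39"}
	}

	fmt.Println(ensureDockerProps(props, fetch)) // lookup failed, nothing stored
	fmt.Println(ensureDockerProps(props, fetch)) // retry succeeded
	fmt.Println(props["DockerVersion"], calls)
}
```

Checking for the stored key before calling `fetch` is what makes the retry cheap: once the properties exist, later telemetry calls skip the lookup entirely.
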