Skip to content

Dual shipping v2 + Refactor and optimize HTTP compression#10162

Merged
gh123man merged 20 commits intomainfrom
brian/dual-ship-v2
Dec 23, 2021
Merged

Dual shipping v2 + Refactor and optimize HTTP compression#10162
gh123man merged 20 commits intomainfrom
brian/dual-ship-v2

Conversation

@gh123man
Copy link
Copy Markdown
Member

@gh123man gh123man commented Dec 8, 2021

What does this PR do?

  • Allow any number of reliable additional endpoints (previously we only supported 2)
  • Http gzip compression happens once for all destinations (big performance improvement when using multiple endpoints)

Measured improvements:

  • ~10% maximum throughput (bytes/sec) improvement for HTTP endpoints (regardless of configuration)
  • Any additional HTTP endpoints have very little CPU overhead since compression is now done once for all destinations

Summary of changes

To make review easier since this is a large PR, here are the highlights/major areas to look at:

Overview

The pipeline has rearranged:

OLD

Collector -> Decoder -> Processor -> Sender -> Auditor

NEW

Collector -> Decoder -> Processor -> Strategy -> Sender -> Destination -> Auditor 

Here is an informal diagram of the data flow before and after moving things around:

logs-refactor drawio (5)

The sender used to contain the strategy and the destinations. It managed the actual process of encoding and sending logs which was mixed between those 3 pieces. Now they are broken apart with the following responsibilities:

  • Strategy (batch/stream) deals with converting Messages into Payloads as well as encoding the payload (gzip for HTTP)
  • Sender deals only with taking payloads from the strategy and sending them to one or more destinations. This is where dual shipping and additional endpoints are managed.
  • Destinations now have their own read loop and write to the auditor input channel directly (instead of being brokered through the sender).

Notable changes

  • The concept of Main vs additional endpoints is largely removed. There are now only Endpoints that are either is_reliable or not. Reliable endpoints are all treated the same and can block the pipeline.
  • gzip encoding has moved from the HTTP destination into the batch strategy. This is an optimization so that we only have to zip payloads once when we have multiple endpoints or are using concurrency.
  • Concurrent sending of HTTP payloads has been isolated into the HTTP destination. It previously was duplicated in batch_strategy and the http destination to handle async vs non-async sending. (this was required first in order to move gzip compression)
  • Retry logic for both tcp and HTTP have now been isolated in the destination implementations (previously was handled in the sender via errors from the destinations)
  • SendAsync is removed from the destination as now destinations can have retrying toggled. Removes a lot of duplicate code and now all endpoints follow the same codepath.

Config changes

  • Additional endpoints will now use the same compression settings as the main endpoint. This could be a breaking change for users who relied on custom compression settings (though unlikely).
  • Additional endpoints will now use the same retry/backoff settings as the main endpoint. Retries were not previously supported so this is not breaking.

Motivation

Additional Notes

Possible Drawbacks / Trade-offs

Describe how to test your changes

Non Agent-core teams

  • please do basic sanity tests to ensure events sent through the pipeline are working as expected

Agent core QA

Local QA

  1. start 2 mock intakes:
docker run --rm -it --name mockhttp1 -p 9998:9998 --rm datadog/agent-dev:bench-logs-intake-dev -b :9998 -d
docker run --rm -it --name mockhttp2 -p 9999:9999 --rm datadog/agent-dev:bench-logs-intake-dev -b :9999 -d

the -d flag puts them in debug mode so they will stream logs to stdout

  1. start an agent:
docker run -d --name dd-agent \
    --net=host \
    -e DD_API_KEY="<API_KEY>"\
    -e DD_LOGS_ENABLED=true \
    -e DD_LOGS_CONFIG_LOGS_NO_SSL=true \
    -e DD_LOGS_CONFIG_USE_HTTP=true \
    -e DD_LOGS_CONFIG_LOGS_DD_URL="localhost:9998" \
    -e DD_LOGS_CONFIG_ADDITIONAL_ENDPOINTS="[{\"api_key\": \"<API_KEY>\", \"Host\": \"localhost\", \"Port\": 9999, \"is_reliable\": true}]" \
    -v /var/run/docker.sock:/var/run/docker.sock:ro \
    -v /proc/:/host/proc/:ro \
    -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
    -v /var/lib/docker/containers:/var/lib/docker/containers:ro \
   <RC_IMAGE>
  1. get some logs flowing through the agent. Warning if you use DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL you will create a feedback loop from the mock intake containers. I'd recommend spawning a dedicated logging container.

  2. Watch the logs flow in both intakes.

  3. Kill one of the intakes

  • make sure logs are still flowing to the other one
  1. Kill the second intake (both are now dead)
  • check agent status to see that # of bytes read is not increasing (pipeline should be blocked)
  1. Restart one of the intakes - logs should start flowing
  2. restart the other intake - logs should start flowing.

App QA

Now instead of using mock intakes - use two real Datadog intakes. You will need 2 API keys each from different orgs.

docker run -d --name dd-agent \
    --net=host \
    -e DD_API_KEY="<API_KEY>"\
    -e DD_LOGS_ENABLED=true \
    -e DD_LOGS_CONFIG_USE_HTTP=true \
    -e DD_LOGS_CONFIG_ADDITIONAL_ENDPOINTS="[{\"api_key\": \"<ANOTHER_API_KEY>\", \"Host\": \"agent-intake.logs.datadoghq.com\", \"Port\": 10516, \"is_reliable\": true}]" \
    -v /var/run/docker.sock:/var/run/docker.sock:ro \
    -v /proc/:/host/proc/:ro \
    -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
    -v /var/lib/docker/containers:/var/lib/docker/containers:ro \
   <RC_IMAGE>

Stream some logs and watch the livetail in both orgs for the logs.

Further testing

For extra validation you can:

  • run more than 2 destinations
  • mix reliable and unreliable destinations
  • make a destination that produces random failures

Reviewer's Checklist

  • If known, an appropriate milestone has been selected; otherwise the Triage milestone is set.
  • The appropriate team/.. label has been applied, if known.
  • A release note has been added or the changelog/no-changelog label has been applied.
  • Changed code has automated tests for its functionality.
  • Adequate QA/testing plan information is provided if the qa/skip-qa label is not applied.
  • If applicable, docs team has been notified or an issue has been opened on the documentation repo.
  • If applicable, the need-change/operator and need-change/helm labels have been applied.
  • If applicable, the config template has been updated.

@gh123man gh123man added this to the 7.34.0 milestone Dec 8, 2021
@gh123man gh123man added the [deprecated] team/agent-core Deprecated. Use metrics-logs / shared-components labels instead.. label Dec 8, 2021
@gh123man gh123man marked this pull request as ready for review December 8, 2021 22:23
@gh123man gh123man requested a review from a team as a code owner December 8, 2021 22:23
@gh123man gh123man requested a review from a team as a code owner December 9, 2021 17:51
Copy link
Copy Markdown
Contributor

@ruthnaebeck ruthnaebeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs review

Comment thread releasenotes/notes/dual_ship_logs-2b5572813045e40f.yaml Outdated
Copy link
Copy Markdown
Contributor

@djmitche djmitche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great PR description and diagrams -- they really helped me to understand this from a high level. Let's make sure that information is captured in the codebase itself, for everyone who reads this code after it's landed!

I have a bunch of notes below, but I think the big "request changes" bit is the races in the sender.

Comment thread pkg/logs/message/message.go
Comment thread pkg/logs/sender/strategy.go
Comment thread pkg/logs/auditor/auditor.go
Comment thread pkg/logs/client/destinations.go Outdated
Comment thread pkg/logs/agent.go
Comment thread pkg/logs/sender/sender.go Outdated
Comment thread pkg/logs/sender/sender.go Outdated
Comment thread pkg/logs/sender/sender.go Outdated
Comment thread pkg/logs/sender/sender.go
Comment thread pkg/logs/sender/stream_strategy.go Outdated
Copy link
Copy Markdown
Contributor

@sgnn7 sgnn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gh123man Great improvements all around! Left a number of comments but they're either nits or questions in general.

Comment thread pkg/epforwarder/epforwarder.go Outdated
Comment thread pkg/epforwarder/epforwarder.go Outdated
Comment thread pkg/epforwarder/epforwarder.go
Comment thread pkg/logs/agent_test.go Outdated
Comment thread pkg/logs/agent_test.go Outdated
Comment thread pkg/logs/config/endpoints.go Outdated
Comment thread pkg/logs/pipeline/pipeline.go Outdated
Comment thread pkg/logs/pipeline/pipeline.go
Comment thread pkg/logs/pipeline/pipeline.go Outdated
Comment thread pkg/logs/sender/batch_strategy.go
Comment thread pkg/logs/client/destination.go Outdated
Copy link
Copy Markdown
Contributor

@ruthnaebeck ruthnaebeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs review

Comment thread pkg/logs/README.md Outdated
Comment thread pkg/logs/README.md Outdated
Comment thread pkg/logs/README.md Outdated
Comment thread pkg/logs/README.md Outdated
Comment thread pkg/logs/README.md Outdated
Comment thread pkg/logs/README.md Outdated
Co-authored-by: ruthnaebeck <19349244+ruthnaebeck@users.noreply.github.com>
@djmitche djmitche self-requested a review December 22, 2021 15:06
Copy link
Copy Markdown
Contributor

@djmitche djmitche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fixes to the sender!

Copy link
Copy Markdown
Contributor

@ruthnaebeck ruthnaebeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for docs

Copy link
Copy Markdown
Contributor

@sgnn7 sgnn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gh123man Just a few nits but LGTM!

Comment thread pkg/logs/client/tcp/destination.go
Comment thread pkg/logs/sender/batch_strategy.go Outdated
Comment thread pkg/logs/sender/sender.go Outdated
@gh123man gh123man merged commit 7f84965 into main Dec 23, 2021
@gh123man gh123man deleted the brian/dual-ship-v2 branch December 23, 2021 20:03
@olivielpeau olivielpeau added the major_change Complex/large change, which significantly modifies agent behavior or could impact many agent teams label Jan 4, 2022
gh123man added a commit that referenced this pull request May 10, 2022
Enable reliable dual shipping by default. Reliable dual shipping is a better default behavior with little performance overhead.
See #10162 for details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

[deprecated] team/agent-core Deprecated. Use metrics-logs / shared-components labels instead.. major_change Complex/large change, which significantly modifies agent behavior or could impact many agent teams team/agent-security team/database-monitoring team/network-device-monitoring

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants