Skip to content

[networks] Switch back to stream-based API for TCP_CLOSE#9369

Merged
p-lambert merged 5 commits intomainfrom
pedro/closed-connections
Oct 4, 2021
Merged

[networks] Switch back to stream-based API for TCP_CLOSE#9369
p-lambert merged 5 commits intomainfrom
pedro/closed-connections

Conversation

@p-lambert
Copy link
Copy Markdown
Member

@p-lambert p-lambert commented Oct 4, 2021

What does this PR do?

Switch back to stream-based API for consuming TCP closed events.
In a previous PR we changed (among other things) the buffering approach for TCP closed events and moved away from a channel based approach.
The changes were deployed in our load-test environment and we thought there was a noticeable improvement in bytes allocated/s and CPU that was, in part, attributable to this specific change.
However, I failed to notice that a much smaller number of connections was being reported. The issue is that we were enforcing a hard-limit on the number of closed events before they are aggregated (aggregation of TCP close events can happen in the context of high connection churn, as the source port of an ephemeral connection will likely be re-used over a 30s time-frame).
After I identified the issue, I temporarily disabled the limit and the change turned out to result a RSS 20% higher than before.

After putting some thought on the problem I realize there are to fundamental issues with the ConnectionBuffer-based approach:

  • We have to buffer the same Connection twice in memory (one in the tcpCloseConsumer and one in the Tracer;
  • The buffering happens before aggregation, and unless we bring the pre-aggregation close to the tcpCloseConsumer, this results in a much higher number of ConnectionStats objects.

This PR reverts some parts of #9206 while keeping some of the memory allocation improvements. The main changes compared to 7.31 are:

  • The []ConnectionStats slice where closed connections are stored inside network.State is re-used;
  • The tcpCloseConsumer uses a pooled connection.ClosedBatch object;
  • Instead of sending single *ConnectionStats through the channels and calling state.StoreClosedConnection for each connection, we do all operations in batches instead;

Motivation

Fix a bug recently introduced (and not shipped) that results in data-loss while preserving some of the perfomance gain ideas.

Additional Notes

Anything else we should know when reviewing?

Describe how to test your changes

Deploy previous agent-release (7.31) to load test cluster and validate the number of emitted connections is in the same ballpark.

Checklist

  • A release note has been added or the changelog/no-changelog label has been applied.
  • The need-change/operator and need-change/helm labels has been applied if applicable.
  • The appropriate team/.. label has been applied, if known.
  • If known, an appropriate milestone has been selected; otherwise the Triage milestone is set.
  • The config template has been updated if applicable.

Note: Adding GitHub labels is only possible for contributors with write access.

@p-lambert p-lambert force-pushed the pedro/closed-connections branch from 4e7e7bf to b46f802 Compare October 4, 2021 15:49
@p-lambert p-lambert marked this pull request as ready for review October 4, 2021 15:51
@p-lambert p-lambert requested a review from a team as a code owner October 4, 2021 15:51
Comment thread pkg/network/buffer.go Outdated
@p-lambert p-lambert force-pushed the pedro/closed-connections branch from 58e09e3 to 9257051 Compare October 4, 2021 17:12
Comment thread pkg/network/tracer/connection/kprobe/tcp_close_consumer.go
Copy link
Copy Markdown
Member

@brycekahle brycekahle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anything we can do to verify these changes have the intended effect and aren't a regression in other ways, would be appreciated.

@p-lambert p-lambert merged commit 1445490 into main Oct 4, 2021
@p-lambert p-lambert deleted the pedro/closed-connections branch October 4, 2021 21:48
zandrewitte pushed a commit to StackVista/stackstate-agent that referenced this pull request Nov 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants