getHostStatus needs thread local storage #7375

@c-taylor

Description

As parents in a topology are marked down under load, 'HostStatus::getHostStatus' can cause excessive lock contention, resulting in high system time, reduced output and gaps in stats.

During failure testing, overloading the configured parents causes lock contention on the stats storage.
It was possible to consume almost all ET_NET thread time with a few failing parents and fewer than 5,000 RPS.
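
To illustrate the kind of change the title asks for, here is a minimal C++ sketch of a per-thread cache in front of a lock-protected status table. It is not the Traffic Server implementation: 'SharedHostStatus', 'get_locked', 'get_cached', the 'Status' enum and the generation counter are all hypothetical names, and treating an unknown host as up is an assumption made only for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a shared host status store (not the ATS class).
enum class Status { Up, Down };

class SharedHostStatus {
public:
  // Writers are rare (a parent is marked up or down), so a mutex is fine here.
  void set(const std::string &host, Status s) {
    assert(!host.empty());
    std::lock_guard<std::mutex> guard(mutex_);
    table_[host] = s;
    // Bump the generation so readers know their cached copy is stale.
    generation_.fetch_add(1, std::memory_order_release);
  }

  // Contended path: every lookup takes the lock.
  Status get_locked(const std::string &host) const {
    std::lock_guard<std::mutex> guard(mutex_);
    auto it = table_.find(host);
    return it == table_.end() ? Status::Up : it->second; // assumption: unknown means up
  }

  // Lock-free check used by readers to detect changes.
  std::uint64_t generation() const {
    return generation_.load(std::memory_order_acquire);
  }

  // Copy the whole table under the lock, together with its generation.
  std::unordered_map<std::string, Status> snapshot(std::uint64_t &gen) const {
    std::lock_guard<std::mutex> guard(mutex_);
    gen = generation_.load(std::memory_order_relaxed);
    return table_;
  }

private:
  mutable std::mutex mutex_;
  std::unordered_map<std::string, Status> table_;
  std::atomic<std::uint64_t> generation_{0};
};

// Fast path: each thread reads its own copy and only takes the lock
// when the shared generation has moved since the last refresh.
Status get_cached(const SharedHostStatus &shared, const std::string &host) {
  thread_local std::unordered_map<std::string, Status> cache;
  thread_local std::uint64_t cached_gen = 0;
  thread_local const SharedHostStatus *cached_src = nullptr;

  if (cached_src != &shared || cached_gen != shared.generation()) {
    cache = shared.snapshot(cached_gen);
    cached_src = &shared;
  }
  auto it = cache.find(host);
  return it == cache.end() ? Status::Up : it->second; // assumption: unknown means up
}
```

With this shape, a thread that calls 'get_cached' repeatedly while no status changes only performs an atomic load and a local map lookup. The trade-off is that a reader may briefly see a stale status while a writer is mid-update, which may or may not be acceptable for parent selection.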

Fault replication

Increase load through an edge -> parent configuration until the parents start to fail.
I used connection limits as the failure trigger because they fail predictably.

Observations

As parents fail, contention in 'HostStatus::getHostStatus' increases, especially when the last parent fails.
This reduces all 'good' work, including returning errors to clients and serving content already in cache.

  1. perf traces and flame graphs show near 100% system time spent on lock activity.

     [Image: getHostStaus_crop]

  2. traffic_server metrics stop updating.
  3. Response and data rates drop.
