Cleanup Coordinator logs, add duty status API #16959

Merged
kfaraz merged 9 commits into apache:master from kfaraz:cleanup_coord_logs on Sep 24, 2024

kfaraz commented Aug 25, 2024

Description

Coordinator logs are fairly noisy and don't give much useful information (see example below).
Even when the Coordinator misbehaves, these logs are not very useful.

Main changes

  • Add API GET /druid/coordinator/v1/duties that returns a status list of all duty groups currently running on the Coordinator
  • Emit metrics segment/poll/time, segment/pollWithSchema/time, segment/buildSnapshot/time
  • Remove redundant logs that indicate normal operation of well-tested aspects of the Coordinator
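To watch the new metrics outside of Druid, one option is to filter them out of an emitter stream. The sketch below is illustrative only and not part of the patch: it assumes events arrive as one JSON object per line with `metric` and `value` fields, and the function name `average_new_metrics` is hypothetical.

```python
import json
from collections import defaultdict

# The three metric names introduced by this PR.
NEW_METRICS = {
    "segment/poll/time",
    "segment/pollWithSchema/time",
    "segment/buildSnapshot/time",
}


def average_new_metrics(json_lines):
    """Average the value of each new metric over an iterable of JSON event lines.

    Assumes each non-empty line is a JSON object with "metric" and "value" keys.
    """
    totals = defaultdict(lambda: [0.0, 0])  # metric name -> [sum, count]
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        name = event.get("metric")
        if name in NEW_METRICS:
            totals[name][0] += float(event["value"])
            totals[name][1] += 1
    return {name: total / count for name, (total, count) in totals.items()}
```

Events for any other metric are ignored, so the same function can be pointed at a full emitter log.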

Refactors

  • Move some logic from DutiesRunnable to CoordinatorDutyGroup
  • Move stats collection from CollectSegmentAndServerStats to PrepareBalancerAndLoadQueues
  • Minor cleanup of class DruidCoordinator
  • Clean up class DruidCoordinatorRuntimeParams
    • Remove field coordinatorStartTime. Maintain start time in MarkOvershadowedSegmentsAsUnused instead.
    • Remove field MetadataRuleManager. Pass supplier to constructor of applicable duties instead.
    • Make usedSegmentsNewestFirst and datasourcesSnapshot non-nullable, as they are always required.

API details

GET /druid/coordinator/v1/duties

{
    "dutyGroups": [
        {
            "name": "HistoricalManagementDuties",
            "period": "PT0.100S",
            "dutyNames": [
                "org.apache.druid.server.coordinator.duty.PrepareBalancerAndLoadQueues",
                "org.apache.druid.server.coordinator.duty.RunRules",
                "org.apache.druid.server.coordinator.DruidCoordinator$UpdateReplicationStatus",
                "org.apache.druid.server.coordinator.DruidCoordinator$CollectSegmentStats",
                "org.apache.druid.server.coordinator.duty.UnloadUnusedSegments",
                "org.apache.druid.server.coordinator.duty.MarkOvershadowedSegmentsAsUnused",
                "org.apache.druid.server.coordinator.duty.MarkEternityTombstonesAsUnused",
                "org.apache.druid.server.coordinator.duty.BalanceSegments",
                "org.apache.druid.server.coordinator.DruidCoordinator$CollectLoadQueueStats"
            ],
            "lastRunStart": "2024-08-25T14:20:30.976Z",
            "lastRunEnd": "2024-08-25T14:20:30.976Z",
            "avgRuntimeMillis": 10,
            "avgRunGapMillis": 500
        },
        {
            "name": "IndexingServiceDuties",
            "period": "PT1800S",
            "dutyNames": [
                "org.apache.druid.server.coordinator.duty.KillStalePendingSegments",
                "org.apache.druid.server.coordinator.duty.CompactSegments"
            ],
            "lastRunStart": "2024-08-25T14:20:30.976Z",
            "lastRunEnd": "2024-08-25T14:20:30.976Z",
            "avgRuntimeMillis": 100,
            "avgRunGapMillis": 1000
        },
        {
            "name": "MetadataStoreManagementDuties",
            "period": "PT3600S",
            "dutyNames": [
                "org.apache.druid.server.coordinator.duty.KillSupervisors",
                "org.apache.druid.server.coordinator.duty.KillAuditLog",
                "org.apache.druid.server.coordinator.duty.KillRules",
                "org.apache.druid.server.coordinator.duty.KillDatasourceMetadata",
                "org.apache.druid.server.coordinator.duty.KillCompactionConfig"
            ],
            "lastRunStart": "2024-08-25T14:20:30.976Z",
            "lastRunEnd": "2024-08-25T14:20:32.976Z",
            "avgRuntimeMillis": 0,
            "avgRunGapMillis": 0
        }
    ]
}
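
A client could poll this endpoint and flag duty groups that run less often than their configured period. The following is a minimal sketch, not part of the patch. The Coordinator address `http://localhost:8081`, the helper names, and the `slack` threshold are assumptions. It also assumes that `avgRunGapMillis` is the average time between consecutive runs, and the period parser only handles the seconds-only `PT...S` form shown in the sample response.

```python
import json
import re
from urllib.request import urlopen

# Assumed Coordinator address; adjust for your cluster.
COORDINATOR_URL = "http://localhost:8081"


def fetch_duty_status(base_url=COORDINATOR_URL):
    """GET the duty status API added by this PR and return the parsed JSON."""
    with urlopen(base_url + "/druid/coordinator/v1/duties") as response:
        return json.load(response)


def period_to_millis(period):
    """Convert a seconds-only ISO-8601 duration such as 'PT0.100S' to milliseconds."""
    match = re.fullmatch(r"PT(\d+(?:\.\d+)?)S", period)
    if match is None:
        raise ValueError("unsupported period format: " + period)
    return round(float(match.group(1)) * 1000)


def find_lagging_groups(status, slack=2.0):
    """Return names of duty groups whose average run gap exceeds slack * period."""
    lagging = []
    for group in status["dutyGroups"]:
        period_millis = period_to_millis(group["period"])
        if group["avgRunGapMillis"] > slack * period_millis:
            lagging.append(group["name"])
    return lagging
```

Calling `find_lagging_groups(fetch_duty_status())` against a live Coordinator would return the names of any groups running slower than twice their period. The threshold is arbitrary and should be tuned per cluster.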

Logs

Coordinator logs on becoming leader, after this patch

2024-09-02T09:17:36,447 INFO [LeaderSelector[/druid/coordinator/_COORDINATOR]] org.apache.druid.server.coordinator.DruidCoordinator - I am the leader of the coordinators, all must bow! Starting coordination in [PT10S].
2024-09-02T09:17:36,569 INFO [LeaderSelector[/druid/coordinator/_COORDINATOR]] 2024-09-02T09:17:36,583 INFO [LeaderSelector[/druid/coordinator/_COORDINATOR]] org.apache.druid.server.coordinator.duty.CoordinatorDutyGroup - Created dutyGroup[HistoricalManagementDuties] with period[PT5S] and duties[[org.apache.druid.server.coordinator.duty.PrepareBalancerAndLoadQueues, org.apache.druid.server.coordinator.duty.RunRules, org.apache.druid.server.coordinator.DruidCoordinator$UpdateReplicationStatus, org.apache.druid.server.coordinator.DruidCoordinator$CollectSegmentStats, org.apache.druid.server.coordinator.duty.UnloadUnusedSegments, org.apache.druid.server.coordinator.duty.MarkOvershadowedSegmentsAsUnused, org.apache.druid.server.coordinator.duty.MarkEternityTombstonesAsUnused, org.apache.druid.server.coordinator.duty.BalanceSegments, org.apache.druid.server.coordinator.DruidCoordinator$CollectLoadQueueStats]].
2024-09-02T09:17:36,861 INFO [LeaderSelector[/druid/coordinator/_COORDINATOR]] org.apache.druid.server.coordinator.duty.CoordinatorDutyGroup - Created dutyGroup[IndexingServiceDuties] with period[PT1800S] and duties[[org.apache.druid.server.coordinator.duty.KillStalePendingSegments, org.apache.druid.server.coordinator.duty.CompactSegments]].
2024-09-02T09:17:36,864 INFO [LeaderSelector[/druid/coordinator/_COORDINATOR]] org.apache.druid.server.coordinator.duty.CoordinatorDutyGroup - Created dutyGroup[MetadataStoreManagementDuties] with period[PT3600S] and duties[[org.apache.druid.server.coordinator.duty.KillSupervisors, org.apache.druid.server.coordinator.duty.KillAuditLog, org.apache.druid.server.coordinator.duty.KillRules, org.apache.druid.server.coordinator.duty.KillDatasourceMetadata, org.apache.druid.server.coordinator.duty.KillCompactionConfig]].
2024-09-02T09:17:36,864 WARN [LeaderSelector[/druid/coordinator/_COORDINATOR]] org.apache.druid.server.coordinator.DruidCoordinator - Created [3] duty groups. DUTY RUNS WILL NOT BE LOGGED. Use API '/druid/coordinator/v1/duties' to get current status.

Duty logs for 5 minutes before this patch: ~1000 lines

before_5_minute.log

Duty logs for 5 minutes after this patch: ~80 lines

after_5_minute.log
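
To see which classes account for the difference in volume, the lines of each log can be counted per logger. This is a rough sketch rather than tooling from the patch: the regular expression assumes the `timestamp LEVEL [thread] logger - message` layout of the sample lines above, and the function name `count_lines_by_logger` is hypothetical.

```python
import re
from collections import Counter

# Matches: "<timestamp> <LEVEL> [<thread>] <logger> - <message>"
# The thread name may itself contain brackets, so it is matched lazily
# up to the first "] " that is followed by a logger name and " - ".
LOG_LINE = re.compile(r"^(\S+) (\w+) \[(.*?)\] (\S+) - (.*)$")


def count_lines_by_logger(lines):
    """Count log lines per logger name, skipping lines that do not match the layout."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match:
            counts[match.group(4)] += 1
    return counts
```

With local copies of the two attachments, `count_lines_by_logger(open("before_5_minute.log"))` and the same call on `after_5_minute.log` could be compared using `Counter.most_common()`.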

Release notes

  • Remove noise from Coordinator logs
  • Add new Coordinator API to check current status of duties
  • Emit new metrics segment/poll/time, segment/pollWithSchema/time and segment/buildSnapshot/time

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@AmatyaAvadhanula AmatyaAvadhanula left a comment

Thanks, @kfaraz! LGTM
The logs look much better, and the API is nice.

@kfaraz kfaraz added this to the 31.0.0 milestone Sep 24, 2024
@kfaraz kfaraz merged commit 9670305 into apache:master Sep 24, 2024
@kfaraz kfaraz deleted the cleanup_coord_logs branch September 24, 2024 14:16

kfaraz commented Sep 24, 2024

Thanks for the reviews, @AmatyaAvadhanula, @abhishekagarwal87!!

kfaraz added a commit to kfaraz/druid that referenced this pull request Sep 25, 2024
abhishekagarwal87 pushed a commit that referenced this pull request Sep 25, 2024