[PROPOSAL] API Endpoint for Supervisor Errors #7217

@justinborromeo

Motivation (see #6571)

Currently, there's a status API endpoint that allows users to retrieve the status of a supervisor and its tasks. However, when errors occur, users have no way of determining which exceptions are being thrown without digging into log messages. If there were an API endpoint that returned the error-level exceptions for a specific supervisor, diagnosing issues would be much easier.

Analysis of Options

The two design decisions that need to be made are how the exceptions will be stored and the design of the API.

Storage

  1. In-Memory: Write each logged exception to an on-heap circular buffer on the leading Overlord, with a configurable maximum number of elements.
  2. Database (use metadata store): Write every logged exception to some sort of system table. Periodically run a task to purge old events.
  3. Druid? It would be neat to be able to run Druid queries (especially SQL) on log data.

API

  1. Return all the stored exceptions (the number stored is configurable)
  2. Return the last m types of exceptions with the timestamps of their last n occurrences
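
To make the second API option concrete, here is a minimal sketch of the bookkeeping it would need: the last m exception types, each with the timestamps of its last n occurrences. The class name `RecentErrorTypes` and its methods are hypothetical and not part of Druid; the sketch assumes exception types are identified by a string such as the exception class name.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a Druid class): last m exception types, last n timestamps each. */
final class RecentErrorTypes {
    private final int maxTimestampsPerType; // n
    private final Map<String, Deque<Long>> byType;

    RecentErrorTypes(final int maxTypes, int maxTimestampsPerType) {
        this.maxTimestampsPerType = maxTimestampsPerType;
        // accessOrder=true orders entries from least to most recently touched,
        // and removeEldestEntry evicts the stalest type once more than m are held.
        this.byType = new LinkedHashMap<String, Deque<Long>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Deque<Long>> eldest) {
                return size() > maxTypes;
            }
        };
    }

    synchronized void record(String exceptionType, long timestampMillis) {
        Deque<Long> timestamps = byType.get(exceptionType); // get() marks the type as recently seen
        if (timestamps == null) {
            timestamps = new ArrayDeque<>();
            byType.put(exceptionType, timestamps); // may evict the least recently seen type
        }
        if (timestamps.size() == maxTimestampsPerType) {
            timestamps.removeFirst(); // drop the oldest occurrence of this type
        }
        timestamps.addLast(timestampMillis);
    }

    /** Returns a copy, ordered from least to most recently seen exception type. */
    synchronized Map<String, List<Long>> snapshot() {
        Map<String, List<Long>> copy = new LinkedHashMap<>();
        for (Map.Entry<String, Deque<Long>> e : byType.entrySet()) {
            copy.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        return copy;
    }
}
```

Compared with returning every stored exception, this shape keeps a noisy, repeating exception from crowding out rarer ones, at the cost of more bookkeeping.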

Proposed Changes and Rationale

For the sake of simplicity, I propose using the in-memory storage approach. Its one disadvantage is that, unlike the database approach, it doesn't let users perform post-mortems through the API if the Overlord goes down, because the buffer is lost along with the process. This likely won't be a significant disadvantage because the logs can still be analyzed in the event of Overlord failure.

Added Configs:
druid.kafka.ingestion.numLoggedErrorsStoredPerSupervisor: The number of error log messages to store in memory per supervisor. Config value is an int.
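
As an illustration of the in-memory approach, the following is a minimal sketch of a fixed-capacity circular buffer that overwrites its oldest element once full. `BoundedErrorBuffer` is a hypothetical stand-in written for this example, not Druid's own buffer class; its capacity would correspond to the value of the config above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a circular buffer; capacity maps to the per-supervisor config value. */
final class BoundedErrorBuffer<E> {
    private final Object[] slots;
    private int next; // index the next element will be written to
    private int size; // number of elements currently stored

    BoundedErrorBuffer(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.slots = new Object[capacity];
    }

    /** Adds an element, overwriting the oldest one when the buffer is full. */
    synchronized void add(E element) {
        slots[next] = element;
        next = (next + 1) % slots.length;
        if (size < slots.length) {
            size++;
        }
    }

    /** Returns the stored elements from oldest to newest. */
    @SuppressWarnings("unchecked")
    synchronized List<E> snapshot() {
        List<E> out = new ArrayList<>(size);
        int start = (next - size + slots.length) % slots.length;
        for (int i = 0; i < size; i++) {
            out.add((E) slots[(start + i) % slots.length]);
        }
        return out;
    }
}
```

Memory use is bounded by the capacity times the number of supervisors, so the config value directly caps the heap overhead on the Overlord.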

Also for the sake of simplicity, the following API endpoints will return the stored error log messages:

Get errors from a specific supervisor: GET /druid/indexer/v1/supervisor/<supervisorId>/errors

{
  "supervisorId":_______________,
  "errors":[
    {
      "timestamp":_______________,
      "errorMessage":_______________,
      "supervisorState":________________
    },
    {
      "timestamp":_______________,
      "errorMessage":_______________,
      "supervisorState":________________
    }
  ]
}

Bulk-get errors from all supervisors: GET /druid/indexer/v1/supervisor/errors

{
  "supervisorErrors": [
    {
      "supervisorId":_______________,
      "errors":[
        {
          "timestamp":_______________,
          "errorMessage":_______________,
          "supervisorState":________________
        },
        {
          "timestamp":_______________,
          "errorMessage":_______________,
          "supervisorState":________________
        }
      ]
    }
  ]
}

I plan to achieve this behaviour by extending EmittingLogger with an error(Throwable t, String message, String supervisorId) method. The method calls Logger#error() and then writes details about the exception to a CircularBuffer held by the corresponding supervisor class. An additional errors method would then be added to SupervisorResource to expose an endpoint that returns the contents of that buffer.
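
The flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `SupervisorErrorLogger` and `SupervisorErrorEntry` are hypothetical names, `java.util.logging` stands in for Druid's logger, an `ArrayDeque` stands in for the CircularBuffer, and the `stateLookup` function is an assumed way of obtaining the supervisor state for the response.

```java
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical entry mirroring the fields of the proposed JSON response. */
final class SupervisorErrorEntry {
    final Instant timestamp;
    final String errorMessage;
    final String supervisorState;

    SupervisorErrorEntry(Instant timestamp, String errorMessage, String supervisorState) {
        this.timestamp = timestamp;
        this.errorMessage = errorMessage;
        this.supervisorState = supervisorState;
    }
}

/** Hypothetical logger wrapper: logs as usual, then records the error per supervisor. */
final class SupervisorErrorLogger {
    private static final Logger LOG = Logger.getLogger(SupervisorErrorLogger.class.getName());

    private final int maxErrorsPerSupervisor;
    private final Function<String, String> stateLookup; // assumption: supervisorId -> current state
    private final Map<String, Deque<SupervisorErrorEntry>> buffers = new ConcurrentHashMap<>();

    SupervisorErrorLogger(int maxErrorsPerSupervisor, Function<String, String> stateLookup) {
        if (maxErrorsPerSupervisor <= 0) {
            throw new IllegalArgumentException("maxErrorsPerSupervisor must be positive");
        }
        this.maxErrorsPerSupervisor = maxErrorsPerSupervisor;
        this.stateLookup = stateLookup;
    }

    /** Same shape as the proposed method: log first, then store details in the bounded buffer. */
    void error(Throwable t, String message, String supervisorId) {
        LOG.log(Level.SEVERE, message, t);
        SupervisorErrorEntry entry =
            new SupervisorErrorEntry(Instant.now(), message + ": " + t, stateLookup.apply(supervisorId));
        Deque<SupervisorErrorEntry> buffer =
            buffers.computeIfAbsent(supervisorId, id -> new ArrayDeque<>());
        synchronized (buffer) {
            if (buffer.size() == maxErrorsPerSupervisor) {
                buffer.removeFirst(); // evict the oldest entry, as a circular buffer would
            }
            buffer.addLast(entry);
        }
    }

    /** What an errors endpoint could return for one supervisor, oldest first. */
    List<SupervisorErrorEntry> errorsFor(String supervisorId) {
        Deque<SupervisorErrorEntry> buffer = buffers.get(supervisorId);
        if (buffer == null) {
            return new ArrayList<>();
        }
        synchronized (buffer) {
            return new ArrayList<>(buffer);
        }
    }
}
```

A resource method for the per-supervisor endpoint would then only need to serialize the result of `errorsFor(supervisorId)`, and the bulk endpoint would do the same for every known supervisor.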

Operational impact

No significant operational impact.
