Skip to content

Conversation

@himaschal
Copy link
Collaborator

Changes made:

This PR introduces basic OpenTelemetry-style trace and span ID propagation and structured logging middleware to the Trillian server. The changes are focused on improving observability and request tracing for both HTTP and gRPC endpoints.

  • Added request tracing - with a focus on timing - to Trillian log server which allows for structured logging of a request from http through to the gRpc calls.
  • Added structured logging package with transaction and span ID tracking
  • HTTP middleware for automatic request timing and context injection
  • gRPC interceptor for consistent tracing across service boundaries
  • Context propagation between HTTP and gRPC calls using standardized headers
  • JSON logging format
  • Added / updated tests
  • The logging is handled by the logrus library - which is useful for structured output.
  • Conforms to the opentelemetry standards for request correlation id's etc.

The are 2 health endpoints exposed in terms of the trillian project:

  • /metrics which uses the prometheus http handler and by default are not logged (this is standard practice) ie. we will not see any logs for this
  • /healthz which uses the standard http handler, runs on schedule, and we will see logs for these
{"elapsed":"0ms","event_id":"timing","level":"info","method":"GET","msg":"request completed","path":"/healthz","span_id":"e52096a54fe76c74","status":200,"time":"2025-07-30T08:03:39.413Z","trace_id":"26870f0e8feb8c310f1b327e9740d282"}
{"elapsed":"0ms","event_id":"timing","level":"info","method":"GET","msg":"request completed","path":"/healthz","span_id":"4955ed2f3c011b0e","status":200,"time":"2025-07-30T08:04:09.413Z","trace_id":"6ac1f80f10b665da523e5ac0fb0f0018"}
{"elapsed":"0ms","event_id":"timing","level":"info","method":"GET","msg":"request completed","path":"/healthz","span_id":"3da73e7ec906d795","status":200,"time":"2025-07-30T08:04:39.412Z","trace_id":"82b811e0a0bded795379b0b173b4a478"}
{"elapsed":"0ms","event_id":"timing","level":"info","method":"GET","msg":"request completed","path":"/healthz","span_id":"dfc9cdddae887aaf","status":200,"time":"2025-07-30T08:05:09.410Z","trace_id":"c4d03a16c17588e50f53a22c6c4197a0"}
{"elapsed":"1ms","event_id":"timing","level":"info","method":"GET","msg":"request completed","path":"/healthz","span_id":"1f22ec8af0ea9848","status":200,"time":"2025-07-30T08:05:39.411Z","trace_id":"03706449e94d468fdaebb7f203acf5d1"}
{"elapsed":"0ms","event_id":"timing","level":"info","method":"GET","msg":"request completed","path":"/healthz","span_id":"e956b94da9578b2f","status":200,"time":"2025-07-30T08:06:09.413Z","trace_id":"6164ac16d5df1f7bce54f26eea05064c"}

We can also see that the gRpc call is logged (tested from invoking the get-sth call and tracing through)

curl -H 'traceparent: 00-$(openssl rand -hex 16)-$(openssl rand -hex 8)-01' http://localhost:6962/log/ct/v1/get-sth
Handling connection for 6962
{"tree_size":0,"timestamp":1753850472067,"sha256_root_hash":"47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=","tree_head_signature":"BAMASDBGAiEAvDCfQQenVkH7s4zI+Nw5Ueugok7ntLsTCoZ5lN8ZaWsCIQCAAV2jofjsu0uEyCNr6YwsXzgXJ/tVihO57ORIqZLfKw=="}%

Expected:
Both the CTFE and Logserver should have the same trace_id and different span id's

CTFE logs:

{"elapsed":"3ms","event_id":"timing","level":"info","method":"GET","msg":"request completed","path":"/log/ct/v1/get-sth","span_id":"613cf118eb34bf77","status":200,"time":"2025-07-30T08:13:05.593Z","trace_id":"5e7facbbfcd4725608bde5bac22f2606"}

Logserver (Trillian):

{"elapsed":0,"event_id":"timing","level":"info","method":"/trillian.TrillianLog/GetLatestSignedLogRoot","msg":"gRPC call completed","span_id":"87be766ec02e7cef","status":"OK","time":"2025-07-30T08:13:05.592Z","trace_id":"5e7facbbfcd4725608bde5bac22f2606"}

himaschal and others added 12 commits July 17, 2025 13:12
…equest to response. Added transaction_id and span_id to the requests and propogate them through to the gRpc calls to help trace a request from end to end. NB. this also requires changes on the certificate-transparency-go project (ctfe) to be useful.
… user input

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
…orrect behavior. Each service should generate its own span id
- Replace legacy transaction_id with standard trace_id (32-char hex)
- Generate OpenTelemetry-standard IDs using crypto/rand (128-bit trace, 64-bit span)
- Implement proper service boundary handling (trace_id propagates, span_id doesn't)
- Add gRPC metadata support for x-trace-id header propagation
- Update HTTP middleware to integrate with logging system
- Add sanitization for security in log output (newline removal)
- Update all tests to reflect new field names and behavior
- Maintain backward compatibility for Trillian's role as backend gRPC service
- Add MiddlewareFunc for HTTP handler function wrapping
- Ensure proper trace correlation across CTFE->Trillian service calls
@himaschal himaschal changed the title RA-7777: Added event timing that conforms to opentelemetry standards. RA-7777: [Trillian] Added event timing that conforms to opentelemetry standards. Jul 30, 2025
/trillian_map_server
default.etcd
cockroach-data/
server/.DS_Store

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change this to just .DS_Store and it will exclude it from all folders.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants