Skip to content

Use the accessor method so state lock is used to check if perf standby#9496

Merged
ncabatoff merged 1 commit into
masterfrom
fix-perfstandby-data-race
Jul 20, 2020
Merged

Use the accessor method so state lock is used to check if perf standby#9496
ncabatoff merged 1 commit into
masterfrom
fix-perfstandby-data-race

Conversation

@ncabatoff
Copy link
Copy Markdown
Collaborator

@ncabatoff ncabatoff commented Jul 16, 2020

Solves this race:

ARNING: DATA RACE
Read at 0x00c0046f3061 by goroutine 434:
  github.com/hashicorp/vault/vault.(*Core).ForwardRequest()
      /home/ncc/gh/vault-enterprise/vault/request_forwarding.go:374 +0x7f9
  github.com/hashicorp/vault/http.forwardRequest()
      /home/ncc/gh/vault-enterprise/http/handler.go:741 +0x2bb
  github.com/hashicorp/vault/http.handleRequestForwarding.func1()
      /home/ncc/gh/vault-enterprise/http/handler.go:709 +0x17a
  net/http.HandlerFunc.ServeHTTP()
      /home/ncc/go1.13.7/src/net/http/server.go:2007 +0x51
  net/http.(*ServeMux).ServeHTTP()
      /home/ncc/go1.13.7/src/net/http/server.go:2387 +0x288
  github.com/hashicorp/vault/http.wrapHelpHandler.func1()
      /home/ncc/gh/vault-enterprise/http/help.go:24 +0x1b3
  net/http.HandlerFunc.ServeHTTP()
      /home/ncc/go1.13.7/src/net/http/server.go:2007 +0x51
  github.com/hashicorp/vault/http.wrapCORSHandler.func1()
      /home/ncc/gh/vault-enterprise/http/cors.go:29 +0xe37
  net/http.HandlerFunc.ServeHTTP()
      /home/ncc/go1.13.7/src/net/http/server.go:2007 +0x51
  github.com/hashicorp/vault/http.rateLimitQuotaWrapping.func1()
      /home/ncc/gh/vault-enterprise/http/util.go:90 +0xd45
  net/http.HandlerFunc.ServeHTTP()
      /home/ncc/go1.13.7/src/net/http/server.go:2007 +0x51
  github.com/hashicorp/vault/http.wrapDRSecondaryHandler.func1()
      /home/ncc/gh/vault-enterprise/http/util_ent.go:72 +0x960
  net/http.HandlerFunc.ServeHTTP()
      /home/ncc/go1.13.7/src/net/http/server.go:2007 +0x51
  github.com/hashicorp/vault/http.wrapGenericHandler.func1()
      /home/ncc/gh/vault-enterprise/http/handler.go:308 +0x7c0
  net/http.HandlerFunc.ServeHTTP()
      /home/ncc/go1.13.7/src/net/http/server.go:2007 +0x51
  github.com/hashicorp/go-cleanhttp.PrintablePathCheckHandler.func1()
      /home/ncc/gh/vault-enterprise/vendor/github.com/hashicorp/go-cleanhttp/handlers.go:42 +0x105
  net/http.HandlerFunc.ServeHTTP()
      /home/ncc/go1.13.7/src/net/http/server.go:2007 +0x51
  net/http.serverHandler.ServeHTTP()
      /home/ncc/go1.13.7/src/net/http/server.go:2802 +0xce
  net/http.initNPNRequest.ServeHTTP()
      /home/ncc/go1.13.7/src/net/http/server.go:3366 +0xfc
  net/http.(*initNPNRequest).ServeHTTP()
      <autogenerated>:1 +0xa6
  net/http.Handler.ServeHTTP-fm()
      /home/ncc/go1.13.7/src/net/http/server.go:86 +0x64
  net/http.(*http2serverConn).runHandler()
      /home/ncc/go1.13.7/src/net/http/h2_bundle.go:5713 +0xac

Previous write at 0x00c0046f3061 by goroutine 336:
  github.com/hashicorp/vault/vault.(*Core).waitForPerfStandby.func5()
      /home/ncc/gh/vault-enterprise/vault/ha_ent.go:238 +0x1629
  github.com/oklog/run.(*Group).Run.func1()
      /home/ncc/gh/vault-enterprise/vendor/github.com/oklog/run/group.go:38 +0x34

Goroutine 434 (running) created at:
  net/http.(*http2serverConn).processHeaders()
      /home/ncc/go1.13.7/src/net/http/h2_bundle.go:5447 +0x90f
  net/http.(*http2serverConn).processFrame()
      /home/ncc/go1.13.7/src/net/http/h2_bundle.go:4976 +0x59b
  net/http.(*http2serverConn).processFrameFromReader()
      /home/ncc/go1.13.7/src/net/http/h2_bundle.go:4934 +0x7ad
  net/http.(*http2serverConn).serve()
      /home/ncc/go1.13.7/src/net/http/h2_bundle.go:4433 +0x1493
  net/http.(*http2Server).ServeConn()
      /home/ncc/go1.13.7/src/net/http/h2_bundle.go:4031 +0xd9b
  net/http.http2ConfigureServer.func1()
      /home/ncc/go1.13.7/src/net/http/h2_bundle.go:3857 +0x117
  net/http.(*conn).serve()
      /home/ncc/go1.13.7/src/net/http/server.go:1800 +0x1d35

Goroutine 336 (running) created at:
  github.com/oklog/run.(*Group).Run()
      /home/ncc/gh/vault-enterprise/vendor/github.com/oklog/run/group.go:37 +0x11c
  github.com/hashicorp/vault/vault.(*Core).waitForPerfStandby()
      /home/ncc/gh/vault-enterprise/vault/ha_ent.go:298 +0xf93
  github.com/hashicorp/vault/vault.addEnterpriseHaActorsImpl.func1()
      /home/ncc/gh/vault-enterprise/vault/ha_ent.go:48 +0x53
  github.com/oklog/run.(*Group).Run.func1()
      /home/ncc/gh/vault-enterprise/vendor/github.com/oklog/run/group.go:38 +0x34

Copy link
Copy Markdown
Contributor

@mjarmy mjarmy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ncabatoff ncabatoff merged commit b95822b into master Jul 20, 2020
@ncabatoff ncabatoff deleted the fix-perfstandby-data-race branch July 20, 2020 14:34
@ncabatoff ncabatoff modified the milestones: 1.4.4, 1.6 Aug 17, 2020
pull Bot pushed a commit to Reality2byte/vault that referenced this pull request Sep 19, 2025
… (hashicorp#9496)

The code that loads the trusted certificate cache for cert-based
authentication ignores any error that occurs while attempting to load
any of the certificates that it finds. Undoubtedly some deployments
have broken certificates or other non-certificate files stored in
their respective back-ends, and so this is important behavior: we
don't want to fail authentication just because `README.md` is not a
valid certificate!

In addition, because listing files and loading certificates is
expensive, the server maintains a cache of trusted certificates. This
cache is populated the first time it's needed, and then used for the
lifetime of the process. If a file fails to load as a certificate,
then it is simply not included in the cache.

These two things lead to a problem when using a backend that might be
subject to transient failures: a hiccough in the certificate loading
process can cause the server to establish a cache that is missing an
otherwise valid certificate. This can then lead to clients failing to
authenticate to the server, until such time as the server is restarted
and the cache reloaded.

This change makes the certificate cache more resilient to loading
failures, by caching partial successes. With this patch, the cache
behavior becomes:

- If the cache exists *and* is either complete or it is not yet time
  to attempt to reload the certificates, then the cached results are
  used without reservation.

- Otherwise we attempt to load the certificates from storage:

  - If the cache does not already exist then a new, empty cache is
    created.

  - The storage is listed, we attempt to load everything in storage,
    skipping things that we have already successfully loaded, and
    skipping things that we cannot load, as usual.

  - Once we have attempted to load everything from storage, if there
    were any errors, we compute a deadline for retrying the load, with
    an exponentially increasing delay. If there were no errors, then
    the cache is considered complete, and there will be no retry.

This has the nice behavior that we recover from transient failures
eventually, while the exponential back-off ensures that we don't waste
too much time attempting to load certificates that can never be
loaded.

Co-authored-by: John Doty <john.doty@databricks.com>
Co-authored-by: Steven Clark <steven.clark@hashicorp.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants