Backfill page on sub error #947
Conversation
Force-pushed from 6368621 to 4cd89cc
```go
logs, err := r.backfillPage(r.ctx, cfg, backfillFromBlockNumber)
if err != nil {
	logger.Error("failed to backfill page, closing", zap.Error(err))
	return
```
if you terminate the goroutine here, what will restart the watcher? Same for all returns in this error block
Actually, this should be a Fatal.
If we can't guarantee that the page has been backfilled, or that the sub has been recreated, we're at risk of a data integrity issue that could lead to gaps and undefined behavior when a client invokes the API.
Let me know what you think @mkysel
```go
case err, open := <-sub.Err():
	if !open {
		logger.Error("subscription channel closed, closing watcher")
		return
	}
```
when doing an initial backfill, the sub cancellation might happen quite a bit while we are still backfilling. We should totally do the right thing, but tearing down everything seems like overkill.
I've been thinking about this one for a couple of days, and I've come to believe this is not only overkill but also unnecessary. I'll come back today with a different approach!
```go
case err, open := <-sub.Err():
	if !open {
		logger.Error("subscription channel closed, closing watcher")
		return
```
does this mean we will stop indexing until the entire process gets restarted?
I've removed the check for whether the sub error channel is open. Instead, we rebuild the sub when there's an actual error; the sub error channel itself is recreated by go-ethereum (its owner).
Force-pushed from 5e8727f to 26b1ce7
Force-pushed from b87fc6d to 476bbed
Issue in pkg/indexer/rpc_streamer/rpc_log_streamer.go:1:
The field backfillBlockPageSize on RPCLogStreamer is read by GetNextPage and written in the backfill error handling (r.backfillBlockPageSize = blockPageSize) across different watcher goroutines without synchronization. This can lead to data races and unpredictable behavior. Consider using atomic operations or a mutex to synchronize access to r.backfillBlockPageSize.
```diff
+import "sync/atomic"
@@
-	r.backfillBlockPageSize = blockPageSize
+	atomic.StoreUint64(&r.backfillBlockPageSize, blockPageSize)
@@
-	toBlock := min(fromBlockNumber+r.backfillBlockPageSize-1, highestBlock)
+	toBlock := min(fromBlockNumber+atomic.LoadUint64(&r.backfillBlockPageSize)-1, highestBlock)
```
There's a watcher per blockchain process, and the indexer has N blockchain processes. The block page size is unique per blockchain.
It shouldn't matter which goroutine adjusts the block page size, as it will always be relevant to the other watchers in the same blockchain process.
Thanks for the clarification on the architecture. While I understand that functionally all watchers in the same blockchain process should use the same block page size, this is still a data race from Go's memory model perspective - multiple goroutines accessing the same memory location where at least one is writing without synchronization.
Even if the race is benign in practice, it could:
- Be flagged by `go run -race`
- Lead to undefined behavior per the Go memory model
- Cause issues with compiler optimizations

The atomic operations are lightweight and would eliminate the race condition entirely without affecting performance. Would you be open to that approach for correctness?
What proposal do you have?
I propose using atomic operations to fix the data race. Two changes would be needed:

1. Change the field type in `RPCLogStreamer` from `uint64` to `atomic.Uint64` (or use `sync/atomic` functions)
2. Update the two access locations:

Writing (in `watchContract`):

```go
import "sync/atomic"

// Instead of:
r.backfillBlockPageSize = blockPageSize
// Use:
atomic.StoreUint64(&r.backfillBlockPageSize, blockPageSize)
```

Reading (in `GetNextPage`):

```go
// Instead of:
toBlock := min(fromBlockNumber+r.backfillBlockPageSize-1, highestBlock)
// Use:
toBlock := min(fromBlockNumber+atomic.LoadUint64(&r.backfillBlockPageSize)-1, highestBlock)
```

This is a minimal change that eliminates the race condition while maintaining the same functionality.
Backfill page on subscription error by restructuring `RPCLogStreamer` error handling and reducing `MaxChainDisconnectTime` from 300s to 60s

- Restructured the `RPCLogStreamer.watchContract` method in rpc_log_streamer.go to handle subscription errors within the main loop and restart the backfill process when errors occur
- Reduced `MaxChainDisconnectTime` in the `AppChainOptions` struct from 300s to 60s in options.go
- Added a `buildSubscriptionWithBackoff` method and changed the channel buffer size from a hardcoded 100 to dynamic sizing based on expected logs per block

📍 Where to Start

Start with the `watchContract` method in rpc_log_streamer.go to understand the restructured main loop and error handling logic.

Changes since #947 opened
Macroscope summarized 25da832.