rfcs: Add Block Triggers hash #26
base: master
Conversation
That3Percent left a comment:
Thanks for your thoughtful RFC @evaporei. I like this direction but have some considerations for you. Let me know what you think.
> Technical considerations:
>
> - It will probably make sense to first convert this `Vec<Trigger>` into another
>   data structure, where there's only one `LightEthereumBlock` and all repeated
>   data gets converted into a `Set` (eg: `BTreeSet`);
Why will it make sense to convert an ordered collection to an unordered collection here?
The idea with the `Set`s and with this separate structure is to facilitate hashing. Right now all `Trigger`s hold an `Arc` pointing to the same `Block`. IMO we should have a single structure that can be traversed once, without having to hash the same data twice.
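The flattening described above might look roughly like the sketch below. The type names mirror the ones quoted in the RFC, but the fields and the `flatten`/`digest` helpers are hypothetical simplifications, not graph-node's actual code; the point is only that the shared block is hashed once while per-trigger data is collected into an ordered set so the digest is deterministic.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeSet;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

// Hypothetical, heavily simplified stand-ins for graph-node's real types.
#[derive(Hash)]
struct LightEthereumBlock {
    number: u64,
    hash: [u8; 32],
}

struct Trigger {
    block: Arc<LightEthereumBlock>, // every trigger shares the same block
    log_index: u64,
}

// Flattened view: the shared block appears exactly once, and the
// per-trigger data lives in an ordered set.
struct HashableTriggers<'a> {
    block: &'a LightEthereumBlock,
    trigger_data: BTreeSet<u64>,
}

fn flatten(triggers: &[Trigger]) -> Option<HashableTriggers<'_>> {
    let block = &*triggers.first()?.block;
    let trigger_data = triggers.iter().map(|t| t.log_index).collect();
    Some(HashableTriggers { block, trigger_data })
}

fn digest(h: &HashableTriggers) -> u64 {
    let mut hasher = DefaultHasher::new();
    h.block.hash(&mut hasher); // block data is hashed exactly once
    for d in &h.trigger_data {
        d.hash(&mut hasher);
    }
    hasher.finish()
}
```

Because `trigger_data` is a `BTreeSet`, two trigger vectors that differ only in ordering produce the same digest, which also partially answers the "ordered vs. unordered" question above.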
> | subgraph_id | QmZwNVFxXjUcSD42n5Lztg2D4DorZLUPGv8gTAXQpf5Wy9 |
> | block_number | 22062424 |
> | block_hash | \x92cb334274a3f78c4a2bab96853992deca00fd57ee8c44c69c93dfd9f012f707 |
> | digest | \x2b6ecfca88033617cf5bfbfd332ba593591bab1a955e049271d2fd7696a55a9e5d4024282b96a639f2077d0cdaa99000734768ae106498a8514f0819b64fc8f65c9637dd788e472e3287eb15e3945338c27e38eaec8a27268aca8cbb45f59d7cbb9893611b9e56f4850dc5696673f7a8c7b695d2d9127811856e1ccc76a942a29dca9e1a92287c807470c3dfd7d3f0fffeb66d29a22278a7746ddf4fb9a6f76a6ac11dd98cc65f5bfacda4ebdbca51e161fadaffd3434594886ded4301929e9a6098995277c9b36513a0592e06d0a6a8035506c281bd7e0e14e01149a9122cfd6e980f65519a8eb28b8612c245ef66f671761361d4317d53c9e4af03ed4999a0 |
Separately, we are looking to deploy a faster and more determinism-sensitive PoI. When using PoIv2 we could roll this information into the PoI instead of maintaining a separate table. There are tradeoffs here, but it is worth considering. Eg: use a separate table as described when using PoIv1, and roll the hash into the canonical PoI when PoIv2 is enabled. That removes a table and allows determinism issues to be detected without a separate protocol. What do you think?
> - On the hash itself
>   - The `digest` field can be a lot smaller, the value here is from a POI,
>     which we don't need to keep the same properties;
>   - It should be calculated in a separate thread/future to the mapping code
As an aside, the PoI can be calculated in a separate thread as well. But I'm generally skeptical of using threads in this way within graph-node.
The maximum amount of parallelism that can be achieved is limited by the number of cores. We already have a unit of parallelism exploited natively when indexing in a production environment: the subgraphs themselves. Once that limit is reached (for example, by indexing more than num_cores worth of CPU-bound subgraphs), any additional parallelism will actually degrade performance, because making everything threaded adds overhead (eg: message passing, locks, context switching). And even if pipelining is done well, only optimizing the slowest stage improves throughput.
It's hard to say categorically that this is "wrong" because it's context-dependent, but performance tweaks like threading and caching need to be handled with care: they are not magical, and often cause the very performance problems they are trying to solve.
Another consideration for using threads is that there is overhead for converting values to owned, which is needed because passing references across threads rightfully angers the borrow checker. Consider: with modern hash functions that operate at the limit of memory bandwidth it is generally faster to hash a string than to clone it! You can't pass references across threads, so we're getting into Arc or Clone territory, which adds non-negligible overhead.
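The ownership cost described above can be illustrated with plain std threads. The helper names (`hash_bytes`, `hash_on_worker`) are hypothetical; the sketch only shows that sending work to another thread forces an owned copy (or an `Arc`), whereas hashing inline can simply borrow.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::mpsc;
use std::thread;

// Hashing inline: borrows the data, no copy needed.
fn hash_bytes(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

// Hashing on a worker thread: a `&[u8]` borrowed from the caller's
// stack cannot be sent to a spawned thread, so we must clone first.
// This clone is exactly the overhead the comment above warns about.
fn hash_on_worker(data: &[u8]) -> u64 {
    let owned: Vec<u8> = data.to_vec();
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(hash_bytes(&owned));
    });
    rx.recv().unwrap()
}
```

With a fast hash function, `data.to_vec()` plus the channel round-trip can easily cost more than hashing the bytes in place, which is the crux of the "faster to hash a string than to clone it" point.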
That's fair, I suggested this mostly so as not to block indexing. This could change if we follow @lutter's suggestion of making this a determinism trace for debugging purposes, so it wouldn't affect graph-node in production mode.
> data structure, where there's only one `LightEthereumBlock` and all repeated
> data gets converted into a `Set` (eg: `BTreeSet`);
> - On the hash itself
>   - The `digest` field can be a lot smaller, the value here is from a POI,
The new PoI has a much smaller "intermediate digest". Don't quote me on this but I think it's about a 20x improvement. We should be able to use the new PoI methods whether or not PoIv2 is enabled (see above) by always using the latest stable-hash when a separate table is used.
I'm not yet convinced that the value can be smaller here if we need a multi-set hash. The reason we use a multi-set hash is because when we have multiple data sources (for example, IPFS) and those data sources have unreliable availability, we need an efficient means to remove values from the hash. The seeming bloat here exists to have a high-quality hash which affords that feature. (The state space of a multi-set hash needs to be larger than the final digest, and we need to store the hasher state instead of the digest).
If we need to do the same thing (for example, an IPFS data source makes an RPC call) then we'll need PoI-like semantics. It may also be convenient to re-use stable-hash anyway.
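To make the "larger state space" point concrete, here is a toy multi-set hasher. Each element is hashed independently and the element hashes are combined with wrapping addition: addition is commutative (order-independent) and invertible, which is what affords cheap removal. This is only an illustration of the mechanism; a real construction like stable-hash's uses a much larger internal state than a `u64` precisely because the state must be bigger than the final digest to keep collision resistance, which is the "seeming bloat" mentioned above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy multi-set hash: NOT collision-resistant, for illustration only.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct MultisetHash(u64);

impl MultisetHash {
    fn new() -> Self {
        MultisetHash(0)
    }

    fn element_hash<T: Hash>(v: &T) -> u64 {
        let mut h = DefaultHasher::new();
        v.hash(&mut h);
        h.finish()
    }

    // Adding an element: commutative, so insertion order doesn't matter.
    fn insert<T: Hash>(&mut self, v: &T) {
        self.0 = self.0.wrapping_add(Self::element_hash(v));
    }

    // Removing an element: possible because addition is invertible.
    // This is the operation an unreliable data source (eg: IPFS) needs.
    fn remove<T: Hash>(&mut self, v: &T) {
        self.0 = self.0.wrapping_sub(Self::element_hash(v));
    }
}
```

Note that supporting `remove` requires persisting the hasher *state* (here the `u64`) rather than a one-way final digest, which is why the stored value cannot simply be truncated.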
Ok!
Also linking here: graphprotocol/graph-node#3554 (comment) as an alternative that prevents determinism issues with verifiable queries, rather than detecting them after the fact.
> | digest | \x2b6ecfca88033617cf5bfbfd332ba593591bab1a955e049271d2fd7696a55a9e5d4024282b96a639f2077d0cdaa99000734768ae106498a8514f0819b64fc8f65c9637dd788e472e3287eb15e3945338c27e38eaec8a27268aca8cbb45f59d7cbb9893611b9e56f4850dc5696673f7a8c7b695d2d9127811856e1ccc76a942a29dca9e1a92287c807470c3dfd7d3f0fffeb66d29a22278a7746ddf4fb9a6f76a6ac11dd98cc65f5bfacda4ebdbca51e161fadaffd3434594886ded4301929e9a6098995277c9b36513a0592e06d0a6a8035506c281bd7e0e14e01149a9122cfd6e980f65519a8eb28b8612c245ef66f671761361d4317d53c9e4af03ed4999a0 |
> | provider | mainnet-0 |
> | created_at | 2022-05-13 22:27:08.304456+00 |
Regarding the storage format:

- we don't need `id` and `subgraph_id`, they are implicit from the schema name `sgdNNNN`
- we only need the block number, not the hash, since in the context of a subgraph the block number is unambiguous (it must be on the chain going back from the subgraph head)
- `provider` is a bit tricky: users can change the endpoint in the configuration and keep the same provider name. If we want to track that, we should have a global `providers(id: int4, url, label)` table in the primary and store the provider `id` here
- I am not sure how useful the `created_at` column would be; it will add 8 bytes for each row

I would actually just use the current `poi2$` table and add columns `provider_id`, `created_at` (maybe), and `input_digest` or some such
Nice idea on the `subgraph_id`! I was thinking about `created_at` in the sense of not trusting that provider data will be immutable over time.
> to index subgraphs that **use those values in their mappings**. If by any chance
> in Firehose shootout we've done in the **integration cluster**, there were no
> subgraphs using these values **we wouldn't spot any POI differences**, which
> is a **very severe issue**.
The examples given only pertain to block-level attributes. For those, it might make sense to actually store some sort of digest for each block per provider (instead of per subgraph)
The differences that are usually the most difficult to spot are issues with eth_call which are AFAICT not covered by this proposal, but it's reasonable to assume that divergence there leads to divergence in the PoI.
Since this would all be more for forensics than to support regular queries, another approach to storing this data would be to standardize the logging format that indexing uses better, and have tooling that can compare two indexing logs in a convenient way, rather than store it in the database.
How do you see this mechanism being used? From the description it sounds like it would mostly be for testing graph-node itself and/or checking different providers, i.e. more a test tool than something that is needed in every production installation of graph-node. My take is that in production it doesn't really matter that a provider supplies diverging data, as long as it doesn't affect the PoI.
If this is for provider testing, a tool that more directly exercises providers and compares their responses might actually be more usable and useful. That tool could, e.g., be a special indexing mode that runs the block stream against two (or more?) providers and compares their responses; making that more explicit would have the advantage that error messages could clearly state what part of the response is diverging, rather than just producing two separate digests and forcing us to investigate what caused the divergence.
Long story short: it would be good to explain how the data that is collected here would be used, and what sort of operations it needs to enable.
> another approach to storing this data would be to standardize the logging format that indexing uses better, and have tooling that can compare two indexing logs in a convenient way
Jannis and I actually already implemented this back in the day. There is an environment variable `GRAPH_LOG_POI_EVENTS` to turn on PoI logging. The idea is that the hash is an optimistic check with lower performance overhead: it allows you to index and compare a single value at the end, and only if there is a discrepancy do you perform a binary search to find the offending block, then try again with logging. It's not an either/or, because what can be logged can be hashed.
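The "bisect for the offending block" step above can be sketched as an ordinary binary search. The oracle `digests_match` is hypothetical: it stands for comparing the two runs' cumulative digests up to a given block, and is assumed monotone (once the digests diverge, they stay diverged).

```rust
// Find the first block at which two indexing runs diverge, given an
// oracle that says whether the cumulative digests still match at a
// given block. Precondition: digests_match(lo) is true and
// digests_match(hi) is false, and the oracle is monotone.
fn first_diverging_block(mut lo: u64, mut hi: u64, digests_match: impl Fn(u64) -> bool) -> u64 {
    while lo + 1 < hi {
        let mid = lo + (hi - lo) / 2;
        if digests_match(mid) {
            lo = mid; // still in agreement: divergence is later
        } else {
            hi = mid; // already diverged: divergence is here or earlier
        }
    }
    hi
}
```

This is why the cheap end-of-run digest pairs well with the logging mode: O(log n) re-index attempts locate the block, and only that block needs the expensive detailed logging.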
@lutter Agree on all points.
@That3Percent @lutter thanks for the review and the great considerations! After I investigate a few determinism issues that showed up recently, I'll come back to this and see what I can improve here.
Addresses graphprotocol/graph-node#3554