rfcs: Add Block Triggers hash #26
# RFC-0009: Block Triggers hash

<dl>
  <dt>Author</dt>
  <dd>Eva Paulino Pace</dd>

  <dt>RFC pull request</dt>
  <dd><a href="https://github.com/graphprotocol/rfcs/pull/26">https://github.com/graphprotocol/rfcs/pull/26</a></dd>

  <dt>Date of submission</dt>
  <dd>2022-05-16</dd>

  <dt>Date of approval</dt>
  <dd>YYYY-MM-DD</dd>
</dl>

## Contents

<!-- toc -->

## Summary

The "Block Triggers hash" will allow indexers to compare their subgraphs' main source of entropy, the chain provider, by storing a hash of the fetched data that is fed into each subgraph's mapping code.

## Goals & Motivation

One of The Graph's coolest features is that queries are deterministic: given a `Qm` subgraph hash, indexing it should **always** give the same result. This is possible because we inherit the blockchain's determinism property; however, there is a big loophole that can break this amazing feature: **the chain provider**.

Currently, the main (or only) type of connection we offer indexers (in The Graph Network) is **JSON-RPC**. To use it, they can either run a node themselves or use a third-party service like Alchemy. Either way, the provider can be faulty and return incorrect results for a number of different reasons.

To be a little more specific, let's say there are indexers/nodes `A` and `B`. Both are indexing subgraph `Z`. Indexer `A` is using Alchemy and `B` is using Infura.

Given a block `14_722_714` with a determined hash, both providers will very likely give the same result for these two values (block number and hash); however, other fields such as `gas_used` or `total_difficulty` could be incorrect. Ideally they would always be correct, since serving chain data is a provider's main job, but this is exactly the issue we faced when testing indexing of Ethereum mainnet with the Firehose.

These field/value differences between providers are fed directly into the subgraph mappings, which are the current input of the POI algorithm and the base of The Graph's determinism property. Not taking the possible faultiness of chain providers into account can break determinism altogether.

And the biggest problem today is that, to spot these POI differences, we have to index subgraphs that **use those values in their mappings**. If, by chance, the Firehose shootout we ran in the **integration cluster** had contained no subgraphs using these values, **we wouldn't have spotted any POI differences**, which is a **very severe issue**.

POI differences found in the Firehose shootout, for reference: https://gist.github.com/evaporei/660e57d95e6140ca877f338426cea200.

So, in summary, the problems described above are:

- We currently treat the **chain provider** as a **source of truth**, which can only be questioned in the case of re-orgs;
- We don't have a good way to compare provider input (one that could spot POI differences) without the indirection of a subgraph mapping.

## Urgency

The urgency per se is low, because we can spot POI differences by using subgraphs themselves. In the long run, though, we'll very likely need some way to compare provider input between nodes/indexers, so that we're more efficient and accurate when testing new providers or new forms of fetching chain data (eg: the Firehose).

## Terminology

`bth` is a candidate table name for "block triggers hash" (subject to change).

## Detailed Design

A subgraph can have multiple handlers, each of which can have as its parameter one of the following types (in Ethereum):

- Event log;
- Call;
- Block.

Currently, this is the data structure that we use in code to represent what gets into a mapping handler ([source](https://github.com/graphprotocol/graph-node/blob/b8faa45fc96f53226648ca65ddacff164b75e018/chain/ethereum/src/trigger.rs#L44)):

```rust
pub enum MappingTrigger {
    Log {
        block: Arc<LightEthereumBlock>,
        transaction: Arc<Transaction>,
        log: Arc<Log>,
        params: Vec<LogParam>,
        receipt: Option<Arc<TransactionReceipt>>,
    },
    Call {
        block: Arc<LightEthereumBlock>,
        transaction: Arc<Transaction>,
        call: Arc<EthereumCall>,
        inputs: Vec<LogParam>,
        outputs: Vec<LogParam>,
    },
    Block {
        block: Arc<LightEthereumBlock>,
    },
}
```

> Note: This actually gets converted to either an `AscEthereumLog`,
> `AscEthereumCall` or `AscEthereumBlock` to be called correctly by
> the subgraph's AssemblyScript mapping/handler code.
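To make the hashing idea concrete, here is a minimal sketch using only the Rust standard library. The enum is a hypothetical, simplified stand-in for graph-node's `MappingTrigger` (not the real API), and `DefaultHasher` is used purely for illustration; a real implementation would need a cryptographic hash with a stable, cross-platform encoding.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical, simplified stand-in for graph-node's `MappingTrigger`.
#[derive(Hash)]
enum MappingTrigger {
    Log { block_number: u64, log_index: u64, data: Vec<u8> },
    Block { block_number: u64 },
}

/// Fold every trigger of a block into a single digest.
/// NOTE: `DefaultHasher` is for illustration only; a real implementation
/// would use a cryptographic hash with a stable cross-platform encoding.
fn block_triggers_hash(triggers: &[MappingTrigger]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for trigger in triggers {
        trigger.hash(&mut hasher);
    }
    hasher.finish()
}

fn main() {
    let triggers = vec![
        MappingTrigger::Block { block_number: 14_722_714 },
        MappingTrigger::Log { block_number: 14_722_714, log_index: 0, data: vec![1, 2, 3] },
    ];
    // Two nodes that received identical provider data agree on the digest.
    assert_eq!(block_triggers_hash(&triggers), block_triggers_hash(&triggers));
}
```

The key property is that the digest is a pure function of the provider-supplied data, so any divergence between two indexers' providers surfaces as a digest mismatch.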

This data gets filled in by whatever provider the node/indexer is using, which can be either **JSON-RPC** or the **Firehose**.

> Note: Each chain has its own `MappingTrigger` definition; we're using Ethereum's for simplicity.

The code that actually executes the mapping code for each trigger candidate looks like this (heavily simplified):

```rust
fn process_triggers(
    &self,
    triggers: Vec<C::TriggerData>,
) -> Result<(), MappingError> {
    for trigger in triggers {
        self.process_trigger(&trigger)?;
    }

    Ok(())
}
```
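As a hedged sketch of where the proposed hash could hook in (the names `TriggerData`, `Processor` and the signatures here are simplified stand-ins, not graph-node's real API), the loop above could fold each trigger into a running hasher and return the per-block digest that would be written to the proposed `bth` table:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical stand-ins for graph-node's real types.
#[derive(Hash)]
pub struct TriggerData {
    pub payload: Vec<u8>,
}

#[derive(Debug)]
pub struct MappingError;

pub struct Processor;

impl Processor {
    // Mapping execution elided; the real method runs the WASM handlers.
    fn process_trigger(&self, _trigger: &TriggerData) -> Result<(), MappingError> {
        Ok(())
    }

    /// Same shape as `process_triggers` above, but additionally
    /// accumulates a digest of every trigger the mappings saw.
    pub fn process_triggers(
        &self,
        triggers: Vec<TriggerData>,
    ) -> Result<u64, MappingError> {
        // `DefaultHasher` for illustration only; a real implementation
        // would use a cryptographic, cross-platform-stable hash.
        let mut hasher = DefaultHasher::new();
        for trigger in triggers {
            trigger.hash(&mut hasher);
            self.process_trigger(&trigger)?;
        }
        // This value would be persisted in the proposed `bth` record.
        Ok(hasher.finish())
    }
}

fn main() {
    let p = Processor;
    let run = |bytes: Vec<u8>| p.process_triggers(vec![TriggerData { payload: bytes }]).unwrap();
    // Identical provider input must produce identical digests.
    assert_eq!(run(vec![1, 2, 3]), run(vec![1, 2, 3]));
}
```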

This is executed per relevant block, given the subgraph's data sources (the contracts it's "listening" to). To give a bit of perspective in numbers, the `uniswap-v2` subgraph, for example, usually has around ~30 triggers per block.

The proposal here would be to hash this vector per block and save it in a database record (under the subgraph namespace `sgdX`), resulting in a value that could look like this:

```
id           | 61c7e6beb2cda15ca38fb09506abecd6429f53ce985b9509a0319ee9bb3b013d
subgraph_id  | QmZwNVFxXjUcSD42n5Lztg2D4DorZLUPGv8gTAXQpf5Wy9
block_number | 22062424
block_hash   | \x92cb334274a3f78c4a2bab96853992deca00fd57ee8c44c69c93dfd9f012f707
digest       | \x2b6ecfca88033617cf5bfbfd332ba593591bab1a955e049271d2fd7696a55a9e5d4024282b96a639f2077d0cdaa99000734768ae106498a8514f0819b64fc8f65c9637dd788e472e3287eb15e3945338c27e38eaec8a27268aca8cbb45f59d7cbb9893611b9e56f4850dc5696673f7a8c7b695d2d9127811856e1ccc76a942a29dca9e1a92287c807470c3dfd7d3f0fffeb66d29a22278a7746ddf4fb9a6f76a6ac11dd98cc65f5bfacda4ebdbca51e161fadaffd3434594886ded4301929e9a6098995277c9b36513a0592e06d0a6a8035506c281bd7e0e14e01149a9122cfd6e980f65519a8eb28b8612c245ef66f671761361d4317d53c9e4af03ed4999a0
provider     | mainnet-0
created_at   | 2022-05-13 22:27:08.304456+00
```

> **Reviewer comment:** Separately, we are looking to deploy a faster and more determinism-sensitive PoI. When using PoIv2 we could roll this information into the PoI instead of maintaining a separate table. There are tradeoffs to consider here, but it is worth considering. Eg: use a separate table as described when using PoIv1, and roll the hash into the canonical PoI when PoIv2 is enabled. That removes a table and allows determinism issues to be detected without a separate protocol. What do you think?

Technical considerations:

- It will probably make sense to first convert this `Vec<Trigger>` into another
  data structure, where there's only one `LightEthereumBlock` and all repeated
  data gets converted into a `Set` (eg: `BTreeSet`);
- On the hash itself:
  - The `digest` field can be a lot smaller; the value shown above is from a
    POI, whose properties we don't need to preserve;
  - It should be calculated in a separate thread/future from the mapping code
    execution, so it doesn't degrade performance;
- On the database side:
  - As mentioned before, this data should go under the subgraph's namespaced
    database, aka `sgdX`, just like the current POI table `poi2$`;
  - The table name could be something along the lines of `bth` (block triggers
    hash).

> **Reviewer comment:** Why will it make sense to convert an ordered collection to an unordered collection here?

> **Reviewer comment:** The new PoI has a much smaller "intermediate digest". Don't quote me on this, but I think it's about a 20x improvement. We should be able to use the new PoI methods whether or not PoIv2 is enabled (see above) by always using the latest. I'm not yet convinced that the value can be smaller here if we need a multi-set hash. The reason we use a multi-set hash is that when we have multiple data sources (for example, IPFS) and those data sources have unreliable availability, we need an efficient means to remove values from the hash. The seeming bloat here exists to have a high-quality hash which affords that feature. (The state space of a multi-set hash needs to be larger than the final digest, and we need to store the hasher state instead of the digest.) If we need to do the same thing (for example, an IPFS data source makes an RPC call), then we'll need PoI-like semantics.

> **Reviewer comment:** As an aside, the PoI can be calculated in a separate thread as well. But I'm generally skeptical of using threads in this way within graph-node. The maximum amount of parallelism that can be achieved is relative to the number of cores. We already have a unit of parallelism exploited natively when indexing in a production environment (the subgraphs themselves). Once that limit is reached (for example, by indexing more than num_cores worth of subgraphs, assuming they are CPU-bound), any additional parallelism will actually degrade performance, because it increases the overhead of making everything threaded (eg: message passing, locks, context switching). If pipelining is done well, only the slowest operation can be optimized. It's hard to say categorically that this is "wrong" because it's context-dependent, but performance tweaks like threading and caching need to be handled with care; they are not magical, and often cause the performance problems they are trying to solve.

> **Reviewer comment:** Another consideration for using threads is that there is overhead in converting values to owned, which is needed because passing references across threads rightfully angers the borrow checker. Consider: with modern hash functions that operate at the limit of memory bandwidth, it is generally faster to hash a string than to clone it.

> **Author reply:** That's fair; I suggested this mostly so as not to block indexing. This could change if we follow @lutter's suggestion about making this a determinism trace for debugging purposes.
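As a rough illustration of the `Vec<Trigger>` conversion mentioned in the first consideration (the types here are hypothetical simplifications; the real conversion would involve `LightEthereumBlock` and the full trigger data), the repeated block can be factored out so it is stored once while the per-trigger payloads land in an ordered `BTreeSet`:

```rust
use std::collections::BTreeSet;

// Hypothetical, simplified types for illustration only.
#[derive(Clone, PartialEq, Eq, PartialOrd, Ord)]
struct TriggerPayload {
    log_index: u64,
    data: Vec<u8>,
}

#[derive(Clone, PartialEq, Eq)]
struct Block {
    number: u64,
}

struct Trigger {
    block: Block,
    payload: TriggerPayload,
}

/// The deduplicated shape: the shared block stored once, plus an
/// ordered set of the per-trigger payloads.
struct BlockTriggers {
    block: Block,
    payloads: BTreeSet<TriggerPayload>,
}

fn dedup(triggers: Vec<Trigger>) -> Option<BlockTriggers> {
    let mut iter = triggers.into_iter();
    let first = iter.next()?;
    let mut out = BlockTriggers {
        block: first.block,
        payloads: BTreeSet::from([first.payload]),
    };
    for t in iter {
        // All triggers in one batch refer to the same block.
        debug_assert!(t.block == out.block);
        out.payloads.insert(t.payload);
    }
    Some(out)
}

fn main() {
    let block = Block { number: 14_722_714 };
    let triggers = vec![
        Trigger { block: block.clone(), payload: TriggerPayload { log_index: 0, data: vec![1] } },
        Trigger { block: block.clone(), payload: TriggerPayload { log_index: 1, data: vec![2] } },
    ];
    let deduped = dedup(triggers).unwrap();
    assert_eq!(deduped.payloads.len(), 2); // block stored once, payloads kept
}
```

Because `BTreeSet` iterates in a deterministic order, hashing the deduplicated structure yields the same digest regardless of the order in which the provider delivered the triggers.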

This value could then be exposed via the `indexing status API`, just like the POI, and used by indexers (or us) to compare whether the data `graph-node` fetched, given a `provider`, `subgraph` and `block`, yields the same result.

We could also add a utility `graphman` command that fetches that specific data for a given `provider`, `subgraph` and `block`. This data could either be compared with other providers in `graph-node` itself, or the task could perhaps be delegated to the new `graph-ixi` cross-checking tool.

This would allow us (and indexers!) to:

- Be confident that different providers give the same results;
- Have another tool to catch determinism bugs (very important given the
  properties we want to maintain);
- And, perhaps in the future, add value to the network itself beyond this
  side utility.

## Compatibility

There are no backwards compatibility concerns.

## Drawbacks and Risks

The one point that needs careful thought on the implementation side is how to calculate the `Vec<Trigger>` hash efficiently, so that it doesn't go over the same data more than once (eg: the same block).
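For intuition only: an order-independent ("multi-set"-style) combination, of the kind the reviewers discuss, can be sketched by XOR-folding per-trigger digests, so the block digest does not depend on trigger order and shared data need not be re-hashed per trigger. This toy version has real weaknesses (eg: an even number of identical triggers cancels out), which is exactly why production designs use proper multi-set hashes with a larger state space:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn digest<T: Hash>(value: &T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

/// XOR-fold individual trigger digests. Insertion order doesn't matter,
/// and a value can be removed by XOR-ing its digest in again.
/// Toy only: duplicate triggers cancel out, so this is NOT production-safe.
fn multiset_digest(triggers: &[Vec<u8>]) -> u64 {
    triggers.iter().fold(0u64, |acc, t| acc ^ digest(t))
}

fn main() {
    let a = vec![1u8, 2, 3];
    let b = vec![4u8, 5, 6];
    // Order-independent: [a, b] and [b, a] agree.
    assert_eq!(
        multiset_digest(&[a.clone(), b.clone()]),
        multiset_digest(&[b, a]),
    );
}
```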

Also, the new database table should keep the same properties as the other subgraph entity (and POI) tables, whose data is "re-made" on re-orgs, rewinds, etc.

## Alternatives

We could store each `MappingTrigger` hash in a separate database record; however, that would create many more database rows than necessary, in my opinion. This would make more sense if we were actually storing the data itself, not a hash of it.

## Open Questions

- What hash algorithm should we use?
- What table name should we use?
- Do we want to store any more provider data?

> Note: Ethereum calls aren't included in this RFC; however, since we already
> store those in a "call cache" table, we can just hash the stored data,
> perhaps in a different table.

> **Reviewer comment:** The examples given only pertain to block-level attributes. For those, it might make sense to actually store some sort of digest for each block per provider (instead of per subgraph).
>
> The differences that are usually the most difficult to spot are issues with `eth_call`, which are AFAICT not covered by this proposal, but it's reasonable to assume that divergence there leads to divergence in the PoI. Since this would all be more for forensics than to support regular queries, another approach to storing this data would be to better standardize the logging format that indexing uses, and to have tooling that can compare two indexing logs in a convenient way, rather than storing it in the database.
>
> How do you see this mechanism being used? From the description, it sounds like it would mostly be for testing `graph-node` itself and/or checking different providers, i.e. more a test tool than something that is needed in every production installation of `graph-node`. My take is that in production, it doesn't really matter that a provider supplies diverging data as long as it doesn't affect the PoI. If this is for provider testing, a tool that more directly exercises providers and compares their responses might actually be more usable and useful. That tool could, e.g., be a special indexing mode that runs the block stream against two (or more?) providers and compares their responses; making that more explicit would have the advantage that error messages could clearly state what part of the response is diverging, rather than just producing two separate digests and forcing us to investigate what caused the divergence.
>
> Long story short: it would be good to explain how the data collected here would be used, and what sort of operations it needs to enable.

> **Reviewer comment:** @lutter Jannis and I actually already implemented this back in the day. There is an environment variable `GRAPH_LOG_POI_EVENTS` to turn on PoI logging. The idea is that the hash is an optimistic check with lower performance overhead that allows you to index and compare a single value at the end; only if there is a discrepancy do you perform a binary search to find the offending block and then try again with logging. It's not an either/or, because what can be logged can be hashed.