diff --git a/rfcs/0009-block-triggers-hash.md b/rfcs/0009-block-triggers-hash.md
new file mode 100644
index 0000000..308f71e
--- /dev/null
+++ b/rfcs/0009-block-triggers-hash.md

# RFC-0009: Block Triggers hash
**Author**

Eva Paulino Pace

**RFC pull request**

https://github.com/graphprotocol/rfcs/pull/26

**Date of submission**

2022-05-16

**Date of approval**

YYYY-MM-DD
## Contents

## Summary

The "Block Triggers hash" will allow indexers to compare the output of their
subgraph's main source of entropy, the chain provider, by storing a hash of
the fetched data that is fed into a subgraph's mapping code.

## Goals & Motivation

One of The Graph's coolest features is that queries are deterministic: given a
`Qm` subgraph hash, indexing it should **always** give the same result. This
is possible because we inherit the blockchain's determinism. However, there is
a big loophole that can break this amazing feature: **the chain provider**.

Currently the main (or only) type of connection we offer indexers in The Graph
Network is **JSON-RPC**. To use it, they can either run a node themselves or
use a third-party service like Alchemy. Either way, the provider can be faulty
and return incorrect results for a number of different reasons.

To be a little more specific, let's say there are indexers/nodes `A` and `B`,
both indexing subgraph `Z`. Indexer `A` is using Alchemy and `B` is using
Infura.

Given block `14_722_714` with a given hash, both providers will very likely
return the same result for those two values (block number and hash), but other
fields such as `gas_used` or `total_difficulty` could be incorrect. Ideally
they would always be correct, since serving chain data is a provider's main
job; however, what I'm describing is the exact issue we faced when testing
indexing Ethereum mainnet with the Firehose.

These field/value differences between providers are fed directly into the
subgraph mappings, which are the current input of the POI algorithm and the
base of The Graph's determinism property. Not taking the possible faultiness
of chain providers into account can break determinism altogether.
And the biggest problem today is that, to spot these POI differences, we have
to index subgraphs that **use those values in their mappings**. If, by any
chance, the Firehose shootout we did in the **integration cluster** had
included no subgraphs using these values, **we wouldn't have spotted any POI
differences**, which is a **very severe issue**.

POI differences described in the Firehose shootout for reference:
https://gist.github.com/evaporei/660e57d95e6140ca877f338426cea200.

So, in summary, the problems described above are:

- We currently treat the **chain provider** as a **source of truth**, which
  can only be questioned in the case of re-orgs;
- We don't have a good way to compare provider input (which could spot POI
  differences) without the indirection of a subgraph mapping.

## Urgency

The urgency per se is low, because we can spot POI differences by using
subgraphs themselves. But in the long run it's very likely we'll need some way
to compare provider input between nodes/indexers, so that we're more efficient
and accurate when testing new providers or new forms of fetching chain data
(eg: the Firehose).

## Terminology

`bth` is a candidate table name for "block triggers hash" (subject to change).

## Detailed Design

A subgraph can have multiple handlers, each of which takes as its parameter
one of the following types (in Ethereum):

- Event log;
- Call;
- Block.
Currently this is the data structure we use in code to represent what gets
into a mapping handler ([source](https://github.com/graphprotocol/graph-node/blob/b8faa45fc96f53226648ca65ddacff164b75e018/chain/ethereum/src/trigger.rs#L44)):

```rust
pub enum MappingTrigger {
    Log {
        block: Arc<LightEthereumBlock>,
        transaction: Arc<Transaction>,
        log: Arc<Log>,
        params: Vec<LogParam>,
        receipt: Option<Arc<TransactionReceipt>>,
    },
    Call {
        block: Arc<LightEthereumBlock>,
        transaction: Arc<Transaction>,
        call: Arc<EthereumCall>,
        inputs: Vec<LogParam>,
        outputs: Vec<LogParam>,
    },
    Block {
        block: Arc<LightEthereumBlock>,
    },
}
```

> Note: This actually gets converted to an `AscEthereumLog`,
> `AscEthereumCall` or `AscEthereumBlock` so it can be passed correctly to
> the subgraph's AssemblyScript mapping/handler code.

This data is filled in by whatever provider the node/indexer is using, which
can be **JSON-RPC** or the **Firehose**.

> Note: Each chain has its own `MappingTrigger` definition; we're using
> Ethereum's for simplicity.

The code that actually executes the mapping code for each trigger candidate
looks like this (heavily simplified):

```rust
fn process_triggers(
    &self,
    triggers: Vec<MappingTrigger>,
) -> Result<(), MappingError> {
    for trigger in triggers {
        self.process_trigger(&trigger)?;
    }

    Ok(())
}
```

This is executed per relevant block, given the subgraph's data sources (the
contracts it's "listening" to). To give a bit of perspective in numbers, the
`uniswap-v2` subgraph, for example, usually has around ~30 triggers per block.
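As a rough, dependency-free illustration of what hashing this per-block
trigger vector could look like, here's a minimal Rust sketch. `Trigger` and
`block_triggers_hash` are hypothetical names (not graph-node code), and
`DefaultHasher` is only a stand-in for whatever real hash algorithm gets
chosen (see Open Questions):

```rust
// Illustration only: a hypothetical, heavily simplified stand-in for
// `MappingTrigger`. A real implementation should use a proper cryptographic
// hash; `DefaultHasher` keeps the sketch free of external dependencies.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Hash)]
enum Trigger {
    Log { log_index: u64, data: Vec<u8> },
    Call { input: Vec<u8> },
    Block,
}

/// Fold every trigger of a block into a single digest.
fn block_triggers_hash(block_number: u64, triggers: &[Trigger]) -> u64 {
    let mut hasher = DefaultHasher::new();
    block_number.hash(&mut hasher);
    for trigger in triggers {
        trigger.hash(&mut hasher);
    }
    hasher.finish()
}

fn main() {
    let triggers = vec![
        Trigger::Block,
        Trigger::Log { log_index: 0, data: vec![1, 2, 3] },
    ];
    // Two indexers fed identical provider data must produce the same digest.
    assert_eq!(
        block_triggers_hash(14_722_714, &triggers),
        block_triggers_hash(14_722_714, &triggers),
    );
}
```

The point of the sketch is only that the digest is a pure function of the
provider-supplied data: if two nodes disagree on it, their providers fed them
different bytes.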
The proposal here is to hash this vector per block and save it in a database
record (under the subgraph namespace `sgdX`), resulting in a value that could
look like this:

```
id | 61c7e6beb2cda15ca38fb09506abecd6429f53ce985b9509a0319ee9bb3b013d
subgraph_id | QmZwNVFxXjUcSD42n5Lztg2D4DorZLUPGv8gTAXQpf5Wy9
block_number | 22062424
block_hash | \x92cb334274a3f78c4a2bab96853992deca00fd57ee8c44c69c93dfd9f012f707
digest | \x2b6ecfca88033617cf5bfbfd332ba593591bab1a955e049271d2fd7696a55a9e5d4024282b96a639f2077d0cdaa99000734768ae106498a8514f0819b64fc8f65c9637dd788e472e3287eb15e3945338c27e38eaec8a27268aca8cbb45f59d7cbb9893611b9e56f4850dc5696673f7a8c7b695d2d9127811856e1ccc76a942a29dca9e1a92287c807470c3dfd7d3f0fffeb66d29a22278a7746ddf4fb9a6f76a6ac11dd98cc65f5bfacda4ebdbca51e161fadaffd3434594886ded4301929e9a6098995277c9b36513a0592e06d0a6a8035506c281bd7e0e14e01149a9122cfd6e980f65519a8eb28b8612c245ef66f671761361d4317d53c9e4af03ed4999a0
provider | mainnet-0
created_at | 2022-05-13 22:27:08.304456+00
```

Technical considerations:

- It will probably make sense to first convert this `Vec<MappingTrigger>` into
  another data structure, where there's only one `LightEthereumBlock` and all
  repeated data gets converted into a `Set` (eg: `BTreeSet`);
- On the hash itself:
  - The `digest` field can be a lot smaller; the value shown here is from a
    POI, whose properties we don't need to keep;
  - It should be calculated in a thread/future separate from the mapping code
    execution, so it doesn't degrade performance;
- On the database side:
  - As mentioned before, this data should go in the subgraph's namespaced
    database, aka `sgdX`, just like the current POI table `poi2$`;
  - The table name could be something along the lines of `bth` ("block
    triggers hash").
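The first consideration above could be sketched as follows. All of these
types (`LightBlock`, `BlockTriggers`, `digest_of`) are hypothetical, not
graph-node code, and `DefaultHasher` again stands in for a real hash; the
idea is just that the shared block is stored once and per-trigger digests go
into an ordered `BTreeSet`, so the repeated block data isn't hashed once per
trigger:

```rust
// Sketch: one shared block + an ordered set of per-trigger digests.
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeSet;
use std::hash::{Hash, Hasher};

#[derive(Hash)]
struct LightBlock {
    number: u64,
    hash: [u8; 32],
}

struct BlockTriggers {
    block: LightBlock,              // stored only once
    trigger_digests: BTreeSet<u64>, // ordered, deduplicated
}

fn digest_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

impl BlockTriggers {
    /// Combine the block and all trigger digests into one value.
    fn digest(&self) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.block.hash(&mut hasher);
        for d in &self.trigger_digests {
            d.hash(&mut hasher);
        }
        hasher.finish()
    }
}

fn main() {
    let mut bt = BlockTriggers {
        block: LightBlock { number: 22_062_424, hash: [0; 32] },
        trigger_digests: BTreeSet::new(),
    };
    bt.trigger_digests.insert(digest_of(&"Transfer log 0"));
    bt.trigger_digests.insert(digest_of(&"Transfer log 0")); // dedup: no-op
    assert_eq!(bt.trigger_digests.len(), 1);
    let _combined = bt.digest();
}
```

Using an ordered set also makes the final digest independent of the order in
which triggers were collected, which is one less source of spurious
differences between nodes.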
This value could then be exposed via the `indexing status API`, just like the
POI, and could be used by indexers (or us) to check whether the data
`graph-node` fetched, given a `provider`, `subgraph` and `block`, yields the
same result.

We could also add a utility `graphman` command that fetches that specific data
for a given `provider`, `subgraph` and `block`. This data could either be
compared with other providers in `graph-node` itself, or the task could
perhaps be delegated to the new `graph-ixi` cross-checking tool.

This would allow us (and indexers!) to:

- Be confident that different providers give the same results;
- Have another tool to catch determinism bugs (very important given the
  properties we want to maintain);
- And perhaps, in the future, add value to the network itself beyond this
  side utility.

## Compatibility

There are no backwards compatibility concerns.

## Drawbacks and Risks

The only point that should be thought through carefully on the implementation
side is how to calculate the `Vec<MappingTrigger>` hash efficiently, so that
it doesn't go over the same data more than once (eg: the same block).

Also, the new database table should keep the same properties that the
subgraph's other entity (and POI) tables have, whose data is "re-made" on
re-orgs, rewinds, etc.

## Alternatives

We could store each `MappingTrigger` hash in a separate database record;
however, that would create many more database rows than necessary, in my
opinion. Separate records would make more sense if we were actually storing
the data itself, not a hash of it.

## Open Questions

- What hash algorithm should we use?
- What table name should we use?
- Do we want to store any more provider data?

> Note: Ethereum calls aren't included in this RFC; however, since we already
> store those in a "call cache" table, we can just hash the stored data,
> perhaps in a different table.
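To close with a concrete picture of the cross-provider check described in the
Detailed Design: given two `bth` rows (shaped like the example record above)
fetched for the same subgraph and block from different providers, agreement
is a plain digest comparison. `BthRow` and `providers_agree` are hypothetical
names used only for this sketch:

```rust
// Hypothetical sketch of comparing `bth` rows from two providers.
struct BthRow {
    subgraph_id: String,
    block_number: u64,
    provider: String,
    digest: Vec<u8>,
}

/// True when both providers produced identical trigger data for the block.
fn providers_agree(a: &BthRow, b: &BthRow) -> bool {
    a.subgraph_id == b.subgraph_id
        && a.block_number == b.block_number
        && a.digest == b.digest
}

fn main() {
    let json_rpc = BthRow {
        subgraph_id: "QmZwNVFxXjUcSD42n5Lztg2D4DorZLUPGv8gTAXQpf5Wy9".into(),
        block_number: 22_062_424,
        provider: "mainnet-0".into(),
        digest: vec![0x2b, 0x6e, 0xcf],
    };
    let firehose = BthRow {
        subgraph_id: json_rpc.subgraph_id.clone(),
        block_number: json_rpc.block_number,
        provider: "firehose-0".into(),
        digest: vec![0x2b, 0x6e, 0xcf],
    };
    // A mismatch here would flag the provider pair for investigation.
    assert!(providers_agree(&json_rpc, &firehose));
}
```

Whether this comparison lives in a `graphman` command or in `graph-ixi`, the
logic itself stays this simple; all the real work is in producing the digests
deterministically.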