File: `rfcs/0009-block-triggers-hash.md` (226 additions, 0 deletions)
# RFC-0009: Block Triggers hash

<dl>
<dt>Author</dt>
<dd>Eva Paulino Pace</dd>

<dt>RFC pull request</dt>
<dd><a href="https://github.com/graphprotocol/rfcs/pull/26">https://github.com/graphprotocol/rfcs/pull/26</a></dd>

<dt>Date of submission</dt>
<dd>2022-05-16</dd>

<dt>Date of approval</dt>
<dd>YYYY-MM-DD</dd>
</dl>

## Contents

<!-- toc -->

## Summary

The "Block Triggers hash" will allow indexers to compare their subgraphs' main
source of entropy, the chain provider, by storing a hash of the fetched data
that is fed into each subgraph's mapping code.

## Goals & Motivation

One of The Graph's coolest features is that queries are deterministic: given a
`Qm` subgraph hash, indexing it should **always** give the same result. This is
possible because we inherit the blockchain's determinism property. However,
there is a big loophole that can break this feature: **the chain provider**.

Currently the main (or only) type of connection we offer indexers (in The Graph
Network) is **JSON-RPC**. To use it, they can either run a node themselves or
use a third-party service like Alchemy. Either way, the provider can be faulty
and give incorrect results for a number of different reasons.

To be a little more specific, let's say there are indexers/nodes `A` and `B`.
Both are indexing subgraph `Z`. Indexer `A` is using Alchemy and `B` is using
Infura.

Given a block `14_722_714` of a given hash, both providers will very likely
return the same result for these two values (block number and hash), however
other fields such as `gas_used` or `total_difficulty` could be incorrect. And
yes, ideally they would always be correct, since serving chain data is a
provider's main job; however, what I'm describing is the exact issue we faced
when testing indexing Ethereum mainnet with the Firehose.

These field/value differences between providers are directly fed into the
subgraph mappings, which are the current input of the POI algorithm and the
base of The Graph's determinism property. Not taking the possible faultiness
of chain providers into account can break determinism altogether.

And the biggest problem today is that, to spot these POI differences, we have
to index subgraphs that **use those values in their mappings**. If, by chance,
the Firehose shootout we did in the **integration cluster** had included no
subgraphs using these values, **we wouldn't have spotted any POI differences**,
which is a **very severe issue**.
> **Review comment (Contributor):** The examples given only pertain to block-level attributes. For those, it might make sense to actually store some sort of digest for each block per provider (instead of per subgraph).
>
> The differences that are usually the most difficult to spot are issues with `eth_call`, which are AFAICT not covered by this proposal, but it's reasonable to assume that divergence there leads to divergence in the PoI.
>
> Since this would all be more for forensics than to support regular queries, another approach to storing this data would be to better standardize the logging format that indexing uses, and have tooling that can compare two indexing logs in a convenient way, rather than store it in the database.
>
> How do you see this mechanism being used? From the description it sounds like it would mostly be for testing graph-node itself and/or checking different providers, i.e. more a test tool than something that is needed in every production installation of graph-node. My take is that in production it doesn't really matter that a provider supplies diverging data, as long as it doesn't affect the PoI.
>
> If this is for provider testing, a tool that more directly exercises providers and compares their responses might actually be more usable and useful. That tool could, e.g., be a special indexing mode that runs the block stream against two (or more?) providers and compares their responses; making that more explicit would have the advantage that error messages could clearly state what part of the response is diverging, rather than just producing two separate digests and forcing us to investigate what caused the divergence.
>
> Long story short: it would be good to explain how the data collected here would be used, and what sort of operations it needs to enable.

> **Review comment (@That3Percent, May 24, 2022):** @lutter
>
> > another approach to storing this data would be to standardize the logging format that indexing uses better, and have tooling that can compare two indexing logs in a convenient way
>
> Jannis and I actually already implemented this back in the day. There is an environment variable `GRAPH_LOG_POI_EVENTS` to turn on PoI logging. The idea is that the hash is an optimistic check with lower performance overhead that allows you to index and compare a single value at the end and, only if there is a discrepancy, perform a binary search to find the offending block, then try again with logging. It's not an either/or, because what can be logged can be hashed.

POI differences described in the Firehose shootout for reference:
https://gist.github.com/evaporei/660e57d95e6140ca877f338426cea200.

So, in summary, the problems described above are:

- That currently we treat the **chain provider** as a **source of truth**,
  which is only questioned in the case of re-orgs;
- That we don't have a good way to compare provider input (one that could spot
  POI differences) without the indirection of a subgraph mapping.

## Urgency

The urgency per se is low, because we can spot POI differences by using
subgraphs themselves. In the long run, though, it's very likely we'll need some
way to compare provider input between nodes/indexers, so that we're more
efficient and accurate when testing new providers or new ways of fetching chain
data (eg: Firehose).

## Terminology

`bth` is a candidate table name for "block triggers hash" (subject to change).

## Detailed Design

A subgraph can have multiple handlers, each of which takes as its parameter
one of the following types (in Ethereum):

- Event log;
- Call;
- Block.

Currently, this is the data structure we use in code to represent what
gets into a mapping handler ([source](https://github.com/graphprotocol/graph-node/blob/b8faa45fc96f53226648ca65ddacff164b75e018/chain/ethereum/src/trigger.rs#L44)):

```rust
pub enum MappingTrigger {
    Log {
        block: Arc<LightEthereumBlock>,
        transaction: Arc<Transaction>,
        log: Arc<Log>,
        params: Vec<LogParam>,
        receipt: Option<Arc<TransactionReceipt>>,
    },
    Call {
        block: Arc<LightEthereumBlock>,
        transaction: Arc<Transaction>,
        call: Arc<EthereumCall>,
        inputs: Vec<LogParam>,
        outputs: Vec<LogParam>,
    },
    Block {
        block: Arc<LightEthereumBlock>,
    },
}
```

> Note: This actually gets converted to either an `AscEthereumLog`,
> `AscEthereumCall` or `AscEthereumBlock` to be called correctly by
> the subgraph's AssemblyScript mapping/handler code.

This data gets filled in by whatever provider the node/indexer is using, which
can be either **JSON-RPC** or the **Firehose**.

> Note: Each chain has its own `MappingTrigger` definition; we're using
> Ethereum's for simplicity.
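One detail worth noticing in the enum above is that every trigger variant holds an `Arc<LightEthereumBlock>` pointing at the same block, which matters for hashing this data efficiently later. A small, self-contained sketch of that shape (the types here are reduced stubs, not graph-node's real definitions):

```rust
use std::sync::Arc;

// Stub stand-ins: the real types live in graph-node's chain/ethereum crate.
#[allow(dead_code)]
struct LightEthereumBlock {
    number: u64,
}

#[allow(dead_code)]
struct LogTrigger {
    block: Arc<LightEthereumBlock>,
}

// Every trigger for a block points at the same underlying block data,
// so naively hashing each trigger would hash the block once per trigger.
fn shared_block_refcount(n_triggers: usize) -> usize {
    let block = Arc::new(LightEthereumBlock { number: 14_722_714 });
    let triggers: Vec<LogTrigger> = (0..n_triggers)
        .map(|_| LogTrigger {
            block: Arc::clone(&block),
        })
        .collect();
    // One reference held by `block` itself, plus one per trigger.
    let count = Arc::strong_count(&block);
    drop(triggers);
    count
}
```

With 30 triggers in a block (a realistic number, as noted below for `uniswap-v2`), the same block data is reachable through 30 separate `Arc`s.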

The code that executes the mapping code for each trigger candidate looks like
this (heavily simplified):

```rust
fn process_triggers(
    &self,
    triggers: Vec<C::TriggerData>,
) -> Result<(), MappingError> {
    for trigger in triggers {
        self.process_trigger(&trigger)?;
    }

    Ok(())
}
```

This is executed per relevant block, given the subgraph's data sources (the
contracts it's "listening" to). To give some perspective in numbers, the
`uniswap-v2` subgraph, for example, usually has around 30 triggers per block.
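A minimal sketch of what the per-block hashing could look like. Everything here is hypothetical: `SimpleTrigger` is a reduced stand-in for `MappingTrigger`, and `DefaultHasher` is only a placeholder, since the actual hash algorithm is one of the open questions below:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical, reduced stand-in for the MappingTrigger enum shown above.
#[derive(Hash)]
pub enum SimpleTrigger {
    Log { tx_index: u64, log_index: u64 },
    Block { number: u64 },
}

// Fold all of a block's triggers into a single digest, producing one
// value per (subgraph, block) that providers can be compared on.
pub fn block_triggers_hash(triggers: &[SimpleTrigger]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for trigger in triggers {
        trigger.hash(&mut hasher);
    }
    hasher.finish()
}
```

The key property is that two nodes fed identical trigger data compute identical digests, so any divergence in provider input shows up as a digest mismatch for that block.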

The proposal here is to hash this vector per block and save it in a database
record (under the subgraph namespace `sgdX`), resulting in a value that could
look like this:

```
id           | 61c7e6beb2cda15ca38fb09506abecd6429f53ce985b9509a0319ee9bb3b013d
subgraph_id  | QmZwNVFxXjUcSD42n5Lztg2D4DorZLUPGv8gTAXQpf5Wy9
block_number | 22062424
block_hash   | \x92cb334274a3f78c4a2bab96853992deca00fd57ee8c44c69c93dfd9f012f707
digest       | \x2b6ecfca88033617cf5bfbfd332ba593591bab1a955e049271d2fd7696a55a9e5d4024282b96a639f2077d0cdaa99000734768ae106498a8514f0819b64fc8f65c9637dd788e472e3287eb15e3945338c27e38eaec8a27268aca8cbb45f59d7cbb9893611b9e56f4850dc5696673f7a8c7b695d2d9127811856e1ccc76a942a29dca9e1a92287c807470c3dfd7d3f0fffeb66d29a22278a7746ddf4fb9a6f76a6ac11dd98cc65f5bfacda4ebdbca51e161fadaffd3434594886ded4301929e9a6098995277c9b36513a0592e06d0a6a8035506c281bd7e0e14e01149a9122cfd6e980f65519a8eb28b8612c245ef66f671761361d4317d53c9e4af03ed4999a0
provider     | mainnet-0
created_at   | 2022-05-13 22:27:08.304456+00
```

> **Review comment (Contributor):** Separately, we are looking to deploy a faster and more determinism-sensitive PoI. When using PoIv2 we could roll this information into the PoI instead of maintaining a separate table. There are tradeoffs to consider here, but it is worth considering. Eg: use a separate table as described when using PoIv1, and roll the hash into the canonical PoI when PoIv2 is enabled. That removes a table and allows determinism issues to be detected without a separate protocol. What do you think?
> **Review comment (Contributor):** Regarding the storage format:
>
> - we don't need `id` and `subgraph_id`; they are implicit from the schema name `sgdNNNN`
> - we only need the block number, not the hash, since in the context of a subgraph the block number is unambiguous (it must be on the chain going back from the subgraph head)
> - `provider` is a bit tricky: users can change the endpoint in the configuration and keep the same provider name. If we want to track that, we should have a global `providers(id: int4, url, label)` table in the primary and store the provider id here
> - I am not sure how useful the `created_at` column would be; it will add 8 bytes to each row
>
> I would actually just use the current `poi2$` table and add columns `provider_id`, `created_at` (maybe), and `input_digest` or some such.

> **Reply (Author):** Nice idea on the `subgraph_id`! I was thinking about `created_at` in the sense of not trusting that provider data will be immutable over time.

Technical considerations:

- It will probably make sense to first convert this `Vec<Trigger>` into another
  data structure, where there's only one `LightEthereumBlock` and all repeated
  data gets converted into a `Set` (eg: `BTreeSet`);

> **Review comment (Contributor):** Why will it make sense to convert an ordered collection to an unordered collection here?

> **Reply (Author):** The idea with the `Set`s and with this separate structure is to facilitate hashing. Right now, all `Trigger`s have an `Arc` that's pointing to the same `Block`. IMO we should have a single structure that could be traversed once, without having to hash the same data twice.
- On the hash itself:
  - The `digest` field can be a lot smaller; the value shown here is from a
    POI, whose properties we don't need to keep;

> **Review comment (Contributor):** The new PoI has a much smaller "intermediate digest". Don't quote me on this, but I think it's about a 20x improvement. We should be able to use the new PoI methods whether or not PoIv2 is enabled (see above) by always using the latest `stable-hash` when a separate table is used.
>
> I'm not yet convinced that the value can be smaller here if we need a multi-set hash. The reason we use a multi-set hash is that when we have multiple data sources (for example, IPFS) and those data sources have unreliable availability, we need an efficient means to remove values from the hash. The seeming bloat here exists to have a high-quality hash which affords that feature. (The state space of a multi-set hash needs to be larger than the final digest, and we need to store the hasher state instead of the digest.)
>
> If we need to do the same thing (for example, an IPFS data source makes an RPC call) then we'll need PoI-like semantics. It may also be convenient to re-use `stable-hash` anyway.

> **Reply (Author):** Ok!
  - It should be calculated in a separate thread/future from the mapping code
    execution, so it doesn't degrade performance;

> **Review comment (Contributor):** As an aside, the PoI can be calculated in a separate thread as well. But I'm generally skeptical of using threads in this way within graph-node.
>
> The maximum amount of parallelism that can be achieved is relative to the number of cores. We already have a unit of parallelism exploited natively when indexing in a production environment (the subgraphs themselves). Once that limit is reached (for example, by indexing more than `num_cores` worth of subgraphs, assuming they are CPU-bound), any additional parallelism will actually degrade performance, because it increases the overhead of making everything threaded (eg: message passing, locks, context switching, etc). If pipelining is done well, only the slowest operation can be optimized.
>
> It's hard to say categorically that this is "wrong" because it's context-dependent, but performance tweaks like threading and caching need to be handled with care, because they are not magical and often cause the very performance problems they are trying to solve.

> **Review comment (Contributor):** Another consideration for using threads is that there is overhead in converting values to owned, which is needed because passing references across threads rightfully angers the borrow checker. Consider: with modern hash functions that operate at the limit of memory bandwidth, it is generally faster to hash a string than to clone it! You can't pass references across threads, so we're getting into `Arc` or `Clone` territory, which adds non-negligible overhead.

> **Reply (Author):** That's fair; I suggested this more so as not to block indexing. This could change if we follow @lutter's suggestion about making this a determinism trace for debugging purposes, so that it wouldn't affect graph-node in production "mode".
- On the database side:
  - As mentioned before, this data should go under the subgraph's namespaced
    database schema, aka `sgdX`, just like where the current POI table `poi2$`
    stands;
  - The table name could be something along the lines of `bth` (block triggers
    hash).
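The restructuring suggested above (a single block plus `Set`s of repeated data) could be sketched as follows. The field names are illustrative only, and `DefaultHasher` is a placeholder for whatever algorithm is eventually chosen:

```rust
use std::collections::{hash_map::DefaultHasher, BTreeSet};
use std::hash::{Hash, Hasher};

// Hypothetical per-block input structure: the block data appears once,
// and per-trigger identifiers live in ordered sets, so the digest is
// computed in a single traversal and the block is hashed exactly once.
#[derive(Hash)]
pub struct BlockTriggers {
    pub block_number: u64,
    pub block_hash: [u8; 32],
    /// (transaction_index, log_index) pairs; illustrative only.
    pub logs: BTreeSet<(u64, u64)>,
}

pub fn block_triggers_digest(triggers: &BlockTriggers) -> u64 {
    let mut hasher = DefaultHasher::new();
    triggers.hash(&mut hasher);
    hasher.finish()
}
```

A useful side effect of `BTreeSet` is that iteration order is fixed by the element ordering, so the digest does not depend on the order in which triggers were collected from the provider.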

This value could then be exposed via the `indexing status API`, just like the
POI, and indexers (or us) could use it to check whether the data `graph-node`
fetched, given a `provider`, `subgraph` and `block`, yields the same result
across nodes.

We could also add a utility `graphman` command that fetches that specific data
for a given `provider`, `subgraph` and `block`. This data could then either be
compared against other providers in `graph-node` itself, or the task could
perhaps be delegated to the new `graph-ixi` cross-checking tool.

This would allow us (and indexers!) to:

- Be confident that different providers give the same results;
- Have another tool to catch determinism bugs (very important given the
  properties we want to maintain);
- And perhaps, in the future, add value to the network itself beyond this
  side utility.

## Compatibility

There are no backwards compatibility concerns.

## Drawbacks and Risks

The only point that should be thought through carefully on the implementation
side is how to calculate the `Vec<Trigger>` hash efficiently, so that it
doesn't go over the same data more than once (eg: the block shared by every
trigger).

Also, the new database table should keep the same properties as the subgraph's
other entity (and POI) tables, whose data is "re-made" on re-orgs, rewinds,
etc.

## Alternatives

We could store each `MappingTrigger` hash in a separate database record,
however that would create many more database rows than necessary, in my
opinion. It would make more sense if we were actually storing the data
itself, not a hash of it.

## Open Questions

- What hash algorithm to use?
- What table name should we use?
- Do we want to store any more provider data?

> Note: Ethereum calls aren't being included in this RFC, however since we
> already store those in a "call cache" table, we can just hash the stored
> data, perhaps in a different table.