Conversation

@evaporei (Contributor) commented May 16, 2022

@evaporei self-assigned this May 16, 2022

@That3Percent (Contributor) left a comment:

Thanks for your thoughtful RFC @evaporei. I like this direction but have some considerations for you. Let me know what you think.


> Technical considerations:
>
> - It will probably make sense to first convert this `Vec<Trigger>` into another
>   data structure, where there's only one `LightEthereumBlock` and all repeated
>   data gets converted into a `Set` (eg: `BTreeSet`);

@That3Percent (Contributor):

Why will it make sense to convert an ordered collection to an unordered collection here?

@evaporei (Author):

The idea with the Sets and with this separate structure is to make hashing easier. Right now all Triggers have an `Arc` pointing to the same Block. IMO we should have a single structure that can be traversed once, without having to hash the same data twice.
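
For illustration, a minimal sketch of the shape I have in mind; the names (`BlockTriggers`, the byte-vector payloads) are hypothetical, not the actual graph-node types:

```rust
use std::collections::BTreeSet;

// Illustrative stand-in for the real block type.
struct LightEthereumBlock {
    number: u64,
    hash: [u8; 32],
}

// Instead of `Vec<Trigger>` where every trigger holds an `Arc` to the
// same block, keep the block once and deduplicate repeated payloads in
// sets, so hashing traverses each piece of data exactly once.
struct BlockTriggers {
    block: LightEthereumBlock,
    log_triggers: BTreeSet<Vec<u8>>,  // deduplicated log payloads
    call_triggers: BTreeSet<Vec<u8>>, // deduplicated call payloads
}
```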

> ```
> subgraph_id  | QmZwNVFxXjUcSD42n5Lztg2D4DorZLUPGv8gTAXQpf5Wy9
> block_number | 22062424
> block_hash   | \x92cb334274a3f78c4a2bab96853992deca00fd57ee8c44c69c93dfd9f012f707
> digest       | \x2b6ecfca88033617cf5bfbfd332ba593591bab1a955e049271d2fd7696a55a9e5d4024282b96a639f2077d0cdaa99000734768ae106498a8514f0819b64fc8f65c9637dd788e472e3287eb15e3945338c27e38eaec8a27268aca8cbb45f59d7cbb9893611b9e56f4850dc5696673f7a8c7b695d2d9127811856e1ccc76a942a29dca9e1a92287c807470c3dfd7d3f0fffeb66d29a22278a7746ddf4fb9a6f76a6ac11dd98cc65f5bfacda4ebdbca51e161fadaffd3434594886ded4301929e9a6098995277c9b36513a0592e06d0a6a8035506c281bd7e0e14e01149a9122cfd6e980f65519a8eb28b8612c245ef66f671761361d4317d53c9e4af03ed4999a0
> ```

@That3Percent (Contributor):

Separately, we are looking to deploy a faster and more determinism-sensitive PoI. When using PoIv2 we could roll this information into the PoI instead of maintaining a separate table. There are tradeoffs here, but it is worth considering. Eg: use a separate table as described when using PoIv1, and roll the hash into the canonical PoI when PoIv2 is enabled. That removes a table and allows determinism issues to be detected without a separate protocol. What do you think?

> - On the hash itself
>   - The `digest` field can be a lot smaller, the value here is from a POI,
>     which we don't need to keep the same properties;
>   - It should be calculated in a separate thread/future to the mapping code

@That3Percent (Contributor):

As an aside, the PoI can be calculated in a separate thread as well. But I'm generally skeptical of using threads in this way within graph-node.

The maximum amount of parallelism that can be achieved is relative to the number of cores. We already have a unit of parallelism exploited natively when indexing in a production environment (the subgraphs themselves). Once that limit is reached (for example, by indexing more than `num_cores` worth of subgraphs, assuming they are CPU-bound), any additional parallelism will actually degrade performance, because it increases the overhead of making everything threaded (eg: message passing, locks, context switching). Even when pipelining is done well, throughput is bounded by the slowest stage, so only optimizing that stage helps.

It's hard to say categorically that this is "wrong" because it's context-dependent, but performance tweaks like threading and caching need to be handled with care: they are not magical, and they often cause the very performance problems they are trying to solve.

@That3Percent (Contributor):

Another consideration for using threads is that there is overhead in converting values to owned, which is needed because passing references across threads rightfully angers the borrow checker. Consider: with modern hash functions that operate at the limit of memory bandwidth, it is generally faster to hash a string than to clone it! Since you can't pass references across threads, we're getting into `Arc` or `Clone` territory, which adds non-negligible overhead.
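
A toy sketch of that tradeoff, using the std hasher only to show the shape; both functions are illustrative, not graph-node code:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::thread;

// Borrow the data: one pass over the bytes, no allocation.
fn hash_in_place(data: &str) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

// Moving the work to another thread forces an owned copy (or an Arc).
// The clone alone already costs a full pass over the bytes plus an
// allocation, comparable to just hashing in place.
fn hash_on_thread(data: &str) -> u64 {
    let owned = data.to_owned();
    thread::spawn(move || {
        let mut h = DefaultHasher::new();
        owned.hash(&mut h);
        h.finish()
    })
    .join()
    .unwrap()
}
```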

@evaporei (Author):

That's fair, I suggested this more so as not to block indexing. This could change if we follow @lutter's suggestion about making this a determinism trace for debugging purposes, so it wouldn't affect graph-node in production mode.

> - On the hash itself
>   - The `digest` field can be a lot smaller, the value here is from a POI, […]

@That3Percent (Contributor):

The new PoI has a much smaller "intermediate digest". Don't quote me on this but I think it's about a 20x improvement. We should be able to use the new PoI methods whether or not PoIv2 is enabled (see above) by always using the latest stable-hash when a separate table is used.

I'm not yet convinced that the value can be smaller here if we need a multi-set hash. The reason we use a multi-set hash is because when we have multiple data sources (for example, IPFS) and those data sources have unreliable availability, we need an efficient means to remove values from the hash. The seeming bloat here exists to have a high-quality hash which affords that feature. (The state space of a multi-set hash needs to be larger than the final digest, and we need to store the hasher state instead of the digest).

If we need to do the same thing (for example, an IPFS data source makes an RPC call) then we'll need PoI-like semantics. It may also be convenient to re-use stable-hash anyway.
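
To make the state-space point concrete, here's a toy additive multi-set hash (not stable-hash or what graph-node uses, just the general construction):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy multi-set hash: the accumulator supports O(1) insert *and* remove
// (subtraction), which is exactly what unreliable data sources need. A
// real construction would use a much wider accumulator than u128 to keep
// collisions hard, hence the seeming bloat relative to a plain digest.
#[derive(Default)]
struct MultisetHash {
    acc: u128,
}

impl MultisetHash {
    fn element_hash<T: Hash>(value: &T) -> u128 {
        let mut h = DefaultHasher::new();
        value.hash(&mut h);
        h.finish() as u128
    }

    fn insert<T: Hash>(&mut self, value: &T) {
        self.acc = self.acc.wrapping_add(Self::element_hash(value));
    }

    // Removing an element is just subtraction; this only works because
    // we store the hasher *state* rather than a finalized digest.
    fn remove<T: Hash>(&mut self, value: &T) {
        self.acc = self.acc.wrapping_sub(Self::element_hash(value));
    }
}
```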

@evaporei (Author):

Ok!

@That3Percent (Contributor):

Also linking here: graphprotocol/graph-node#3554 (comment) as an alternative to prevent determinism issues with verifiable queries rather than detect them after the fact.

> ```
> digest     | \x2b6ecfca88033617cf5bfbfd332ba593591bab1a955e049271d2fd7696a55a9e5d4024282b96a639f2077d0cdaa99000734768ae106498a8514f0819b64fc8f65c9637dd788e472e3287eb15e3945338c27e38eaec8a27268aca8cbb45f59d7cbb9893611b9e56f4850dc5696673f7a8c7b695d2d9127811856e1ccc76a942a29dca9e1a92287c807470c3dfd7d3f0fffeb66d29a22278a7746ddf4fb9a6f76a6ac11dd98cc65f5bfacda4ebdbca51e161fadaffd3434594886ded4301929e9a6098995277c9b36513a0592e06d0a6a8035506c281bd7e0e14e01149a9122cfd6e980f65519a8eb28b8612c245ef66f671761361d4317d53c9e4af03ed4999a0
> provider   | mainnet-0
> created_at | 2022-05-13 22:27:08.304456+00
> ```

@lutter (Contributor):

Regarding the storage format:

- we don't need `id` and `subgraph_id`, they are implicit from the schema name `sgdNNNN`
- we only need the block number, not the hash, since in the context of a subgraph the block number is unambiguous (it must be on the chain going back from the subgraph head)
- `provider` is a bit tricky: users can change the endpoint in the configuration and keep the same provider name. If we want to track that, we should have a global `providers(id: int4, url, label)` table in the primary and store the provider id here
- I am not sure how useful the `created_at` column would be; it will add 8 bytes to each row

I would actually just use the current `poi2$` table and add columns `provider_id`, `created_at` (maybe), and `input_digest` or some such.

@evaporei (Author):

Nice idea on the `subgraph_id`! I was thinking of `created_at` as a safeguard, since I don't trust that provider data will stay immutable over time.

> […] to index subgraphs that **use those values in their mappings**. If, by any chance,
> there were no subgraphs using these values in the Firehose shootout we've done in the
> **integration cluster**, **we wouldn't spot any POI differences**, which is a
> **very severe issue**.

@lutter (Contributor):

The examples given only pertain to block-level attributes. For those, it might make sense to actually store some sort of digest for each block per provider (instead of per subgraph).

The differences that are usually the most difficult to spot are issues with `eth_call`, which are AFAICT not covered by this proposal, though it's reasonable to assume that divergence there leads to divergence in the PoI.

Since this would all be more for forensics than to support regular queries, another approach to storing this data would be to standardize the logging format that indexing uses better, and have tooling that can compare two indexing logs in a convenient way, rather than store it in the database.

How do you see this mechanism being used? From the description it sounds like it would mostly be for testing graph-node itself and/or checking different providers, i.e. more a test tool than something that is needed in every production installation of graph-node. My take is that in production it doesn't really matter that a provider supplies diverging data, as long as it doesn't affect the PoI.

If this is for provider testing, a tool that more directly exercises providers and compares their responses might actually be more usable and useful. That tool could, e.g., be a special indexing mode that runs the block stream against two (or more?) providers and compares their responses; making that more explicit would have the advantage that error messages could clearly state what part of the response is diverging, rather than just producing two separate digests and forcing us to investigate what caused the divergence.

Long story short: it would be good to explain how the data that is collected here would be used, and what sort of operations it needs to enable.
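
A rough sketch of what that comparison mode could look like; `ProviderClient` and `TriggerData` are made-up names, not graph-node APIs:

```rust
// Hypothetical shapes for the comparison. Diffing field by field is what
// lets the error say *what* diverged instead of just "digests differ".
#[derive(Debug, PartialEq)]
struct TriggerData {
    block_hash: String,
    logs: Vec<String>,
    calls: Vec<String>,
}

trait ProviderClient {
    fn name(&self) -> &str;
    fn triggers_for_block(&self, number: u64) -> TriggerData;
}

fn compare_block(a: &dyn ProviderClient, b: &dyn ProviderClient, number: u64) {
    let (ta, tb) = (a.triggers_for_block(number), b.triggers_for_block(number));
    if ta.block_hash != tb.block_hash {
        eprintln!("block {number}: hash differs between {} and {}", a.name(), b.name());
    }
    if ta.logs != tb.logs {
        eprintln!("block {number}: logs differ between {} and {}", a.name(), b.name());
    }
    if ta.calls != tb.calls {
        eprintln!("block {number}: calls differ between {} and {}", a.name(), b.name());
    }
}
```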

@That3Percent (Contributor) commented May 24, 2022:

@lutter

> another approach to storing this data would be to standardize the logging format that indexing uses better, and have tooling that can compare two indexing logs in a convenient way

Jannis and I actually already implemented this back in the day. There is an environment variable `GRAPH_LOG_POI_EVENTS` to turn on PoI logging. The idea is that the hash is an optimistic check with lower performance overhead: it lets you index and compare a single value at the end, and only if there is a discrepancy do you perform a binary search to find the offending block, then try again with logging. It's not an either/or, because whatever can be logged can be hashed.
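
The binary-search step, sketched under the assumption that we have some way to re-check agreement up to a given block (`pois_agree_up_to` is a hypothetical oracle, not an existing graph-node function):

```rust
/// Returns the first block at which the PoIs diverge, given a block that
/// is known to agree (`lo`) and one that is known to diverge (`hi`).
fn first_divergent_block(
    mut lo: u64,
    mut hi: u64,
    pois_agree_up_to: impl Fn(u64) -> bool,
) -> u64 {
    // Invariant: PoIs agree at `lo` and diverge at `hi`.
    while hi - lo > 1 {
        let mid = lo + (hi - lo) / 2;
        if pois_agree_up_to(mid) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    // `hi` is now the offending block to re-run with logging enabled.
    hi
}
```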

@That3Percent (Contributor):

@lutter Agree on all points.

@evaporei (Author):

@That3Percent @lutter thanks for the review and the great considerations! After I investigate a few determinism issues that showed up recently, I'll come back to this and see what I can improve.
