Conversation

@evaporei (Contributor) commented May 16, 2022

@evaporei self-assigned this May 16, 2022

@That3Percent (Contributor) left a comment:

Thanks for your thoughtful RFC @evaporei. I like this direction but have some considerations for you. Let me know what you think.


> Technical considerations:
>
> - It will probably make sense to first convert this `Vec<Trigger>` into another
>   data structure, where there's only one `LightEthereumBlock` and all repeated
>   data gets converted into a `Set` (eg: `BTreeSet`);

@That3Percent (Contributor):

Why will it make sense to convert an ordered collection to an unordered collection here?

@evaporei (Author):

The idea with the Sets and with this separate structure is to make hashing easier. Right now all Triggers have an `Arc` pointing to the same Block. IMO we should have a single structure that can be traversed once, without having to hash the same data twice.
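
For illustration, a minimal sketch of the shape I have in mind; the names (`BlockTriggers`, the byte-vector payloads) are hypothetical, not the actual graph-node types:

```rust
use std::collections::BTreeSet;

// Illustrative stand-in for the real block type.
struct LightEthereumBlock {
    number: u64,
    hash: [u8; 32],
}

// Instead of `Vec<Trigger>` where every trigger holds an `Arc` to the
// same block, keep the block once and deduplicate repeated payloads in
// sets, so hashing traverses each piece of data exactly once.
struct BlockTriggers {
    block: LightEthereumBlock,
    log_triggers: BTreeSet<Vec<u8>>,  // deduplicated log payloads
    call_triggers: BTreeSet<Vec<u8>>, // deduplicated call payloads
}
```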

> ```
> subgraph_id  | QmZwNVFxXjUcSD42n5Lztg2D4DorZLUPGv8gTAXQpf5Wy9
> block_number | 22062424
> block_hash   | \x92cb334274a3f78c4a2bab96853992deca00fd57ee8c44c69c93dfd9f012f707
> digest       | \x2b6ecfca88033617cf5bfbfd332ba593591bab1a955e049271d2fd7696a55a9e5d4024282b96a639f2077d0cdaa99000734768ae106498a8514f0819b64fc8f65c9637dd788e472e3287eb15e3945338c27e38eaec8a27268aca8cbb45f59d7cbb9893611b9e56f4850dc5696673f7a8c7b695d2d9127811856e1ccc76a942a29dca9e1a92287c807470c3dfd7d3f0fffeb66d29a22278a7746ddf4fb9a6f76a6ac11dd98cc65f5bfacda4ebdbca51e161fadaffd3434594886ded4301929e9a6098995277c9b36513a0592e06d0a6a8035506c281bd7e0e14e01149a9122cfd6e980f65519a8eb28b8612c245ef66f671761361d4317d53c9e4af03ed4999a0
> ```

@That3Percent (Contributor):

Separately, we are looking to deploy a faster and more determinism-sensitive PoI. When using PoIv2 we could roll this information into the PoI instead of maintaining a separate table. There are tradeoffs here, but it is worth considering. Eg: use a separate table as described when using PoIv1, and roll the hash into the canonical PoI when PoIv2 is enabled. That removes a table and allows determinism issues to be detected without a separate protocol. What do you think?

> - On the hash itself
>   - The `digest` field can be a lot smaller, the value here is from a POI,
>     which we don't need to keep the same properties;
>   - It should be calculated in a separate thread/future to the mapping code

@That3Percent (Contributor):

As an aside, the PoI can be calculated in a separate thread as well. But I'm generally skeptical of using threads in this way within graph-node.

The maximum amount of parallelism that can be achieved is relative to the number of cores. We already have a unit of parallelism exploited natively when indexing in a production environment (the subgraphs themselves). Once that limit is reached (for example, by indexing more than `num_cores` worth of subgraphs, assuming they are CPU-bound), any additional parallelism will actually degrade performance, because it increases the overhead of making everything threaded (eg: message passing, locks, context switching). Even when pipelining is done well, throughput is bounded by the slowest stage, so only optimizing that stage helps.

It's hard to say categorically that this is "wrong" because it's context-dependent, but performance tweaks like threading and caching need to be handled with care: they are not magical, and they often cause the very performance problems they are trying to solve.

@That3Percent (Contributor):

Another consideration for using threads is that there is overhead in converting values to owned, which is needed because passing references across threads rightfully angers the borrow checker. Consider: with modern hash functions that operate at the limit of memory bandwidth, it is generally faster to hash a string than to clone it! Since you can't pass references across threads, we're getting into `Arc` or `Clone` territory, which adds non-negligible overhead.
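
A toy sketch of that tradeoff, using the std hasher only to show the shape; both functions are illustrative, not graph-node code:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::thread;

// Borrow the data: one pass over the bytes, no allocation.
fn hash_in_place(data: &str) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

// Moving the work to another thread forces an owned copy (or an Arc).
// The clone alone already costs a full pass over the bytes plus an
// allocation, comparable to just hashing in place.
fn hash_on_thread(data: &str) -> u64 {
    let owned = data.to_owned();
    thread::spawn(move || {
        let mut h = DefaultHasher::new();
        owned.hash(&mut h);
        h.finish()
    })
    .join()
    .unwrap()
}
```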

@evaporei (Author):

That's fair, I suggested this more so as not to block indexing. This could change if we follow @lutter's suggestion about making this a determinism trace for debugging purposes, so it wouldn't affect graph-node in production mode.

> - On the hash itself
>   - The `digest` field can be a lot smaller, the value here is from a POI, […]

@That3Percent (Contributor):

The new PoI has a much smaller "intermediate digest". Don't quote me on this but I think it's about a 20x improvement. We should be able to use the new PoI methods whether or not PoIv2 is enabled (see above) by always using the latest stable-hash when a separate table is used.

I'm not yet convinced that the value can be smaller here if we need a multi-set hash. The reason we use a multi-set hash is because when we have multiple data sources (for example, IPFS) and those data sources have unreliable availability, we need an efficient means to remove values from the hash. The seeming bloat here exists to have a high-quality hash which affords that feature. (The state space of a multi-set hash needs to be larger than the final digest, and we need to store the hasher state instead of the digest).

If we need to do the same thing (for example, an IPFS data source makes an RPC call) then we'll need PoI-like semantics. It may also be convenient to re-use stable-hash anyway.
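
To make the state-space point concrete, here's a toy additive multi-set hash (not stable-hash or what graph-node uses, just the general construction):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy multi-set hash: the accumulator supports O(1) insert *and* remove
// (subtraction), which is exactly what unreliable data sources need. A
// real construction would use a much wider accumulator than u128 to keep
// collisions hard, hence the seeming bloat relative to a plain digest.
#[derive(Default)]
struct MultisetHash {
    acc: u128,
}

impl MultisetHash {
    fn element_hash<T: Hash>(value: &T) -> u128 {
        let mut h = DefaultHasher::new();
        value.hash(&mut h);
        h.finish() as u128
    }

    fn insert<T: Hash>(&mut self, value: &T) {
        self.acc = self.acc.wrapping_add(Self::element_hash(value));
    }

    // Removing an element is just subtraction; this only works because
    // we store the hasher *state* rather than a finalized digest.
    fn remove<T: Hash>(&mut self, value: &T) {
        self.acc = self.acc.wrapping_sub(Self::element_hash(value));
    }
}
```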

@evaporei (Author):

Ok!

@That3Percent (Contributor):

Also linking here: graphprotocol/graph-node#3554 (comment) as an alternative to prevent determinism issues with verifiable queries rather than detect them after the fact.

> ```
> digest     | \x2b6ecfca88033617cf5bfbfd332ba593591bab1a955e049271d2fd7696a55a9e5d4024282b96a639f2077d0cdaa99000734768ae106498a8514f0819b64fc8f65c9637dd788e472e3287eb15e3945338c27e38eaec8a27268aca8cbb45f59d7cbb9893611b9e56f4850dc5696673f7a8c7b695d2d9127811856e1ccc76a942a29dca9e1a92287c807470c3dfd7d3f0fffeb66d29a22278a7746ddf4fb9a6f76a6ac11dd98cc65f5bfacda4ebdbca51e161fadaffd3434594886ded4301929e9a6098995277c9b36513a0592e06d0a6a8035506c281bd7e0e14e01149a9122cfd6e980f65519a8eb28b8612c245ef66f671761361d4317d53c9e4af03ed4999a0
> provider   | mainnet-0
> created_at | 2022-05-13 22:27:08.304456+00
> ```

@lutter (Contributor):

Regarding the storage format:

- we don't need `id` and `subgraph_id`, they are implicit from the schema name `sgdNNNN`
- we only need the block number, not the hash, since in the context of a subgraph the block number is unambiguous (it must be on the chain going back from the subgraph head)
- `provider` is a bit tricky: users can change the endpoint in the configuration and keep the same provider name. If we want to track that, we should have a global `providers(id: int4, url, label)` table in the primary and store the provider id here
- I am not sure how useful the `created_at` column would be; it will add 8 bytes to each row

I would actually just use the current `poi2$` table and add columns `provider_id`, `created_at` (maybe), and `input_digest` or some such.

@evaporei (Author):

Nice idea on the `subgraph_id`! I was thinking of `created_at` as a safeguard, since I don't trust that provider data will stay immutable over time.

> […] to index subgraphs that **use those values in their mappings**. If, by any chance,
> there were no subgraphs using these values in the Firehose shootout we've done in the
> **integration cluster**, **we wouldn't spot any POI differences**, which is a
> **very severe issue**.

@lutter (Contributor):

The examples given only pertain to block-level attributes. For those, it might make sense to actually store some sort of digest for each block per provider (instead of per subgraph).

The differences that are usually the most difficult to spot are issues with `eth_call`, which are AFAICT not covered by this proposal, though it's reasonable to assume that divergence there leads to divergence in the PoI.

Since this would all be more for forensics than to support regular queries, another approach to storing this data would be to standardize the logging format that indexing uses better, and have tooling that can compare two indexing logs in a convenient way, rather than store it in the database.

How do you see this mechanism being used? From the description it sounds like it would mostly be for testing graph-node itself and/or checking different providers, i.e. more a test tool than something that is needed in every production installation of graph-node. My take is that in production it doesn't really matter that a provider supplies diverging data, as long as it doesn't affect the PoI.

If this is for provider testing, a tool that more directly exercises providers and compares their responses might actually be more usable and useful. That tool could, e.g., be a special indexing mode that runs the block stream against two (or more?) providers and compares their responses; making that more explicit would have the advantage that error messages could clearly state what part of the response is diverging, rather than just producing two separate digests and forcing us to investigate what caused the divergence.

Long story short: it would be good to explain how the data that is collected here would be used, and what sort of operations it needs to enable.
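
A rough sketch of what that comparison mode could look like; `ProviderClient` and `TriggerData` are made-up names, not graph-node APIs:

```rust
// Hypothetical shapes for the comparison. Diffing field by field is what
// lets the error say *what* diverged instead of just "digests differ".
#[derive(Debug, PartialEq)]
struct TriggerData {
    block_hash: String,
    logs: Vec<String>,
    calls: Vec<String>,
}

trait ProviderClient {
    fn name(&self) -> &str;
    fn triggers_for_block(&self, number: u64) -> TriggerData;
}

fn compare_block(a: &dyn ProviderClient, b: &dyn ProviderClient, number: u64) {
    let (ta, tb) = (a.triggers_for_block(number), b.triggers_for_block(number));
    if ta.block_hash != tb.block_hash {
        eprintln!("block {number}: hash differs between {} and {}", a.name(), b.name());
    }
    if ta.logs != tb.logs {
        eprintln!("block {number}: logs differ between {} and {}", a.name(), b.name());
    }
    if ta.calls != tb.calls {
        eprintln!("block {number}: calls differ between {} and {}", a.name(), b.name());
    }
}
```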

@That3Percent (Contributor) commented May 24, 2022:

@lutter

> another approach to storing this data would be to standardize the logging format that indexing uses better, and have tooling that can compare two indexing logs in a convenient way

Jannis and I actually already implemented this back in the day. There is an environment variable `GRAPH_LOG_POI_EVENTS` to turn on PoI logging. The idea is that the hash is an optimistic check with lower performance overhead: it lets you index and compare a single value at the end, and only if there is a discrepancy do you perform a binary search to find the offending block, then try again with logging. It's not an either/or, because whatever can be logged can be hashed.
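
The binary-search step, sketched under the assumption that we have some way to re-check agreement up to a given block (`pois_agree_up_to` is a hypothetical oracle, not an existing graph-node function):

```rust
/// Returns the first block at which the PoIs diverge, given a block that
/// is known to agree (`lo`) and one that is known to diverge (`hi`).
fn first_divergent_block(
    mut lo: u64,
    mut hi: u64,
    pois_agree_up_to: impl Fn(u64) -> bool,
) -> u64 {
    // Invariant: PoIs agree at `lo` and diverge at `hi`.
    while hi - lo > 1 {
        let mid = lo + (hi - lo) / 2;
        if pois_agree_up_to(mid) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    // `hi` is now the offending block to re-run with logging enabled.
    hi
}
```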

@That3Percent (Contributor):

@lutter Agree on all points.

@evaporei (Author):

@That3Percent @lutter thanks for the review and the great considerations! After I investigate a few determinism issues that showed up recently, I'll come back to this and see what I can improve.
