Improve type checking and constraint checking; treat input as set. #673
ncalexan wants to merge 9 commits into
Conversation
| /// `TypedValue`. | ||
| /// | ||
| pub type ValueRc<T> = Rc<T>; | ||
| pub type ValueRc<T> = Arc<T>; |
Can we make this its own tiny PR?
| conflicting_upserts: BTreeMap<TempId, BTreeSet<KnownEntid>>, | ||
| }, | ||
|
|
||
| /// A transaction tried to assert a datom or datoms with the wrong value `v type(s). |
You're missing a backtick.
| @@ -0,0 +1,46 @@ | |||
| // Copyright 2016 Mozilla | |||
| } | ||
| }).collect(); | ||
|
|
||
| errors |
If you put `errors.collect()` here instead of on line 43, you don't need the type declaration on line 33.
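(Illustrative aside, not the PR's code: the point is that collecting at the spot where the return type is known lets inference supply the map type, so the intermediate binding needs no annotation. A minimal sketch with a toy function:)

```rust
use std::collections::BTreeMap;

// Toy function: the `-> BTreeMap<i64, i64>` return type drives inference at the
// point of `collect`, so no type annotation is needed on the intermediate iterator.
fn evens_squared(input: Vec<i64>) -> BTreeMap<i64, i64> {
    let pairs = input.into_iter().filter(|n| n % 2 == 0).map(|n| (n, n * n));
    // Collecting here, where the return type is known:
    pairs.collect()
}

fn main() {
    let m = evens_squared(vec![1, 2, 3, 4]);
    assert_eq!(m.get(&4), Some(&16));
}
```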
| let errors: BTreeMap<_, _> = terms.into_iter().filter_map(|term| { | ||
| match term { | ||
| &Term::AddOrRetract(_, e, (a, attribute), ref v) => { | ||
| if attribute.value_type != v.value_type() { |
I suspect that for programmatic usage we should fail quickly rather than slowly; walking every term one more time seems like a worse choice than inserting a check at the point that we construct the AttributedTerm. Thoughts?
I've reworked this to use a trie, and to get rid of AttributedTerm altogether. (I was trying to mix filter_map with Result, which is hard -- hence AttributedTerm. Without filter_map, no need for the transform in the middle.) I've kept the informative errors; it's easy to add a reporting mode that bails more eagerly later.
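(A rough illustration of why dropping `filter_map` simplifies error collection, using made-up `Term`/`TypeError` types rather than Mentat's real ones:)

```rust
// Hedged sketch with made-up Term/TypeError types (not Mentat's): instead of
// mixing `filter_map` with `Result`, a plain loop accumulates every offending
// item while keeping the happy path simple.
#[derive(Debug)]
struct Term { value: i64 }

#[derive(Debug)]
struct TypeError { value: i64 }

fn check_all(terms: &[Term]) -> Result<(), Vec<TypeError>> {
    let mut errors = Vec::new();
    for term in terms {
        if term.value < 0 {
            // Record the problem and keep going, so the caller sees every error.
            errors.push(TypeError { value: term.value });
        }
    }
    if errors.is_empty() { Ok(()) } else { Err(errors) }
}

fn main() {
    let terms = vec![Term { value: 1 }, Term { value: -2 }];
    if let Err(errors) = check_all(&terms) {
        println!("{} malformed term(s): {:?}", errors.len(), errors);
    }
}
```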
| // attribute in a transaction efficiently. | ||
| // | ||
| // - Second, it provides a layer of assurance that the transactor is correct by asserting | ||
| // that a single dato `[e a v]` is added or retracted at most once in a single | ||
| // transaction. | ||
| r#"CREATE UNIQUE INDEX idx_transactions_unique_tx_a_e_v ON transactions (tx, a, e, v)"#, |
Good opportunity to write a disk-level migration as an example, and avoid breaking any test DBs…
Meh, I don't really care to do this. I'm going to leave as is and think more about what indexing we really want post-trie.
| // the same [e a v] twice in one transaction, but we don't want to represent the transacted | ||
| // datom twice in datoms. Hence we IGNORE repeated datoms here. This is a choice: we could | ||
| // also push the uniqueness constraints into the search results themselves, or SELECT DISTINCT | ||
| // from the search results. This exploits the existing UNIQUE indices on the `datoms` table. |
Another approach to doing this is one I remember outlining in #535 — if you build a trie instead of a collection of terms, you have a perfect opportunity to dedupe, and you don't need to do it here. Thoughts?
Yeah, I'm happy to do this. There are a few transformations that would improve the efficiency of this code:
- not collecting `Vec<>` as frequently -- much easier now that `impl Trait` is stable
- turning the main (hard) loop into an iterator, again avoiding a `collect`
- turning `final_terms` and the type/cardinality checking into a single pass that collects a better data structure.
It's not clear that a trie really wins here -- would you be happy with nested maps, like I'm using in the cardinality checking phase?
I mostly mean a trie in the sense of shared prefixing, so yes.
Consider that the transactor always requires a fixed attribute. That becomes our root (just as it is for the cache!).
The entities and values can be lookup refs, functions, tempids, or known values.
We can build, for each attribute, entity to single value mappings for cardinality-one attributes, entity to multiple values mappings for cardinality-many. We can track value to entity reverse mappings for unique attributes. Intra-transaction cardinality violations and duplicates become obvious — they're hash collisions. Uniqueness violations/collisions/smushing, similarly. Perhaps two-tempids-are-the-same is as simple as "oh, the entity for this unique value is another tempid, they must be the same".
I think it should be possible to rephrase the various resolution transitions in the transactor phases as movements between these maps, or replacements of entries with one kind of Either for another.
We lose some amount of clarity, and might get a little bit less type system support, but in exchange we would get different kinds of collisions, and of course prefix compression, for free.
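(For concreteness, a hedged sketch of that shared-prefix shape, with toy stand-ins for `Entid` and `TypedValue`; this is not the transactor's actual data structure:)

```rust
use std::collections::{BTreeMap, BTreeSet};

// Toy stand-ins for Entid and TypedValue.
type Entid = i64;
type Value = String;

#[derive(Default)]
struct AttributeBucket {
    // Cardinality-one: entity -> single value; a second, different value for the
    // same entity is an intra-transaction cardinality conflict.
    single: BTreeMap<Entid, Value>,
    // Cardinality-many: entity -> set of values; duplicate datoms collapse for free.
    multi: BTreeMap<Entid, BTreeSet<Value>>,
    // Unique attributes: value -> entity reverse mapping; a collision here is either
    // a uniqueness violation or evidence that two tempids should be the same entity.
    unique: BTreeMap<Value, Entid>,
}

// The attribute is the root, just as it is for the cache.
type PerAttribute = BTreeMap<Entid, AttributeBucket>;

fn main() {
    let mut trie = PerAttribute::new();
    let bucket = trie.entry(65).or_insert_with(AttributeBucket::default);
    bucket.multi.entry(100).or_insert_with(BTreeSet::new).insert("x".to_string());
    bucket.multi.entry(100).or_insert_with(BTreeSet::new).insert("x".to_string());
    // The repeated datom collapsed: set semantics.
    assert_eq!(bucket.multi[&100].len(), 1);
    // The other maps are unused in this tiny example.
    let _ = (&bucket.single, &bucket.unique);
}
```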
| ]"#); | ||
|
|
||
| // Observe that "foo" and "zot" upsert to the same entid, and that doesn't cause a | ||
| // cardinality constraint, because we treat the input with set semantics and accept |
Nit: constraint -> conflict
| OpType::Retract => &mut retracted, | ||
| }; | ||
| map | ||
| .entry(e).or_insert(BTreeMap::default()) |
| where I: IntoIterator<Item=&'a AttributedTerm<'a>> { | ||
| let mut errors = vec![]; | ||
|
|
||
| let mut added: BTreeMap<Entid, BTreeMap<(Entid, bool), BTreeSet<&TypedValue>>> = BTreeMap::default(); |
You're basically building a trie here. If you slightly tweak your approach, at this point you can:
- Look up attributes, find the value type once, and make sure each incoming value is of the right type.
- Implement the one/many distinction in two tries, using the presence of an `Entry` as a cardinality failure.
- Drop `Term`s at this point and use the tries instead.
It seems wasteful to do all of this work and then not use the output for other things!
Also: it would be worth considering whether a BTreeMap and BTreeSet are the right data structures to use. Perhaps switch between Hash* and BTree*, use the cargo CLI to `.i fixtures/cities.schema` and `.i fixtures/all_seattle.edn`, and see which is faster? I suspect the hash versions will be better.
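(A small sketch of the "presence of an `Entry` as a cardinality failure" idea above, again with toy types rather than Mentat's:)

```rust
use std::collections::btree_map::Entry;
use std::collections::BTreeMap;

type Entid = i64;
type Value = String;

// For a cardinality-one attribute, an occupied entry at insert time means the
// transaction already asserted a value for this (entity, attribute) pair.
fn insert_card_one(
    map: &mut BTreeMap<Entid, Value>,
    e: Entid,
    v: Value,
) -> Result<(), (Entid, Value, Value)> {
    match map.entry(e) {
        Entry::Vacant(slot) => {
            slot.insert(v);
            Ok(())
        }
        // Identical repeats are fine under set semantics; differing values conflict.
        Entry::Occupied(existing) => {
            if existing.get() == &v {
                Ok(())
            } else {
                Err((e, existing.get().clone(), v))
            }
        }
    }
}

fn main() {
    let mut m = BTreeMap::new();
    assert!(insert_card_one(&mut m, 1, "a".into()).is_ok());
    assert!(insert_card_one(&mut m, 1, "a".into()).is_ok()); // duplicate datom, OK
    assert!(insert_card_one(&mut m, 1, "b".into()).is_err()); // cardinality conflict
}
```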
rnewman left a comment
I hesitate to say this is using a sledgehammer to crack a nut, but it kind of is!
My overall fear with this is that the transactor is accruing transformation after transformation — term to term to term, then into maps and maps and graphs to detect duplicates and equivalences, and so on — each of which adds expense, but is in large part present to handle edge cases.
I'm interested to know if we can apply just one of these many hammers to do more of the work. If we pick a different data structure, can we do more work in fewer passes?
| /// Entities that look like: | ||
| /// - [:db/add TEMPID b OTHERID], where b is not :db.unique/identity; | ||
| /// - [:db/add TEMPID b v], where b is not :db.unique/identity. | ||
| /// - [:db/add TEMPID b OTHERID]. where b is not :db.unique/identity; |
| /// are the tempids that appeared in [:db/add ...] entities, but that didn't upsert to existing | ||
| /// entids. | ||
| pub(crate) fn temp_ids_in_allocations(&self) -> BTreeSet<TempIdHandle> { | ||
| /// After evolution is complete, yield the set of tempids that require entid allocation are the |
Finish rewriting this sentence :)
| } | ||
|
|
||
| temp_ids | ||
| // Now we union-find all the known tempids. Two tempids are unioned if they both appears as |
) r=rnewman @mmacedoeu did a good deal of work to show that Arc instead of Rc wasn't too difficult in #656, and @rnewman pushed the refactoring across the line in #659. However, we didn't flip the switch at that time. For #673, we'd like to include TypedValue instances in errors, and with error-chain (and failure) error types need to be 'Sync + 'Send, so we need Arc. This builds on #659 and should also finish #656.
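(A minimal sketch of why the switch matters for error types: `Arc` is `Send + Sync`, so a value captured in an error can satisfy the bounds that error-chain/failure require, whereas `Rc` cannot. Toy code, not Mentat's:)

```rust
use std::sync::Arc;

// With Arc, a value that ends up inside an error type can still satisfy the
// Send + Sync bounds that error-chain/failure demand.
pub type ValueRc<T> = Arc<T>;

fn assert_send_sync<T: Send + Sync>() {}

fn main() {
    // Compiles with Arc<String>; the equivalent assertion for Rc<String> would not,
    // because Rc is neither Send nor Sync.
    assert_send_sync::<ValueRc<String>>();
}
```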
Force-pushed from 86a37a1 to c6dab2b.
@rnewman ok, back to you. This is a little wonky, 'cuz I dropped the part about …
rnewman left a comment
I'm going to need to read this change closely, but figured I'd hit "Request changes" first 'cos you've missed a bunch of the nits I left in the last review — just read through the diff and take a look!
| // We're about to write, so go straight ahead and get an IMMEDIATE transaction. | ||
| let tx = self.sqlite.transaction_with_behavior(TransactionBehavior::Immediate)?; | ||
| // Applying the transaction can fail, so we don't unwrap. | ||
| let details = transact_terms(&tx, self.partition_map.clone(), &self.schema, &self.schema, NullWatcher(), terms, tempid_set)?; |
I think you're missing a watcher argument somewhere… you shouldn't need to create one here.
I looked through the nits pretty closely, and have a final commit addressing the ones I thought needed to be addressed. Can you reflag at least a couple so I can see what I missed?
| // - Second, it provides a layer of assurance that the transactor is correct by asserting | ||
| // that a single datom `[e a v]` is added or retracted at most once in a single | ||
| // transaction. | ||
| r#"CREATE UNIQUE INDEX idx_transactions_unique_tx_a_e_v ON transactions (tx, a, e, v)"#, |
Nit: I think you noted that you wanted to get rid of this.
| // datom twice, so we exploit the idx_transactions_unique_tx_a_e_v on transactions to IGNORE | ||
| // duplicated datoms coming from our search results. This is a choice: we could also ensure | ||
| // that the search results satisfied the uniqueness constraints, or we could SELECT DISTINCT | ||
| // from the search results. We're doing this since we need an index no transactions anyway, so |
| ]"#); | ||
|
|
||
| // Observe that "foo" and "zot" upsert to the same entid, and that doesn't cause a | ||
| // cardinality constraint, because we treat the input with set semantics and accept |
Nit: constraint -> conflict
| /// Entities that look like: | ||
| /// - [:db/add TEMPID b OTHERID], where b is not :db.unique/identity; | ||
| /// - [:db/add TEMPID b v], where b is not :db.unique/identity. | ||
| /// - [:db/add TEMPID b OTHERID]. where b is not :db.unique/identity; |
This should address mozilla#663, by re-inserting type checking in the transactor stack after the entry point used by the term builder.
Before this commit, we were using an SQLite UNIQUE index to assert that no `[e a]` pair, with `a` a cardinality one attribute, was asserted more than once. However, that's not in line with Datomic, which treats transaction inputs as a set and allows a single datom like `[e a v]` to appear multiple times. It's both awkward and not particularly efficient to look for _distinct_ repetitions in SQL, so we accept some runtime cost in order to check for repetitions in the transactor. This will allow us to address mozilla#532, which is really about whether we treat inputs as sets. A side benefit is that we can provide more helpful error messages when the transactor does detect that the input truly violates the cardinality constraints of the schema.
This completes the transition to Datomic-like input-as-set semantics, which allows us to address mozilla#532. Previously, two tempids that upserted to the same entid would produce duplicate datoms, and that would have been rejected by the transactor -- correctly, since we did not allow duplicate datoms under the input-as-list semantics. With input-as-set semantics, duplicate datoms are allowed; and that means that we must allow tempids to be equivalent, i.e., to resolve to the same tempid. To achieve this, we:
- index the set of tempids
- identify tempid indices that share an upsert
- map tempids to a dense set of contiguous integer labels

We use the well-known union-find algorithm, as implemented by petgraph, to efficiently manage the set of equivalent tempids.
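(A hedged sketch of those three steps using petgraph's `UnionFind`; the tempids and the dense indexing here are illustrative only, and it assumes `petgraph` is available as a dependency:)

```rust
extern crate petgraph;

use petgraph::unionfind::UnionFind;
use std::collections::BTreeMap;

fn main() {
    // 1. Index the set of tempids densely.
    let tempids = ["foo", "bar", "zot"];
    let index: BTreeMap<&str, usize> =
        tempids.iter().enumerate().map(|(i, &t)| (t, i)).collect();

    // 2. Union tempids that upserted to the same entid.
    let mut uf: UnionFind<usize> = UnionFind::new(tempids.len());
    uf.union(index["foo"], index["zot"]);

    // 3. Each tempid now maps to a representative label; equivalent tempids share one.
    assert_eq!(uf.find(index["foo"]), uf.find(index["zot"]));
    assert_ne!(uf.find(index["bar"]), uf.find(index["foo"]));
}
```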
I don't know quite what happened here. The Clojure implementation correctly kept complex upserts that hadn't resolved as complex upserts (see https://github.com/mozilla/mentat/blob/9a9dfb502acf5e4cdb1059d4aac831d7603063c8/src/common/datomish/transact.cljc#L436) and then allocated complex upserts if they didn't resolve (see https://github.com/mozilla/mentat/blob/9a9dfb502acf5e4cdb1059d4aac831d7603063c8/src/common/datomish/transact.cljc#L509). Based on the code comments, I think the Rust implementation must have incorrectly tried to optimize by handling all complex upserts in at most a single generation of evolution, and that's just not correct. We're effectively implementing a topological sort, using very specific domain knowledge, and it's not true that a node in a topological sort can be considered only once!
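(An illustrative sketch, not the transactor's algorithm, of why a single generation isn't enough: unresolved items have to be revisited until a pass makes no further progress.)

```rust
use std::collections::BTreeMap;

// Toy fixpoint loop: `pending` maps an item to the item it depends on; resolved
// items map to themselves. We keep sweeping until a pass makes no progress.
fn resolve_to_fixpoint(
    mut pending: BTreeMap<&'static str, &'static str>,
) -> BTreeMap<&'static str, &'static str> {
    loop {
        let mut progressed = false;
        let keys: Vec<_> = pending.keys().cloned().collect();
        for k in keys {
            let dep = pending[k];
            let dep_of_dep = pending[dep];
            if dep != dep_of_dep {
                pending.insert(k, dep_of_dep);
                progressed = true;
            }
        }
        if !progressed {
            return pending;
        }
    }
}

fn main() {
    let mut pending = BTreeMap::new();
    pending.insert("a", "b"); // "a" resolves once "b" does
    pending.insert("b", "c"); // "b" resolves once "c" does
    pending.insert("c", "d"); // "c" resolves once "d" does
    pending.insert("d", "d"); // "d" is already resolved
    let resolved = resolve_to_fixpoint(pending);
    // Considering each node only once would leave "a" short of "d"; iterating to a
    // fixpoint resolves the whole chain.
    assert_eq!(resolved["a"], "d");
}
```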
…l terms. This should be more efficient. It also allows a simpler expression of input-provided :db/txInstant datoms, which in turn uncovered a small issue with the transaction watcher, whereby the watcher would not see non-input-provided :db/txInstant datoms.
Force-pushed from c6dab2b to 854cbf9.
| conflicting_datoms: BTreeMap<(Entid, Entid, TypedValue), ValueType> | ||
| }, | ||
|
|
||
| /// A transaction tried to assert a datoms that don't observe the schema's cardinality constraints. |
| Ok(()) | ||
| }, | ||
| &TypeDisagreements { ref conflicting_datoms } => { | ||
| write!(f, "type disagreements:\n")?; |
writeln! would be clearer.
| &TypeDisagreements { ref conflicting_datoms } => { | ||
| write!(f, "type disagreements:\n")?; | ||
| for (ref datom, expected_type) in conflicting_datoms { | ||
| write!(f, " expected value of type {} but got datom [{} {} {:?}]\n", expected_type, datom.0, datom.1, datom.2)?; |
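(A minimal illustration of the `writeln!` nit above: `writeln!` appends the newline itself, so the trailing `\n` in the format string goes away. Hypothetical snippet, not the PR's actual `Display` impl:)

```rust
use std::fmt::{self, Write};

// Render into any fmt::Write sink; writeln! supplies the trailing newline.
fn render(f: &mut String) -> fmt::Result {
    writeln!(f, "type disagreements:")?;
    writeln!(f, "  expected value of type {} but got datom [{} {} {:?}]", "long", 1, 2, "three")?;
    Ok(())
}

fn main() {
    let mut out = String::new();
    render(&mut out).unwrap();
    assert!(out.ends_with('\n'));
}
```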
| // We need to ensure that callers can't blindly transact entities that haven't been | ||
| // allocated by this store. | ||
|
|
||
| let errors = tx_checking::type_disagreements(&aev_trie); |
I'm a bit surprised that this isn't a side-effect of insertion. After all, you know the attribute at the moment you're finalizing that datom, and so you also know which type the value must be…
This is a choice. As I said (on Slack?) we really want to compute in two different monads: one that exits early and one that collects all errors, for the best consumer experience. I think it's harder to go from one-error to collecting-errors than the reverse, so I implemented the global one first.
| bail!(ErrorKind::SchemaConstraintViolation(errors::SchemaConstraintViolation::TypeDisagreements { conflicting_datoms: errors })); | ||
| } | ||
|
|
||
| let errors = tx_checking::cardinality_conflicts(&aev_trie); |
Same here. You know there are two assertions because you encounter the other one while you're inserting…
| fn extend_aev_trie<'schema, I>(schema: &'schema Schema, terms: I, trie: &mut AEVTrie<'schema>) -> Result<()> | ||
| where I: IntoIterator<Item=TermWithoutTempIds> | ||
| { | ||
| for Term::AddOrRetract(op, KnownEntid(e), a, v) in terms.into_iter() { |
Consider making this a peekable iterator. If the next datom uses the same attribute, save some work. Even better if the next datom also uses the same entity…
I think there will come a time when such micro-optimizations are worth the effort, but I don't think it is now.
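(For reference, a toy sketch of the kind of saving being suggested; the sorted-by-attribute assumption and the `require_attribute_for_entid` mention are illustrative only:)

```rust
type Entid = i64;

// When the incoming terms are sorted by attribute, remembering the last attribute
// id means runs of datoms sharing an attribute need only one schema lookup.
fn count_schema_lookups(terms: &[(Entid, Entid)]) -> usize {
    let mut lookups = 0;
    let mut last_attribute: Option<Entid> = None;
    for &(_e, a) in terms {
        if last_attribute != Some(a) {
            // In the real transactor this is where the attribute would be fetched
            // (e.g. something like require_attribute_for_entid); here we just count.
            lookups += 1;
            last_attribute = Some(a);
        }
        // ... type and cardinality checks would use the cached attribute here ...
    }
    lookups
}

fn main() {
    // Three datoms over two attributes, sorted by attribute: two lookups suffice.
    assert_eq!(count_schema_lookups(&[(1, 65), (2, 65), (3, 66)]), 2);
}
```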
| where I: IntoIterator<Item=TermWithoutTempIds> | ||
| { | ||
| for Term::AddOrRetract(op, KnownEntid(e), a, v) in terms.into_iter() { | ||
| let attribute: &Attribute = schema.require_attribute_for_entid(a)?; |
At this moment you have the Attribute, so you can type-check v. (And if you do so at this point, you can probably not bother keeping the Attribute ref around — you only need it to decide multi/single and FTS/non.)
|
|
||
| /// Ensure that the given terms type check. | ||
| /// | ||
| /// We try to be maximally helpful by yielding every malformed datom, rather than only the first. |
I would rather see this done by receiving an error on insert of each datom and collecting the errors, continuing anyway. It won't catch every error, but it'll catch most, and it's a streaming approach that fails early in the common case.
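(A hedged sketch of that streaming shape, with a made-up `insert_checked` and a trivial stand-in "type check"; errors are collected while insertion continues:)

```rust
#[derive(Debug)]
enum CheckError {
    WrongType { e: i64, a: i64 },
}

fn insert_checked(store: &mut Vec<(i64, i64, i64)>, e: i64, a: i64, v: i64) -> Result<(), CheckError> {
    // Pretend attribute 65 only accepts non-negative values.
    if a == 65 && v < 0 {
        return Err(CheckError::WrongType { e, a });
    }
    store.push((e, a, v));
    Ok(())
}

fn main() {
    let mut store = Vec::new();
    let datoms = [(1, 65, 10), (2, 65, -1), (3, 66, 5)];

    let mut errors = Vec::new();
    for &(e, a, v) in &datoms {
        if let Err(err) = insert_checked(&mut store, e, a, v) {
            // Record the failure but keep inserting the rest.
            errors.push(err);
        }
    }

    println!("inserted {}, rejected {:?}", store.len(), errors);
}
```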
|
|
||
| for (&(a, attribute), evs) in aev_trie { | ||
| for (&e, ref ars) in evs { | ||
| if !attribute.multival && ars.add.len() > 1 { |
I opted to instead have separate single/multi-val data structures when building the cache. Naturally that only works if you don't need to keep multiple values for cardinality-one with late error reporting!
| // for tempids in sorted order. This has the effect of making "a" allocate before "b", | ||
| // which is pleasant for testing. | ||
| for (tempid, tempid_index) in temp_ids { | ||
| let rep = uf.find_mut(tempid_index); |
Does this handle transitive equivalence? That is, if "foo" is equivalent to "bar" because of :foo/x 123, and "bar" is equivalent to "baz" because of :foo/y 456, do "foo", "bar", and "baz" all end up in the same bucket here?
Yes, but let me make sure there's a test of exactly this. You want:
[[:db/add "foo" :foo/x 123]
[:db/add "bar" :foo/y 456]
{:db/id "baz" :foo/x 123 :foo/y 456}]
to unify all of "foo", "bar", "baz"? I guess we should test with those datoms present and not present. I've added a test that covers this.
In this PR you're making it so that
[:db/add "foo" :foo/x 123]
[:db/add "bar" :foo/x 123]
correctly unify; the next step is to make sure that
[:db/add "foo" :foo/x 123]
[:db/add "bar" :foo/x 123]
[:db/add "bar" :foo/y 456]
[:db/add "baz" :foo/y 456]
works. The subsequent insert itself will fail if it's not possible for all three entities to be the same, I think.
I've landed this in 46c2a08. Still not perfect, but better than what we had, and I need to get it out of my queue.
This addresses #532 and should address #663. It's ready for initial review but it's not quite there yet -- part 2 needs additional tests to verify that it really does address #663, and part 3 needs a list of routine tests so that issues are found in a dedicated portion of the test suite, and so that we verify that the error output is as desired.