Improve type checking and constraint checking; treat input as set. #673
ncalexan wants to merge 9 commits into
Conversation
| /// `TypedValue`. | ||
| /// | ||
| pub type ValueRc<T> = Rc<T>; | ||
| pub type ValueRc<T> = Arc<T>; |
Can we make this its own tiny PR?
| conflicting_upserts: BTreeMap<TempId, BTreeSet<KnownEntid>>, | ||
| }, | ||
|
|
||
| /// A transaction tried to assert a datom or datoms with the wrong value `v type(s). |
You're missing a backtick.
| @@ -0,0 +1,46 @@ | |||
| // Copyright 2016 Mozilla | |||
| } | ||
| }).collect(); | ||
|
|
||
| errors |
If you put `errors.collect()` here instead of on line 43, you don't need the type declaration on line 33.
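(Illustrative aside, not the PR's code: the point is that collecting at the spot where the return type is known lets inference supply the map type, so the intermediate binding needs no annotation. A minimal sketch with a toy function:)

```rust
use std::collections::BTreeMap;

// Toy function: the `-> BTreeMap<i64, i64>` return type drives inference at the
// point of `collect`, so no type annotation is needed on the intermediate iterator.
fn evens_squared(input: Vec<i64>) -> BTreeMap<i64, i64> {
    let pairs = input.into_iter().filter(|n| n % 2 == 0).map(|n| (n, n * n));
    // Collecting here, where the return type is known:
    pairs.collect()
}

fn main() {
    let m = evens_squared(vec![1, 2, 3, 4]);
    assert_eq!(m.get(&4), Some(&16));
}
```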
| let errors: BTreeMap<_, _> = terms.into_iter().filter_map(|term| { | ||
| match term { | ||
| &Term::AddOrRetract(_, e, (a, attribute), ref v) => { | ||
| if attribute.value_type != v.value_type() { |
I suspect that for programmatic usage we should fail quickly rather than slowly; walking every term one more time seems like a worse choice than inserting a check at the point that we construct the AttributedTerm. Thoughts?
I've reworked this to use a trie, and to get rid of AttributedTerm altogether. (I was trying to mix filter_map with Result, which is hard -- hence AttributedTerm. Without filter_map, no need for the transform in the middle.) I've kept the informative errors; it's easy to add a reporting mode that bails more eagerly later.
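(A rough illustration of why dropping `filter_map` simplifies error collection, using made-up `Term`/`TypeError` types rather than Mentat's real ones:)

```rust
// Hedged sketch with made-up Term/TypeError types (not Mentat's): instead of
// mixing `filter_map` with `Result`, a plain loop accumulates every offending
// item while keeping the happy path simple.
#[derive(Debug)]
struct Term { value: i64 }

#[derive(Debug)]
struct TypeError { value: i64 }

fn check_all(terms: &[Term]) -> Result<(), Vec<TypeError>> {
    let mut errors = Vec::new();
    for term in terms {
        if term.value < 0 {
            // Record the problem and keep going, so the caller sees every error.
            errors.push(TypeError { value: term.value });
        }
    }
    if errors.is_empty() { Ok(()) } else { Err(errors) }
}

fn main() {
    let terms = vec![Term { value: 1 }, Term { value: -2 }];
    if let Err(errors) = check_all(&terms) {
        println!("{} malformed term(s): {:?}", errors.len(), errors);
    }
}
```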
| // attribute in a transaction efficiently. | ||
| // | ||
| // - Second, it provides a layer of assurance that the transactor is correct by asserting | ||
| // that a single dato `[e a v]` is added or retracted at most once in a single | ||
| // transaction. | ||
| r#"CREATE UNIQUE INDEX idx_transactions_unique_tx_a_e_v ON transactions (tx, a, e, v)"#, |
Good opportunity to write a disk-level migration as an example, and avoid breaking any test DBs…
Meh, I don't really care to do this. I'm going to leave as is and think more about what indexing we really want post-trie.
| // the same [e a v] twice in one transaction, but we don't want to represent the transacted | ||
| // datom twice in datoms. Hence we IGNORE repeated datoms here. This is a choice: we could | ||
| // also push the uniqueness constraints into the search results themselves, or SELECT DISTINCT | ||
| // from the search results. This exploits the existing UNIQUE indices on the `datoms` table. |
Another approach to doing this is one I remember outlining in #535 — if you build a trie instead of a collection of terms, you have a perfect opportunity to dedupe, and you don't need to do it here. Thoughts?
Yeah, I'm happy to do this. There are a few transformations that would improve the efficiency of this code:
- not collecting `Vec<>` as frequently -- much easier now that `impl Trait` is stable
- turning the main (hard) loop into an iterator, again avoiding a `collect`
- turning `final_terms` and the type/cardinality checking into a single pass that collects a better data structure.
It's not clear that a trie really wins here -- would you be happy with nested maps, like I'm using in the cardinality checking phase?
I mostly mean a trie in the sense of shared prefixing, so yes.
Consider that the transactor always requires a fixed attribute. That becomes our root (just as it is for the cache!).
The entities and values can be lookup refs, functions, tempids, or known values.
We can build, for each attribute, entity to single value mappings for cardinality-one attributes, entity to multiple values mappings for cardinality-many. We can track value to entity reverse mappings for unique attributes. Intra-transaction cardinality violations and duplicates become obvious — they're hash collisions. Uniqueness violations/collisions/smushing, similarly. Perhaps two-tempids-are-the-same is as simple as "oh, the entity for this unique value is another tempid, they must be the same".
I think it should be possible to rephrase the various resolution transitions in the transactor phases as movements between these maps, or replacements of entries with one kind of Either for another.
We lose some amount of clarity, and might get a little bit less type system support, but in exchange we would get different kinds of collisions, and of course prefix compression, for free.
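(For concreteness, a hedged sketch of that shared-prefix shape, with toy stand-ins for `Entid` and `TypedValue`; this is not the transactor's actual data structure:)

```rust
use std::collections::{BTreeMap, BTreeSet};

// Toy stand-ins for Entid and TypedValue.
type Entid = i64;
type Value = String;

#[derive(Default)]
struct AttributeBucket {
    // Cardinality-one: entity -> single value; a second, different value for the
    // same entity is an intra-transaction cardinality conflict.
    single: BTreeMap<Entid, Value>,
    // Cardinality-many: entity -> set of values; duplicate datoms collapse for free.
    multi: BTreeMap<Entid, BTreeSet<Value>>,
    // Unique attributes: value -> entity reverse mapping; a collision here is either
    // a uniqueness violation or evidence that two tempids should be the same entity.
    unique: BTreeMap<Value, Entid>,
}

// The attribute is the root, just as it is for the cache.
type PerAttribute = BTreeMap<Entid, AttributeBucket>;

fn main() {
    let mut trie = PerAttribute::new();
    let bucket = trie.entry(65).or_insert_with(AttributeBucket::default);
    bucket.multi.entry(100).or_insert_with(BTreeSet::new).insert("x".to_string());
    bucket.multi.entry(100).or_insert_with(BTreeSet::new).insert("x".to_string());
    // The repeated datom collapsed: set semantics.
    assert_eq!(bucket.multi[&100].len(), 1);
    // The other maps are unused in this tiny example.
    let _ = (&bucket.single, &bucket.unique);
}
```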
| ]"#); | ||
|
|
||
| // Observe that "foo" and "zot" upsert to the same entid, and that doesn't cause a | ||
| // cardinality constraint, because we treat the input with set semantics and accept |
Nit: constraint -> conflict
| OpType::Retract => &mut retracted, | ||
| }; | ||
| map | ||
| .entry(e).or_insert(BTreeMap::default()) |
| where I: IntoIterator<Item=&'a AttributedTerm<'a>> { | ||
| let mut errors = vec![]; | ||
|
|
||
| let mut added: BTreeMap<Entid, BTreeMap<(Entid, bool), BTreeSet<&TypedValue>>> = BTreeMap::default(); |
You're basically building a trie here. If you slightly tweak your approach, at this point you can:
- Look up attributes, find the value type once, and make sure each incoming value is of the right type.
- Implement the one/many distinction in two tries, using the presence of an `Entry` as a cardinality failure.
- Drop `Term`s at this point and use the tries instead.
It seems wasteful to do all of this work and then not use the output for other things!
Also: it would be worth considering whether a BTreeMap and BTreeSet are the right data structures to use. Perhaps switch between Hash* and BTree*, use the cargo CLI to `.i fixtures/cities.schema` and `.i fixtures/all_seattle.edn`, and see which is faster? I suspect the hash versions will be better.
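(A small sketch of the "presence of an `Entry` as a cardinality failure" idea above, again with toy types rather than Mentat's:)

```rust
use std::collections::btree_map::Entry;
use std::collections::BTreeMap;

type Entid = i64;
type Value = String;

// For a cardinality-one attribute, an occupied entry at insert time means the
// transaction already asserted a value for this (entity, attribute) pair.
fn insert_card_one(
    map: &mut BTreeMap<Entid, Value>,
    e: Entid,
    v: Value,
) -> Result<(), (Entid, Value, Value)> {
    match map.entry(e) {
        Entry::Vacant(slot) => {
            slot.insert(v);
            Ok(())
        }
        // Identical repeats are fine under set semantics; differing values conflict.
        Entry::Occupied(existing) => {
            if existing.get() == &v {
                Ok(())
            } else {
                Err((e, existing.get().clone(), v))
            }
        }
    }
}

fn main() {
    let mut m = BTreeMap::new();
    assert!(insert_card_one(&mut m, 1, "a".into()).is_ok());
    assert!(insert_card_one(&mut m, 1, "a".into()).is_ok()); // duplicate datom, OK
    assert!(insert_card_one(&mut m, 1, "b".into()).is_err()); // cardinality conflict
}
```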
rnewman left a comment
I hesitate to say this is using a sledgehammer to crack a nut, but it kind of is!
My overall fear with this is that the transactor is accruing transformation after transformation — term to term to term, then into maps and maps and graphs to detect duplicates and equivalences, and so on — each of which adds expense, but is in large part present to handle edge cases.
I'm interested to know if we can apply just one of these many hammers to do more of the work. If we pick a different data structure, can we do more work in fewer passes?
| /// Entities that look like: | ||
| /// - [:db/add TEMPID b OTHERID], where b is not :db.unique/identity; | ||
| /// - [:db/add TEMPID b v], where b is not :db.unique/identity. | ||
| /// - [:db/add TEMPID b OTHERID]. where b is not :db.unique/identity; |
| /// are the tempids that appeared in [:db/add ...] entities, but that didn't upsert to existing | ||
| /// entids. | ||
| pub(crate) fn temp_ids_in_allocations(&self) -> BTreeSet<TempIdHandle> { | ||
| /// After evolution is complete, yield the set of tempids that require entid allocation are the |
Finish rewriting this sentence :)
| } | ||
|
|
||
| temp_ids | ||
| // Now we union-find all the known tempids. Two tempids are unioned if they both appears as |
) r=rnewman @mmacedoeu did a good deal of work to show that Arc instead of Rc wasn't too difficult in #656, and @rnewman pushed the refactoring across the line in #659. However, we didn't flip the switch at that time. For #673, we'd like to include TypedValue instances in errors, and with error-chain (and failure) error types need to be 'Sync + 'Send, so we need Arc. This builds on #659 and should also finish #656.
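(A minimal sketch of why the switch matters for error types: `Arc` is `Send + Sync`, so a value captured in an error can satisfy the bounds that error-chain/failure require, whereas `Rc` cannot. Toy code, not Mentat's:)

```rust
use std::sync::Arc;

// With Arc, a value that ends up inside an error type can still satisfy the
// Send + Sync bounds that error-chain/failure demand.
pub type ValueRc<T> = Arc<T>;

fn assert_send_sync<T: Send + Sync>() {}

fn main() {
    // Compiles with Arc<String>; the equivalent assertion for Rc<String> would not,
    // because Rc is neither Send nor Sync.
    assert_send_sync::<ValueRc<String>>();
}
```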
Force-pushed from 86a37a1 to c6dab2b.
@rnewman ok, back to you. This is a little wonky, 'cuz I dropped the part about …
rnewman left a comment
I'm going to need to read this change closely, but figured I'd hit "Request changes" first 'cos you've missed a bunch of the nits I left in the last review — just read through the diff and take a look!
| // We're about to write, so go straight ahead and get an IMMEDIATE transaction. | ||
| let tx = self.sqlite.transaction_with_behavior(TransactionBehavior::Immediate)?; | ||
| // Applying the transaction can fail, so we don't unwrap. | ||
| let details = transact_terms(&tx, self.partition_map.clone(), &self.schema, &self.schema, NullWatcher(), terms, tempid_set)?; |
I think you're missing a watcher argument somewhere… you shouldn't need to create one here.
I looked through the nits pretty closely, and have a final commit addressing the ones I thought needed to be addressed. Can you reflag at least a couple so I can see what I missed?
| // - Second, it provides a layer of assurance that the transactor is correct by asserting | ||
| // that a single datom `[e a v]` is added or retracted at most once in a single | ||
| // transaction. | ||
| r#"CREATE UNIQUE INDEX idx_transactions_unique_tx_a_e_v ON transactions (tx, a, e, v)"#, |
Nit: I think you noted that you wanted to get rid of this.
| // datom twice, so we exploit the idx_transactions_unique_tx_a_e_v on transactions to IGNORE | ||
| // duplicated datoms coming from our search results. This is a choice: we could also ensure | ||
| // that the search results satisfied the uniqueness constraints, or we could SELECT DISTINCT | ||
| // from the search results. We're doing this since we need an index no transactions anyway, so |
| ]"#); | ||
|
|
||
| // Observe that "foo" and "zot" upsert to the same entid, and that doesn't cause a | ||
| // cardinality constraint, because we treat the input with set semantics and accept |
Nit: constraint -> conflict
| /// Entities that look like: | ||
| /// - [:db/add TEMPID b OTHERID], where b is not :db.unique/identity; | ||
| /// - [:db/add TEMPID b v], where b is not :db.unique/identity. | ||
| /// - [:db/add TEMPID b OTHERID]. where b is not :db.unique/identity; |
This should address mozilla#663, by re-inserting type checking in the transactor stack after the entry point used by the term builder.
Before this commit, we were using an SQLite UNIQUE index to assert that no `[e a]` pair, with `a` a cardinality one attribute, was asserted more than once. However, that's not in line with Datomic, which treats transaction inputs as a set and allows a single datom like `[e a v]` to appear multiple times. It's both awkward and not particularly efficient to look for _distinct_ repetitions in SQL, so we accept some runtime cost in order to check for repetitions in the transactor. This will allow us to address mozilla#532, which is really about whether we treat inputs as sets. A side benefit is that we can provide more helpful error messages when the transactor does detect that the input truly violates the cardinality constraints of the schema.
This completes the transition to Datomic-like input-as-set semantics, which allows us to address mozilla#532. Previously, two tempids that upserted to the same entid would produce duplicate datoms, and that would have been rejected by the transactor -- correctly, since we did not allow duplicate datoms under the input-as-list semantics. With input-as-set semantics, duplicate datoms are allowed; and that means that we must allow tempids to be equivalent, i.e., to resolve to the same tempid. To achieve this, we:
- index the set of tempids
- identify tempid indices that share an upsert
- map tempids to a dense set of contiguous integer labels

We use the well-known union-find algorithm, as implemented by petgraph, to efficiently manage the set of equivalent tempids.
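(A hedged sketch of those three steps using petgraph's `UnionFind`; the tempids and the dense indexing here are illustrative only, and it assumes `petgraph` is available as a dependency:)

```rust
extern crate petgraph;

use petgraph::unionfind::UnionFind;
use std::collections::BTreeMap;

fn main() {
    // 1. Index the set of tempids densely.
    let tempids = ["foo", "bar", "zot"];
    let index: BTreeMap<&str, usize> =
        tempids.iter().enumerate().map(|(i, &t)| (t, i)).collect();

    // 2. Union tempids that upserted to the same entid.
    let mut uf: UnionFind<usize> = UnionFind::new(tempids.len());
    uf.union(index["foo"], index["zot"]);

    // 3. Each tempid now maps to a representative label; equivalent tempids share one.
    assert_eq!(uf.find(index["foo"]), uf.find(index["zot"]));
    assert_ne!(uf.find(index["bar"]), uf.find(index["foo"]));
}
```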
I don't know quite what happened here. The Clojure implementation correctly kept complex upserts that hadn't resolved as complex upserts (see https://github.com/mozilla/mentat/blob/9a9dfb502acf5e4cdb1059d4aac831d7603063c8/src/common/datomish/transact.cljc#L436) and then allocated complex upserts if they didn't resolve (see https://github.com/mozilla/mentat/blob/9a9dfb502acf5e4cdb1059d4aac831d7603063c8/src/common/datomish/transact.cljc#L509). Based on the code comments, I think the Rust implementation must have incorrectly tried to optimize by handling all complex upserts in at most a single generation of evolution, and that's just not correct. We're effectively implementing a topological sort, using very specific domain knowledge, and it's not true that a node in a topological sort can be considered only once!
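(An illustrative sketch, not the transactor's algorithm, of why a single generation isn't enough: unresolved items have to be revisited until a pass makes no further progress.)

```rust
use std::collections::BTreeMap;

// Toy fixpoint loop: `pending` maps an item to the item it depends on; resolved
// items map to themselves. We keep sweeping until a pass makes no progress.
fn resolve_to_fixpoint(
    mut pending: BTreeMap<&'static str, &'static str>,
) -> BTreeMap<&'static str, &'static str> {
    loop {
        let mut progressed = false;
        let keys: Vec<_> = pending.keys().cloned().collect();
        for k in keys {
            let dep = pending[k];
            let dep_of_dep = pending[dep];
            if dep != dep_of_dep {
                pending.insert(k, dep_of_dep);
                progressed = true;
            }
        }
        if !progressed {
            return pending;
        }
    }
}

fn main() {
    let mut pending = BTreeMap::new();
    pending.insert("a", "b"); // "a" resolves once "b" does
    pending.insert("b", "c"); // "b" resolves once "c" does
    pending.insert("c", "d"); // "c" resolves once "d" does
    pending.insert("d", "d"); // "d" is already resolved
    let resolved = resolve_to_fixpoint(pending);
    // Considering each node only once would leave "a" short of "d"; iterating to a
    // fixpoint resolves the whole chain.
    assert_eq!(resolved["a"], "d");
}
```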
…l terms. This should be more efficient. It also allows a simpler expression of input-provided :db/txInstant datoms, which in turn uncovered a small issue with the transaction watcher, whereby the watcher would not see non-input-provided :db/txInstant datoms.
Force-pushed from c6dab2b to 854cbf9.
| conflicting_datoms: BTreeMap<(Entid, Entid, TypedValue), ValueType> | ||
| }, | ||
|
|
||
| /// A transaction tried to assert a datoms that don't observe the schema's cardinality constraints. |
| Ok(()) | ||
| }, | ||
| &TypeDisagreements { ref conflicting_datoms } => { | ||
| write!(f, "type disagreements:\n")?; |
writeln! would be clearer.
| &TypeDisagreements { ref conflicting_datoms } => { | ||
| write!(f, "type disagreements:\n")?; | ||
| for (ref datom, expected_type) in conflicting_datoms { | ||
| write!(f, " expected value of type {} but got datom [{} {} {:?}]\n", expected_type, datom.0, datom.1, datom.2)?; |
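(A minimal illustration of the `writeln!` nit above: `writeln!` appends the newline itself, so the trailing `\n` in the format string goes away. Hypothetical snippet, not the PR's actual `Display` impl:)

```rust
use std::fmt::{self, Write};

// Render into any fmt::Write sink; writeln! supplies the trailing newline.
fn render(f: &mut String) -> fmt::Result {
    writeln!(f, "type disagreements:")?;
    writeln!(f, "  expected value of type {} but got datom [{} {} {:?}]", "long", 1, 2, "three")?;
    Ok(())
}

fn main() {
    let mut out = String::new();
    render(&mut out).unwrap();
    assert!(out.ends_with('\n'));
}
```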
| // We need to ensure that callers can't blindly transact entities that haven't been | ||
| // allocated by this store. | ||
|
|
||
| let errors = tx_checking::type_disagreements(&aev_trie); |
I'm a bit surprised that this isn't a side-effect of insertion. After all, you know the attribute at the moment you're finalizing that datom, and so you also know which type the value must be…
This is a choice. As I said (on Slack?) we really want to compute in two different monads: one that exits early and one that collects all errors, for the best consumer experience. I think it's harder to go from one-error to collecting-errors than the reverse, so I implemented the global one first.
| bail!(ErrorKind::SchemaConstraintViolation(errors::SchemaConstraintViolation::TypeDisagreements { conflicting_datoms: errors })); | ||
| } | ||
|
|
||
| let errors = tx_checking::cardinality_conflicts(&aev_trie); |
Same here. You know there are two assertions because you encounter the other one while you're inserting…
| fn extend_aev_trie<'schema, I>(schema: &'schema Schema, terms: I, trie: &mut AEVTrie<'schema>) -> Result<()> | ||
| where I: IntoIterator<Item=TermWithoutTempIds> | ||
| { | ||
| for Term::AddOrRetract(op, KnownEntid(e), a, v) in terms.into_iter() { |
Consider making this a peekable iterator. If the next datom uses the same attribute, save some work. Even better if the next datom also uses the same entity…
I think there will come a time when such micro-optimizations are worth the effort, but I don't think it is now.
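(For reference, a toy sketch of the kind of saving being suggested; the sorted-by-attribute assumption and the `require_attribute_for_entid` mention are illustrative only:)

```rust
type Entid = i64;

// When the incoming terms are sorted by attribute, remembering the last attribute
// id means runs of datoms sharing an attribute need only one schema lookup.
fn count_schema_lookups(terms: &[(Entid, Entid)]) -> usize {
    let mut lookups = 0;
    let mut last_attribute: Option<Entid> = None;
    for &(_e, a) in terms {
        if last_attribute != Some(a) {
            // In the real transactor this is where the attribute would be fetched
            // (e.g. something like require_attribute_for_entid); here we just count.
            lookups += 1;
            last_attribute = Some(a);
        }
        // ... type and cardinality checks would use the cached attribute here ...
    }
    lookups
}

fn main() {
    // Three datoms over two attributes, sorted by attribute: two lookups suffice.
    assert_eq!(count_schema_lookups(&[(1, 65), (2, 65), (3, 66)]), 2);
}
```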
| where I: IntoIterator<Item=TermWithoutTempIds> | ||
| { | ||
| for Term::AddOrRetract(op, KnownEntid(e), a, v) in terms.into_iter() { | ||
| let attribute: &Attribute = schema.require_attribute_for_entid(a)?; |
At this moment you have the Attribute, so you can type-check v. (And if you do so at this point, you can probably not bother keeping the Attribute ref around — you only need it to decide multi/single and FTS/non.)
|
|
||
| /// Ensure that the given terms type check. | ||
| /// | ||
| /// We try to be maximally helpful by yielding every malformed datom, rather than only the first. |
I would rather see this done by receiving an error on insert of each datom and collecting the errors, continuing anyway. It won't catch every error, but it'll catch most, and it's a streaming approach that fails early in the common case.
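(A hedged sketch of that streaming shape, with a made-up `insert_checked` and a trivial stand-in "type check"; errors are collected while insertion continues:)

```rust
#[derive(Debug)]
enum CheckError {
    WrongType { e: i64, a: i64 },
}

fn insert_checked(store: &mut Vec<(i64, i64, i64)>, e: i64, a: i64, v: i64) -> Result<(), CheckError> {
    // Pretend attribute 65 only accepts non-negative values.
    if a == 65 && v < 0 {
        return Err(CheckError::WrongType { e, a });
    }
    store.push((e, a, v));
    Ok(())
}

fn main() {
    let mut store = Vec::new();
    let datoms = [(1, 65, 10), (2, 65, -1), (3, 66, 5)];

    let mut errors = Vec::new();
    for &(e, a, v) in &datoms {
        if let Err(err) = insert_checked(&mut store, e, a, v) {
            // Record the failure but keep inserting the rest.
            errors.push(err);
        }
    }

    println!("inserted {}, rejected {:?}", store.len(), errors);
}
```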
|
|
||
| for (&(a, attribute), evs) in aev_trie { | ||
| for (&e, ref ars) in evs { | ||
| if !attribute.multival && ars.add.len() > 1 { |
I opted to instead have separate single/multi-val data structures when building the cache. Naturally that only works if you don't need to keep multiple values for cardinality-one with late error reporting!
| // for tempids in sorted order. This has the effect of making "a" allocate before "b", | ||
| // which is pleasant for testing. | ||
| for (tempid, tempid_index) in temp_ids { | ||
| let rep = uf.find_mut(tempid_index); |
Does this handle transitive equivalence? That is, if "foo" is equivalent to "bar" because of :foo/x 123, and "bar" is equivalent to "baz" because of :foo/y 456, do "foo", "bar", and "baz" all end up in the same bucket here?
Yes, but let me make sure there's a test of exactly this. You want:
[[:db/add "foo" :foo/x 123]
[:db/add "bar" :foo/y 456]
{:db/id "baz" :foo/x 123 :foo/y 456}]
to unify all of "foo", "bar", "baz"? I guess we should test with those datoms present and not present. I've added a test that covers this.
In this PR you're making it so that
[:db/add "foo" :foo/x 123]
[:db/add "bar" :foo/x 123]
correctly unify; the next step is to make sure that
[:db/add "foo" :foo/x 123]
[:db/add "bar" :foo/x 123]
[:db/add "bar" :foo/y 456]
[:db/add "baz" :foo/y 456]
works. The subsequent insert itself will fail if it's not possible for all three entities to be the same, I think.
I've landed this in 46c2a08. Still not perfect, but better than what we had, and I need to get it out of my queue.
This addresses #532 and should address #663. It's ready for initial review but it's not quite there yet -- part 2 needs additional tests to verify that it really does address #663, and part 3 needs a list of routine tests so that issues are found in a dedicated portion of the test suite, and so that we verify that the error output is as desired.