Schema alteration #370
Conversation
| use edn::NamespacedKeyword;

| impl Schema {
|     pub fn get_ident(&self, x: Entid) -> Option<&NamespacedKeyword> {
For the sake of thoroughness, can we mark all of these as #[inline]?
|     /// Associate the given ident with given entid if neither are already recognized.
|     ///
|     /// Return `true` if an association was added, `false` otherwise.
|     pub fn associate_ident(&mut self, i: NamespacedKeyword, e: Entid) -> bool {
Did you consider returning Result<()>? Result has the advantage of forcing the caller to check the return value…
If I associate an ident with an entid — which I can do, 'cos this is public — that hasn't already been allocated, all hell will break loose, right?
schema.associate_ident(NamespacedKeyword::new("foo", "bar"), 666);
// Now transact ~600 tempids… one of them will magically be given the ident :foo/bar!
Can you validate here that the entid is within the existing part range? Or document that the caller should ensure that it is?
| pub fn associate_ident(&mut self, i: NamespacedKeyword, e: Entid) -> bool {
|     use std::collections::btree_map::Entry;
|     match (self.entid_map.entry(e), self.ident_map.entry(i.clone())) {
|         (Entry::Vacant(entid_entry), Entry::Vacant(ident_entry)) => {
You can get ever so slightly better performance via:
if let Entry::Vacant(entid_entry) = self.entid_map.entry(e) {
    if let Entry::Vacant(ident_entry) = self.ident_map.entry(i.clone()) {
        entid_entry.insert(i);
        ident_entry.insert(e);
        return true;
    }
}
false
— don't bother looking up the second if the first isn't vacant.
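The suggested short-circuit can be sketched as a self-contained program, with `Entid` and `NamespacedKeyword` simplified to `i64` and `String` (the names here are stand-ins, not Mentat's actual types):

```rust
use std::collections::btree_map::{BTreeMap, Entry};

// Simplified stand-ins for the real Entid and NamespacedKeyword types.
type Entid = i64;
type Ident = String;

#[derive(Default)]
struct Schema {
    entid_map: BTreeMap<Entid, Ident>,
    ident_map: BTreeMap<Ident, Entid>,
}

impl Schema {
    /// Associate `i` with `e` only if neither is already known, skipping
    /// the second map lookup entirely when the first entry is occupied.
    fn associate_ident(&mut self, i: Ident, e: Entid) -> bool {
        if let Entry::Vacant(entid_entry) = self.entid_map.entry(e) {
            if let Entry::Vacant(ident_entry) = self.ident_map.entry(i.clone()) {
                entid_entry.insert(i);
                ident_entry.insert(e);
                return true;
            }
        }
        false
    }
}

fn main() {
    let mut schema = Schema::default();
    assert!(schema.associate_ident("db/ident".to_string(), 1));
    // A second association for an occupied entid (or ident) is rejected.
    assert!(!schema.associate_ident("db/other".to_string(), 1));
    assert!(!schema.associate_ident("db/ident".to_string(), 2));
}
```

Note that the nested `entry` calls borrow disjoint fields (`entid_map` and `ident_map`), which the borrow checker permits.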
| pub fn update_ident(&mut self, i: NamespacedKeyword, e: Entid) -> bool {
|     use std::collections::btree_map::Entry;
|     if let Entry::Occupied(mut entid_entry) = self.entid_map.entry(e) {
|         // TODO: avoid clone?
To avoid the clone here you can use the return value of remove to decide whether to return false. You don't do anything with the second entry beyond making sure that it's occupied before removing it; remove returns None if there was nothing to remove.
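A minimal sketch of that clone-free shape, using the same simplified maps as above (`i64` entids and `String` idents stand in for the real types):

```rust
use std::collections::BTreeMap;

// Simplified stand-ins for the real Entid and NamespacedKeyword types.
type Entid = i64;
type Ident = String;

#[derive(Default)]
struct Schema {
    entid_map: BTreeMap<Entid, Ident>,
    ident_map: BTreeMap<Ident, Entid>,
}

impl Schema {
    /// Re-point `e` at the new ident `i`, returning `false` if `e` had no
    /// previous ident. `remove` both tests occupancy and yields the old
    /// keyword, so no up-front clone is needed.
    fn update_ident(&mut self, i: Ident, e: Entid) -> bool {
        match self.entid_map.remove(&e) {
            Some(old_ident) => {
                // Drop the stale reverse mapping, then install the new pair.
                self.ident_map.remove(&old_ident);
                self.entid_map.insert(e, i.clone());
                self.ident_map.insert(i, e);
                true
            }
            // `remove` returns None if there was nothing to remove.
            None => false,
        }
    }
}

fn main() {
    let mut schema = Schema::default();
    schema.entid_map.insert(100, "test/ident".to_string());
    schema.ident_map.insert("test/ident".to_string(), 100);
    assert!(schema.update_ident("test/updated".to_string(), 100));
    assert!(!schema.ident_map.contains_key("test/ident"));
    // Unknown entids leave the maps untouched.
    assert!(!schema.update_ident("test/other".to_string(), 999));
}
```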
| lazy_static! {
|     pub static ref ATTRIBUTE_SET: String = {
|         // let mut stmt = self.prepare_cached("SELECT e, a, v, value_type_tag, added FROM transactions WHERE tx = ? AND a IN
|     }
| impl AttributeBuilder {
|     pub fn value_type<'a>(&'a mut self, value_type: ValueType) -> &'a mut Self {
Why mutable references rather than ownership? Surely the model is
let foo = AttributeBuilder::new().value_type(ValueType::Boolean).build();
which can be modeled as
pub fn value_type(self, value_type: ValueType) -> Self {
no?
This is not being consumed in a fluent interface (like Builder::new().thing1().thing2().build()), but in a loop that grabs the builder reference, does a thing, and then finally builds everything at the end. It's equivalent to consume self and update the &mut Builder in the map entry, I guess.
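To illustrate the trade-off, here is a hypothetical simplified builder (the fields and types are invented for illustration) showing both the consuming fluent chain and the take-and-put-back dance a consuming API forces when you only hold `&mut Builder` inside a map:

```rust
use std::collections::BTreeMap;

// Hypothetical simplified builder; the real AttributeBuilder has more fields.
#[derive(Default, Debug, PartialEq)]
struct AttributeBuilder {
    value_type: Option<&'static str>,
    multival: bool,
}

impl AttributeBuilder {
    // Consuming, fluent style: each method takes `self` and returns `Self`.
    fn value_type(mut self, value_type: &'static str) -> Self {
        self.value_type = Some(value_type);
        self
    }
    fn multival(mut self, multival: bool) -> Self {
        self.multival = multival;
        self
    }
}

fn main() {
    // The fluent chain works naturally when `self` is consumed…
    let a = AttributeBuilder::default().value_type("boolean").multival(true);

    // …but in a loop over a map of builders you only hold `&mut Builder`,
    // so a consuming method forces taking the value out and putting it back.
    let mut builders: BTreeMap<i64, AttributeBuilder> = BTreeMap::new();
    let b = builders.entry(1).or_insert_with(AttributeBuilder::default);
    *b = std::mem::take(b).value_type("boolean").multival(true);
    assert_eq!(builders[&1], a);
}
```

With `&mut self` methods, the loop body is just `b.value_type("boolean").multival(true);`, which is why the patch keeps mutable references.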
| /// Return `false` if the given reduced entity will not change the metadata: recognized idents,
| /// schema, partitions in the partition map.
| pub fn might_update_metadata(reduced: &db::ReducedEntity) -> bool {
I'd suggest marking this as #[inline], see if it speeds up the transactor tight loop.
|     // Idents.
|     entids::DB_IDENT |
|     // Schema.
|     entids::DB_ALTER_ATTRIBUTE |
Could you put these in numeric order, with the numeric value from entids.rs alongside?
I'd also be interested in an optimization here: hard-code the range in entids.rs (1–35, I think), and bail early if the value isn't in that range. No point painstakingly comparing 220 to 11 different small integers only to return false. The vast majority of assertions won't be in this range, so I'm keen to make this function ~free in the common case.
I'd be happy to optimize this eventually. I believe that Rust implements this via a jump table in some form, so it's ~free already.
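A sketch of the proposed early-bail range check. The entid values here are purely illustrative (the real values live in entids.rs; 1–35 is the range assumed in the comment above):

```rust
// Assumed bounds of the metadata attribute range from entids.rs.
const FIRST_METADATA_ENTID: i64 = 1;
const LAST_METADATA_ENTID: i64 = 35;

// Illustrative entid values only; the real constants are in entids.rs.
const DB_IDENT: i64 = 1;
const DB_ALTER_ATTRIBUTE: i64 = 19;

fn might_update_metadata(a: i64) -> bool {
    // The vast majority of assertions fall outside the metadata range, so a
    // single pair of comparisons makes the common case nearly free.
    if a < FIRST_METADATA_ENTID || a > LAST_METADATA_ENTID {
        return false;
    }
    match a {
        DB_IDENT | DB_ALTER_ATTRIBUTE => true,
        _ => false,
    }
}

fn main() {
    assert!(!might_update_metadata(220)); // common case: bails on the range check
    assert!(might_update_metadata(DB_IDENT));
    assert!(!might_update_metadata(2)); // in range, but not a metadata attribute
}
```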
| for (e, a, typed_value, added) in assertions.into_iter() {
|     let attribute = schema.require_attribute_for_entid(a)?;
|     // XXX TODO NOTE
| let added = op == OpType::Add;
| let reduced = (e, a, attribute, v, added);
| if metadata::might_update_metadata(&reduced) {
This function doesn't use most of the arguments. Can you just pass e by value instead?
@rnewman OK, this is ready for review. This shares very little code with what you saw before; I hope I captured the spirit of (or explicitly addressed) your major concerns, but I apologize if I'm missing stuff in the churn. I've been working on this too long without landing pieces, for which I apologise. Turned out to be a complex thing to get right.
rnewman
left a comment
I'm not entirely sure I understand your division of entids vs. idents — I would expect both idents and entids to be scoped to current data, and the lookup in either direction to be a partial injective function. As I understand that part, you've made keyword → entid partial and injective, but entid → keyword partial and multivalued. Am I misreading that commit message?
| let typed_value = TypedValue::from_sql_value_pair(v, value_type_tag)?;
| let typed_value = match TypedValue::from_sql_value_pair(v, value_type_tag)? {
|     TypedValue::Ref(e) => {
|         borrowed_schema.get_ident(e).cloned().map(TypedValue::Keyword).unwrap_or(TypedValue::Ref(e))
You do this twice, so why not add a trait to TypedValue that ident-maps with a provided schema?
trait ToIdent {
    fn map_ident(self, schema: &Schema) -> Self;
}
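A possible implementation of that trait, with simplified stand-ins for Mentat's `TypedValue` and `Schema` (only the `Ref` and `Keyword` variants are modeled here):

```rust
use std::collections::BTreeMap;

// Simplified stand-ins for Mentat's TypedValue and Schema types.
#[derive(Clone, Debug, PartialEq)]
enum TypedValue {
    Ref(i64),
    Keyword(String),
}

struct Schema {
    entid_map: BTreeMap<i64, String>,
}

impl Schema {
    fn get_ident(&self, e: i64) -> Option<&String> {
        self.entid_map.get(&e)
    }
}

trait ToIdent {
    fn map_ident(self, schema: &Schema) -> Self;
}

impl ToIdent for TypedValue {
    // Replace a Ref with its keyword when the schema knows one;
    // otherwise leave the value untouched.
    fn map_ident(self, schema: &Schema) -> Self {
        match self {
            TypedValue::Ref(e) => schema
                .get_ident(e)
                .cloned()
                .map(TypedValue::Keyword)
                .unwrap_or(TypedValue::Ref(e)),
            other => other,
        }
    }
}

fn main() {
    let mut schema = Schema { entid_map: BTreeMap::new() };
    schema.entid_map.insert(100, "db/ident".to_string());
    assert_eq!(
        TypedValue::Ref(100).map_ident(&schema),
        TypedValue::Keyword("db/ident".to_string())
    );
    // Unknown entids pass through unchanged.
    assert_eq!(TypedValue::Ref(7).map_ident(&schema), TypedValue::Ref(7));
}
```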
| } else {
|     if let Some(asserted_value) = self.asserted.remove(&key) {
|         self.altered.insert(key, (value, asserted_value));
I don't understand this. If there was an asserted value, and we get a removal, store an alteration for the original value instead?
Oh, I see. If you get the addition before the retraction, retract the old value and assert the new one.
| /// Witness assertions and retractions, folding (assertion, retraction) pairs into alterations.
| /// Assumes that no assertion or retraction will be witnessed more than once.
| #[derive(Clone, Debug, Eq, Hash, Ord, PartialOrd, PartialEq)]
| pub struct DiffSet<K, V> {
This looks like a much more general data structure than it actually is! I was having trouble reasoning about what happens in some orderings, what happens if there is or isn't assumed to be data outside… but all it really does is filter out and categorize pairs, so given [assert(2), retract(1)] you don't get [], you get [alter(1, 2)].
Could you add a little more commentary and/or a test to explain what happens for various inputs?
Correct. I will certainly elaborate and add a few tests. It was just a thing I needed twice -- for attributes and for idents. Feel free to suggest another name, if you prefer.
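A minimal sketch of the described folding behavior, sufficient to show what various input orderings produce (field names follow the diff above; everything else is simplified):

```rust
use std::collections::BTreeMap;

// Minimal sketch of the described DiffSet: fold (assert, retract) pairs
// on the same key into alterations, regardless of arrival order.
#[derive(Default, Debug)]
struct DiffSet<K: Ord, V> {
    asserted: BTreeMap<K, V>,
    retracted: BTreeMap<K, V>,
    altered: BTreeMap<K, (V, V)>, // (old value, new value)
}

impl<K: Ord + Clone, V> DiffSet<K, V> {
    fn witness_assert(&mut self, key: K, value: V) {
        if let Some(retracted_value) = self.retracted.remove(&key) {
            self.altered.insert(key, (retracted_value, value));
        } else {
            self.asserted.insert(key, value);
        }
    }
    fn witness_retract(&mut self, key: K, value: V) {
        if let Some(asserted_value) = self.asserted.remove(&key) {
            // The retraction names the old value; the earlier assertion
            // named the new one.
            self.altered.insert(key, (value, asserted_value));
        } else {
            self.retracted.insert(key, value);
        }
    }
}

fn main() {
    // [assert(2), retract(1)] on the same key folds into alter(1 -> 2).
    let mut d: DiffSet<&str, i64> = DiffSet::default();
    d.witness_assert("x", 2);
    d.witness_retract("x", 1);
    assert!(d.asserted.is_empty() && d.retracted.is_empty());
    assert_eq!(d.altered["x"], (1, 2));
}
```

The same fold happens with the opposite ordering: a retraction witnessed first sits in `retracted` until a later assertion on the same key promotes the pair into `altered`.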
| }

| for &entid in &metadata_report.attributes_installed {
|     conn.execute(format!("INSERT INTO schema SELECT e, a, v, value_type_tag FROM datoms WHERE e = ? AND a IN {}", entids::SCHEMA_SET.as_str()).as_str(),
Prepared statements inside loops, please!
In general, I'm for cached statements (especially since Rusqlite makes them easy to use) but I intentionally didn't optimize this process. I expect it to be infrequent ... oh, prepared vs. cached. Prepared I will do! This is not efficient at this time because walking transactions in this way is very expensive but it's easy to improve and I'll file follow-up to track that.
Sorry, in that comment I reversed the directions; I think you get my gist, though. With historical idents stored in the schema, how do we recognize the current mapping?
I think you are. I think the following example will be instructive, and I'll fold it into a comment or commit message once we have shared understanding. Suppose I transact a single ident:

[[:db/add 100 :db/ident :test/ident]]

Then both the "entids" and "idents" materialized views will look (symbolically) like:

[[100 :db/ident :test/ident]]

If I now transact the following datoms:

[[:db/retract 100 :db/ident :test/ident]
 [:db/add 100 :db/ident :test/updated]]

the "entids" materialized view will look like:

[[100 :db/ident :test/updated]]

but the "idents" materialized view will maintain the historical values, like:

[[100 :db/ident :test/ident]
 [100 :db/ident :test/updated]]

In terms of the resulting maps,

This is what Datomic does, and I think it's the right call. It means that when the

Is that more clear?
I think so. I think I have two points of friction:
@rnewman sorry about the force-push. This gets us to the following place:
I'll file follow-up for the last two if I don't do them as part of the next review cycle. Sorry for all this churn!
| }

| #[derive(Clone,Debug,Eq,Hash,Ord,PartialOrd,PartialEq)]
| pub enum Unique {
Can we call this AttributeUniqueness or something more specific? This is in core.
Or a submodule for scoping? attribute::Unique::Value?
Thanks for correcting me on this; I should have known better. I've pushed it into an attribute submodule.
| if next_schema.is_some() {
|     // TODO Use custom ErrorKind https://github.com/brson/error-chain/issues/117
|     bail!(ErrorKind::NotYetImplemented(format!("Initial bootstrap transaction did not produce expected bootstrap schema")));

| let bootstrap_schema_for_mutation = Schema::default(); // The bootstrap transaction will populate this schema.

| fn read_schema_map(conn: &rusqlite::Connection) -> Result<SchemaMap> {
|     let entid_triples = read_materialized_view(conn, "schema")?;
|     let mut schema_map = SchemaMap::default();
|     metadata::update_schema_map_from_entid_triples(&mut schema_map, entid_triples)?;
Can you push this into a method on SchemaMap? SchemaMap::with_entid_triples?
I could, but I want the mutation form so it's &mut self in general. There's a symmetry in having the two updating functions close together (one operating on SchemaMap, the other on Schema) so I'm going to keep them in metadata.rs until we have more changes to make.
| &[])?;
| // Take the last, as determined by tx, added [e :db/ident v _ true]. This will pick up
| // :db/idents that have been retracted and/or re-purposed.
| conn.execute(format!("INSERT INTO idents SELECT e, a, v, value_type_tag FROM datoms WHERE a IN {}", entids::IDENTS_SET.as_str()).as_str(),
Careful: I suspect that this is implemented as a table walk with a filter, at least for the case of non-unit sets, which is startlingly expensive. You probably want to do this differently in the short or medium term.
Oh, and filtering on value_type_tag to keyword is something you might consider.
I expected this to be efficient because there's an index on e a v value_type_tag. Do you think it would be faster to have a bunch of WHERE a = ? OR a = ? clauses? A little playing with SQLite suggests the query plans are the same:
sqlite> EXPLAIN QUERY PLAN INSERT INTO idents SELECT e, a FROM datoms WHERE a = 0 OR a = 1;
0|0|0|SEARCH TABLE datoms USING COVERING INDEX idx_datoms_aevt (a=?)
0|0|0|EXECUTE LIST SUBQUERY 1
sqlite> EXPLAIN QUERY PLAN INSERT INTO idents SELECT e, a FROM datoms WHERE a IN (0, 1);
0|0|0|SEARCH TABLE datoms USING COVERING INDEX idx_datoms_aevt (a=?)
0|0|0|EXECUTE LIST SUBQUERY 1
Feel free to file follow-up if you're not convinced.
Query plan trumps vague suspicions!
| // TODO: use concat! to avoid creating String instances.
| if !metadata_report.idents_altered.is_empty() {
|     // Idents is the materialized view of the [entid :db/ident ident _ added] slice of
|     // transactions. This keeps outdated idents around, just like Datomic does.
I don't think this comment is still accurate.
| let mut insert_stmt = conn.prepare(format!("INSERT INTO schema SELECT e, a, v, value_type_tag FROM datoms WHERE e = ? AND a IN {}", entids::SCHEMA_SET.as_str()).as_str())?;
| let mut index_stmt = conn.prepare("UPDATE datoms SET index_avet = ? WHERE a = ?")?;
| let mut unique_value_stmt = conn.prepare("UPDATE datoms SET unique_value = ? WHERE a = ?")?;
| let mut cardinality_stmt = conn.prepare("SELECT e FROM datoms WHERE a = ? GROUP BY e HAVING COUNT(*) > 1 LIMIT 1")?;
I might rephrase this to something potentially cheaper:
SELECT EXISTS (
SELECT 1 FROM datoms AS one, datoms AS two
WHERE
one.a = ?1 AND two.a = ?1 AND
one.e = two.e AND
one.v <> two.v
)
AS dupes;
Sure, thanks for the suggestion.
| };

| /// Attributes that are "schema related". These might change the "schema" materialized view.
| pub static ref SCHEMA_SET: String = {
Yikes. Call out that these are SQL lists, not Rust sets!
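For clarity, a hypothetical helper illustrating what these "sets" actually are — SQL IN-list strings, not Rust collections (the function name is invented for illustration):

```rust
// Hypothetical sketch: build a SQL IN-list string such as "(1, 10, 11)"
// from a slice of entids. The *_SET statics are strings of this shape.
fn sql_list(entids: &[i64]) -> String {
    let body = entids
        .iter()
        .map(|e| e.to_string())
        .collect::<Vec<_>>()
        .join(", ");
    format!("({})", body)
}

fn main() {
    // Such a string is spliced directly into "… WHERE a IN {}" queries.
    assert_eq!(sql_list(&[1, 10, 11]), "(1, 10, 11)");
}
```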
This is consistent with the capitalization (which is "valueType") and the other identifier.
Datomic (and eventually Mentat) don't allow retracting :db/ident in this way, so this runs afoul of future work to support mutating metadata.
This looks ahead to a time when we could support arbitrary user-defined materialized views. For now, the "idents" materialized view is those datoms of the form [e :db/ident :namespaced/keyword] and the "schema" materialized view is those datoms of the form [e a v] where a is in a particular set of attributes that will become clear in the following commits. This change is not backwards compatible, so I'm removing the open current (really, v2) test. It'll be re-instated when we get to mozilla#194.
In order to support historical idents, we need to distinguish the "current" map from entid -> ident from the "complete historical" map from ident -> entid. This is what Datomic does; in Datomic, an ident is never retracted (although it can be replaced). This approach is an important part of allowing multiple consumers to share a schema fragment as it migrates forward. This fixes a limitation of the Clojure implementation, which did not handle historical idents across knowledge base close and re-open. The "entids" materialized view is naturally a slice of the "datoms" table. The "idents" materialized view is a slice of the "transactions" table. I hope that representing things in this way, and casting the problem in this light, might generalize to future materialized views.
This is just to keep track of the expected changes during bootstrapping. I want bootstrap metadata mutations to flow through the same code path as metadata mutations during regular transactions; by differentiating the schema used for interpretation from the schema that will be updated I expect to be able to apply bootstrap metadata mutations to an empty schema and have things like materialized views created (using the regular code paths). This commit has been re-ordered for conceptual clarity, but it won't compile because it references the metadata module. It's possible to make it compile -- the functionality is there in the schema module -- but it's not worth the rebasing effort until after review (and possibly not even then, since we'll squash down to a single commit to land).
I haven't taken your review comment about consuming AttributeBuilder during each fluent function. If you read my response and still want this, I'm happy to do it in review.
This "loops" the committed datoms out of the SQL store and back
through the metadata (schema, but in future also partition map)
processor. The metadata processor updates the schema and produces a
report of what changed; that report is then used to update the SQL
store. That update includes:
- the materialized views ("entids", "idents", and "schema");
- if needed, a subset of the datoms themselves (as flags change).
I've left a TODO for handling attribute retraction in the cases where
it makes sense. I expect that to be straightforward.
Also adds a little more commentary and a simple test.
…from idents." This reverts commit 23a91df. Following our discussion, this removes the "entids" materialized view. The next commit will remove historical idents from the "idents" materialized view.
This is not necessary, but it was suggested that we might be paying an overhead creating Err instances while using error_chain. That seems not to be the case, but this change shows that we don't actually use any of the Result helper methods, so there's no reason to overload Result. This change might avoid some future confusion, so I'm going to land it anyway. Signed-off-by: Nick Alexander <nalexander@mozilla.com>
These tests fail due to a Datomic limitation, namely that the marker flag :db.alter/attribute can only be asserted once for an attribute! That is, [:db.part/db :db.alter/attribute :attribute] will only be transacted at most once. Since older versions of Datomic required the :db.alter/attribute flag, I can only imagine they either never wrote :db.alter/attribute to the store, or they handled it specially. I'll need to remove the marker flag system from Mentat in order to address this fundamental limitation.
I was finding it very difficult to track unwrapping errors while making changes, due to an underlying Mac OS X symbolication issue that makes running tests with RUST_BACKTRACE=1 so slow that they all time out.
I had this all working... except we will never see a repeated
`[:db.part/db :db.alter/attribute :attribute]` assertion in the store!
That means my approach would let you alter an attribute at most one
time. It's not worth hacking around this; it's better to just stop
expecting (and recognizing) the marker flags. (We have all the data
to distinguish the various cases that we need without the marker
flags.)
This brings Mentat in line with the thrust of newer Datomic versions,
but isn't compatible with Datomic, because (if I understand correctly)
Datomic automatically adds :db.{install,alter}/attribute assertions to
transactions.
I haven't purged the corresponding :db/ident and schema fragments just
yet:
- we might want them back
- we might want them in order to upgrade v1 and v2 databases to the
new on-disk layout we're fleshing out (v3?).
This patch avoids a potential bug with the "schema" materialized view.
If :db/unique :db.unique/value implies :db/index true, then what
happens when you _retract_ :db.unique/value? I think Datomic defines
this in some way, but I really want the "schema" materialized view to
be a slice of "datoms" and not have these sort of ambiguities and
persistent effects. Therefore, to ensure that we don't retract a
schema characteristic and accidentally change more than we intended
to, this patch stops having any schema characteristic imply any other
schema characteristic(s). To achieve that, I added an
Option<Unique::{Value,Identity}> type to Attribute; this helps with
this patch, and also looks ahead to when we allow retracting
:db/unique attributes.
The tests use strings, so they hide the chained errors which do in fact provide more detail.
This might be faster in practice.
This is WIP towards #294 and #295 and the equivalent for
:db/ident. Not quite ready for review yet.