[tx] Start implementing bulk SQL insertion algorithms by ncalexan · Pull Request #214 · mozilla/mentat

ncalexan · 2017-01-30T15:48:59Z

This will eventually grow to implement https://github.com/mozilla/mentat/wiki/Transacting:-entity-to-SQL-translation.

This is a slightly simpler re-expression of the existing Clojure implementation.

ncalexan · 2017-02-01T23:26:02Z

This is ready for review. As written it doesn't quite match the Wiki, but the spirit is identical. I'll try to expand on the approach and patch the Wiki as I go along.

ncalexan · 2017-02-01T23:26:38Z

I've flagged the world for review, so lay it on me, folks.

jsantell

Some comments; I cannot speak for correctness, however.

jsantell · 2017-02-02T00:56:11Z

 # TODO: don't depend on num and ordered-float; expose helpers in edn abstracting necessary constructors.
 num = "0.1.35"
 ordered-float = "0.3.0"
+time = "0.1.35"


Is there any way to keep these deps (and other modules) in sync with the top-level dependencies?

I'm having this trouble, too.

In theory you can do:

[dependencies.ordered-float]

and it'll use whichever one is already in use. Except that just broke for me on Travis, so instead I have to pin very, very carefully. We just need to rely on compiler errors if we get this wrong.

jsantell · 2017-02-02T00:57:19Z

    Ok(DB::new(partition_map, schema))
 }

+use itertools;


Is it common to have these use and type statements in the middle of a file? Also, in general, this is getting pretty long, maybe break it out into a new file?

type more so than use.

I moved the type definitions to types and lifted the use statements to the top of the file. These were just oversights.

Also, re: file length -- yes, this is long. I have Grand Plans for a split between the DB layer and the transaction processing layer but I don't want to make the split until more of the transactor is in place. So eventually I expect two files of similar size, but right now it's just one big file. Sorry!

jsantell · 2017-02-02T00:59:25Z

+                                                       .chain(once(to_bool_ref(index_avet) as &ToSql)
+                                                              .chain(once(to_bool_ref(index_vaet) as &ToSql)
+                                                                     .chain(once(to_bool_ref(index_fulltext) as &ToSql)
+                                                                            .chain(once(to_bool_ref(unique_value) as &ToSql))))))))))


good

golly

miss

molly

I've filed #261 to do better. Please comment there, take, or mentor that ticket!

jsantell · 2017-02-02T01:00:17Z

+    fn search(&self, conn: &rusqlite::Connection) -> Result<()> {
+        // First is fast, only one table walk: lookup by exact eav.
+        // Second is slower, but still only one table walk: lookup old value by ea.
+        let s = r#"


Would it help with clarity/size to pull out these SQL strings into their own module as well?

Generally I have found not, since it adds a level of indirection. But Firefox does this for some modules, including Places IIRC. I'd like to leave them more or less inline until they're more concrete, or truly unmanageable.

Most of Firefox on all platforms keeps queries inline in code. Reasons:

They're almost never reused. If they are, it makes the code hard to change later — the "beginners copy and paste; professionals refactor; veterans copy and paste" thing.

Keeping them elsewhere makes it harder to understand the code, losing context.

It makes it harder to delete unused queries if they're divorced from their calling code.

ncalexan · 2017-02-02T01:51:11Z

        vec![(":db.part/db", 0, (1 + V2_IDENTS.len()) as i64),
-             (":db.part/user", 0x10000, 0x10000),
-             (":db.part/tx", 0x10000000, 0x10000000),
+             (":db.part/user", TX0, TX0),


Oops! :db.part/user should remain untouched. Fixed locally.

rnewman

Incomplete review. More tomorrow.

rnewman · 2017-02-02T02:42:56Z

 lazy_static = "0.2.2"
 # TODO: don't depend on num and ordered-float; expose helpers in edn abstracting necessary constructors.
 num = "0.1.35"
 ordered-float = "0.3.0"


I just bumped this to 0.4.0 in a6659ae. You might want to do the same if this lands after.

rnewman · 2017-02-02T03:13:03Z

 use rusqlite::types::{ToSql, ToSqlOutput};
+use time;

+use {repeat_values, to_namespaced_keyword};


What does this use syntax mean?!

It's equivalent to use ::{SYMBOL}, which I have replaced it with.

rnewman

More progress!

rnewman · 2017-02-03T17:39:35Z

+    pub fn resolve_avs<'a>(&self, conn: &rusqlite::Connection, avs: &'a [AVPair]) -> Result<AVMap<'a>> {
+        // Start search_id's at some identifiable number.
+        let initial_search_id = 2000;
+        let values_per_statement = 4;


bindings_per_statement

rnewman · 2017-02-03T17:40:09Z

+        // produce the map [a v] -> e.
+        //
+        // TODO: `collect` into a HashSet so that any (a, v) is resolved at most once.
+        let chunks: itertools::IntoChunks<_> = avs.into_iter().enumerate().chunks(::SQLITE_MAX_VARIABLE_NUMBER / 4);


/ bindings_per_statement
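A minimal sketch of what dividing by `bindings_per_statement` buys: the chunk size is derived from the number of bindings each tuple consumes instead of a literal `4`. This uses the standard library's slice `chunks` rather than itertools, and `chunk_for_bindings` is a hypothetical helper name, not something from the patch.

```rust
/// Assumed compile-time default, mirroring the constant added in this PR.
const SQLITE_MAX_VARIABLE_NUMBER: usize = 999;

/// Split `items` into chunks small enough that `bindings_per_statement`
/// bindings per item never exceed the variable limit for one statement.
/// (Hypothetical helper, for illustration only.)
fn chunk_for_bindings<T>(items: &[T], bindings_per_statement: usize) -> Vec<&[T]> {
    // Guard against a zero divisor and against tuples too wide to fit at all.
    assert!(bindings_per_statement > 0);
    assert!(bindings_per_statement <= SQLITE_MAX_VARIABLE_NUMBER);
    let tuples_per_chunk = SQLITE_MAX_VARIABLE_NUMBER / bindings_per_statement;
    items.chunks(tuples_per_chunk).collect()
}
```

With 4 bindings per tuple this allows 999 / 4 = 249 tuples per statement, so 1000 items are split into 5 chunks.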

rnewman · 2017-02-03T17:43:26Z

+            }).collect();
+
+            // TODO: cache these statements for selected values of `count`.
+            let values: String = repeat_values(values_per_statement, count);


Before this line:

assert!((values_per_statement * count) < ::SQLITE_MAX_VARIABLE_NUMBER);

This is asserted in repeat_values, so it's in place for all consumers.

rnewman · 2017-02-03T17:44:34Z

+            let values: String = repeat_values(values_per_statement, count);
+            let s: String = format!("WITH t(search_id, a, v, value_type_tag) AS (VALUES {}) SELECT t.search_id, d.e \
+                                     FROM t, all_datoms AS d \
+                                     WHERE d.index_avet IS NOT 0 AND d.a = t.a AND d.value_type_tag = t.value_type_tag AND d.v = t.v", values);


Put values on a line by itself. It's wider than GitHub can show, and it's an important thing!

Add a TODO about using something other than all_datoms. We know all the as, so we can turn this into two queries against the individual datoms tables, connected with a UNION ALL. In the common case, where most unique attributes will not be fulltext-indexed, we'll be querying just datoms.

rnewman · 2017-02-03T17:49:26Z

+        let results: Vec<(i64, Entid)> = results?.as_slice().concat();
+
+        // Create map [a v] -> e.
+        let m: HashMap<&'a AVPair, Entid> = results.into_iter().map(|(search_id, entid)| {


Can we not simply accumulate directly into a mutable HashMap as we walk the chunk iter? Seems silly to collect results into nested vecs, then flatten them, then walk them to produce a map…

Yeah, there's a bunch of ways to phrase this. I want it this way to demonstrate the chunking technique with higher order functions (partly for myself!). I'm going to keep it as it is for now, but I've filed #262 to do better eventually.
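For comparison, a rough sketch of the direct-accumulation phrasing suggested above. The `lookup` closure is a hypothetical stand-in for the per-chunk SQL query, and the row shape (index within the chunk, entid) is an assumption made for this example; the patch itself uses search IDs.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Walk `avs` in chunks and insert each resolved pair straight into one map,
/// returning early on the first failed chunk. `lookup` is a hypothetical
/// stand-in for the SQL query: it yields (index within chunk, entid) rows.
/// `chunk_size` must be non-zero, since `chunks` panics on zero.
fn resolve_into_map<'a, K, E, F>(
    avs: &'a [K],
    chunk_size: usize,
    mut lookup: F,
) -> Result<HashMap<&'a K, i64>, E>
where
    K: Hash + Eq,
    F: FnMut(&'a [K]) -> Result<Vec<(usize, i64)>, E>,
{
    let mut map = HashMap::new();
    for chunk in avs.chunks(chunk_size) {
        // No intermediate Vec<Vec<_>>: rows go into the map as they arrive.
        for (index, entid) in lookup(chunk)? {
            map.insert(&chunk[index], entid);
        }
    }
    Ok(map)
}
```

The trade-off is the one discussed in this thread: the loop avoids the nested vectors and the flattening step, at the cost of a mutable accumulator instead of a pure iterator chain.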

rnewman · 2017-02-03T18:12:34Z

-        let x: Result<Vec<()>> = r.into_iter().collect();
-        x.map(|_| ())
+        let r: Result<Vec<()>> = r.into_iter().collect();
+        r?;


Sadly, the "simplification" is the turbofish (::<>) which is even harder to read.
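To make the comparison concrete, here are the two phrasings side by side as a small sketch. The function names are hypothetical and the error type is left generic; the patch uses its own `Result` alias.

```rust
/// Succeed only if every step succeeds, using the turbofish form.
fn all_ok_turbofish<I, E>(steps: I) -> Result<(), E>
where
    I: IntoIterator<Item = Result<(), E>>,
{
    // The target type is spelled inline on `collect`.
    steps.into_iter().collect::<Result<Vec<()>, E>>()?;
    Ok(())
}

/// Same behaviour with an annotated binding, as in the diff above.
fn all_ok_binding<I, E>(steps: I) -> Result<(), E>
where
    I: IntoIterator<Item = Result<(), E>>,
{
    let r: Result<Vec<()>, E> = steps.into_iter().collect();
    r?;
    Ok(())
}
```

Both stop at the first `Err` and return it. On current Rust, collecting straight into `Result<(), E>` should also work, because `()` implements `FromIterator<()>`, which avoids building the `Vec<()>` at all.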

rnewman · 2017-02-03T18:16:02Z

+///
+/// The datom set returned does not include any datoms of the form [... :db/txInstant ...].
+pub fn datoms_after(conn: &rusqlite::Connection, db: &DB, tx: i64) -> Result<Datoms> {
+    let mut stmt: rusqlite::Statement = conn.prepare("SELECT e, a, v, value_type_tag, tx FROM datoms WHERE tx > ? ORDER BY e ASC, a ASC, v ASC, tx ASC")?;


WHERE tx > ? AND a IS NOT ?…", &[&tx, entids::DB_TX_INSTANT]? Might as well avoid all of those rows…

I thought about this, and decided to do it as post-processing. There's other rewriting happening and I think the final form of these test functions might look a lot like a pattern matcher on edn::Values, in which case I don't want to filter at all.

This is really just a matter of taste, and I want to be as close to the actual DB contents in memory as I can be -- for now.

rnewman · 2017-02-03T18:16:14Z

+        let a: i64 = row.get_checked(1)?;
+
+        if a == entids::DB_TX_INSTANT {
+            return Ok(None);


… so you don't need to do this.

It also occurs to me: we had better make sure that users never assign a new entity ID to any of the builtins!

Yes, although if you footgun yourself you footgun yourself :) I think the route to only letting the transactor allocate entids is via #190.

rnewman · 2017-02-03T18:17:30Z

+        Ok(Some(Datom {
+            e: to_entid(db, e),
+            a: to_entid(db, a),
+            v: value,


I wonder if it's worth keeping the TypedValue wrapper here for v. The compiler should ensure that it's efficient, and without it we're strictly losing some convenient info…

This is true, but the real goal of these debugging routines is to compare to EDN, cf. #188. With that in mind, I'm not going to expose TypedValue just yet. (Now, there's TypedValue::to_edn_value_pair already, which could be made Into<edn::Value> pretty easily, so this might not be much work, but let's see if there's frustration mapping a to Attribute before we do that work.)

rnewman · 2017-02-03T18:19:39Z

+        }))
+    })?.collect();
+
+    Ok(Datoms(r?.into_iter().filter_map(|x| x).collect()))


I feel there must be a simpler alternative to filter_map(|x| x) — perhaps don't collect the failures? Use filter_map earlier? Dunno.

There might be a simpler way, but I don't know it. Something has to collect() the inner Result instances into an outer Result before we can filter_map. It might be that for loops are better in some situations -- but I reach for the higher order functions first.
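One possible alternative, sketched under the assumption that each row maps to `Result<Option<T>, E>` where `Ok(None)` means "skip this row". `Result::transpose` was stabilised well after this review, so it was not an option at the time; the function names here are hypothetical.

```rust
/// Form from the diff: collect every row first, then drop the `None`s.
fn keep_some_two_pass<T, E, I>(rows: I) -> Result<Vec<T>, E>
where
    I: IntoIterator<Item = Result<Option<T>, E>>,
{
    let all: Result<Vec<Option<T>>, E> = rows.into_iter().collect();
    Ok(all?.into_iter().filter_map(|x| x).collect())
}

/// Single pass: `transpose` turns `Ok(None)` into `None`, so skipped rows are
/// filtered out before the remaining `Result`s are collected.
fn keep_some_one_pass<T, E, I>(rows: I) -> Result<Vec<T>, E>
where
    I: IntoIterator<Item = Result<Option<T>, E>>,
{
    rows.into_iter().filter_map(Result::transpose).collect()
}
```

Both versions return the first error encountered; the second simply avoids the intermediate `Vec<Option<T>>`.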

rnewman · 2017-02-03T20:10:15Z

+        let a: i64 = row.get_checked(1)?;
+
+        if a == entids::DB_TX_INSTANT {
+            return Ok(None);


It also occurs to me: we had better make sure that users never assign a new entity ID to any of the builtins!

rnewman · 2017-02-03T20:10:47Z

-            a: to_entid(a),
+            e: to_entid(db, e),
+            a: to_entid(db, a),
            v: value,


Same comment about TypedValue.

rnewman · 2017-02-03T20:12:33Z


 use edn::symbols;

+pub const SQLITE_MAX_VARIABLE_NUMBER: usize = 999;


This is a compile-time option in SQLite. We should upstream a patch to rusqlite.

Note that SQLITE_MAX_VARIABLE_NUMBER is 500,000 on Mac OS!

We can get this value at runtime by calling sqlite3_limit(db, 9, -1), where 9 is SQLITE_LIMIT_VARIABLE_NUMBER. I don't see that LIMIT constant exported by rusqlite.

This is the correct thing to do. Using S_M_V_N is expedient but won't detect a lower runtime limit imposed by sqlite3_limit.

Filed https://github.com/jgallagher/rusqlite/issues/220 to expose S_M_V_N.

Alternatively, we could consider looping over values and firing off a prepared statement over and over. I don't know which is slower: constructing the concatenated string and running one query with 999 variables, or running a small prepared statement 250 times within a transaction!

rnewman · 2017-02-03T20:52:42Z

+    // Like "(?, ?, ?)".
+    let inner = format!("({})", repeat("?").take(values_per_tuple).join(", "));
+    // Like "(?, ?, ?), (?, ?, ?)".
+    let values: String = repeat(inner).take(tuples).join(", ");


I know it's a huge pain in the ass, but I'm really interested to see the tradeoffs between using a fixed static string and a single prepared statement called a few times, perhaps with a mutable values array — i.e., no allocations for each write — versus doing all of this allocation in repeat_values and doing a single SQLite library call.

Can you informally measure?

There's a ton involved in doing this, and it's not time yet. I've filed #263 to track this for real. This really shows when you try to import a Places database, which will be ... a while.
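For readers following along, a standalone sketch of the allocation-heavy side of this trade-off: what a `repeat_values`-style helper produces. It uses only the standard library in place of the itertools `join` in the diff, and the constant is the assumed compile-time default from this PR, so treat it as an approximation rather than the patch's code.

```rust
/// Assumed compile-time default, mirroring the constant added in this PR.
const SQLITE_MAX_VARIABLE_NUMBER: usize = 999;

/// Build a VALUES placeholder list such as "(?, ?, ?), (?, ?, ?)".
/// Allocates one String per call, which is the cost being questioned above.
fn repeat_values(values_per_tuple: usize, tuples: usize) -> String {
    assert!(values_per_tuple >= 1 && tuples >= 1);
    // Same strict bound as the assertion suggested earlier in this review.
    assert!(values_per_tuple * tuples < SQLITE_MAX_VARIABLE_NUMBER);
    // Like "(?, ?, ?)".
    let inner = format!("({})", vec!["?"; values_per_tuple].join(", "));
    // Like "(?, ?, ?), (?, ?, ?)".
    vec![inner; tuples].join(", ")
}
```

Calling `repeat_values(3, 2)` yields `(?, ?, ?), (?, ?, ?)`. The alternative raised in the comment would prepare a single one-tuple statement once and execute it per row, trading string construction for more library calls.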


jaredhirsch added the in progress label Jan 30, 2017

ncalexan force-pushed the db-tests branch 3 times, most recently from 1cf9ccf to ec06fdc Compare February 1, 2017 18:48

ncalexan mentioned this pull request Feb 1, 2017

[tx] Compress flags on insertion and have SQLite expand internally #226

Closed

ncalexan force-pushed the db-tests branch from ec06fdc to 64ff2ef Compare February 1, 2017 23:24

ncalexan added a commit to ncalexan/mentat that referenced this pull request Feb 1, 2017

Start implementing bulk SQL insertion algorithms. (mozilla#214)

3166149

This is slightly simpler re-expression of the existing Clojure implementation.

ncalexan requested review from joewalker, jsantell, rnewman and victorporof February 1, 2017 23:26

jsantell approved these changes Feb 2, 2017

ncalexan commented Feb 2, 2017

ncalexan added a commit to ncalexan/mentat that referenced this pull request Feb 2, 2017

Start implementing bulk SQL insertion algorithms. (mozilla#214)

bd468a0

This is slightly simpler re-expression of the existing Clojure implementation.

rnewman reviewed Feb 2, 2017

rnewman reviewed Feb 3, 2017

rnewman approved these changes Feb 3, 2017

jsantell mentioned this pull request Feb 6, 2017

Testing bitflag expansion #242

Merged

This was referenced Feb 8, 2017

[meta] Full Datalog type support (instants, UUIDs, URIs) #201

Open

[db] Don't collect as many intermediate data structures when looking up [a v] pairs #262

Open

[db] Optimize bulk insertions using fixed-size statement caches and base expansion #263

Open

ncalexan added 7 commits February 8, 2017 13:25

Pre: Bump rusqlite version for https://github.com/jgallagher/rusqlite…

1088596

…/issues/211.

Pre: Add some value conversion tests.

59c1e5a

This is follow-up to earlier work. Turn TypedValue::Keyword into edn::Value::NamespacedKeyword. Don't take a reference to value_type_tag.

Pre: Use itertools.

d0c5761

Pre: Add repeat_values.

7b77b63

Requires itertools, so this commit is not stand-alone.

Pre: Expose the first transaction ID as bootstrap::TX0.

11e19f3

This is handy for testing.

Pre: Improve debug module.

5e01cfe

Start implementing bulk SQL insertion algorithms. (mozilla#214)

287986e

This is slightly simpler re-expression of the existing Clojure implementation.

ncalexan added 3 commits February 8, 2017 13:42

Post: Start generic data-driven transaction testing. (mozilla#188)

c53bb5a

Review comment: use ::{SYMBOL} instead of use {SYMBOL}.

9017ebc

Review comment: Prefer bindings_per_statement to values_per_statement.

07eda9c

ncalexan force-pushed the db-tests branch from 64ff2ef to 07eda9c Compare February 8, 2017 22:03

ncalexan merged commit afafcd6 into mozilla:rust Feb 8, 2017

jaredhirsch removed the in progress label Feb 8, 2017

ncalexan deleted the db-tests branch February 8, 2017 22:12

This was referenced Feb 9, 2017

[tx] Use better keywords in transaction testing #270

Closed

[tests] Use the EDN matcher for testing transaction output #271

Closed

[tx] Expand TypedValue to include non-namespaced keywords #285

Open

