[tx] Start implementing bulk SQL insertion algorithms #214
This is a slightly simpler re-expression of the existing Clojure implementation.
This is ready for review. As written it doesn't quite match the Wiki, but the spirit is identical. I'll try to expand on the approach and patch the Wiki as I go along.
I've flagged the world for review, so lay it on me, folks.
jsantell left a comment
Some comments; I cannot speak to correctness, however.
| # TODO: don't depend on num and ordered-float; expose helpers in edn abstracting necessary constructors. | ||
| num = "0.1.35" | ||
| ordered-float = "0.3.0" | ||
| time = "0.1.35" |
Is there any way to keep these deps (and other modules) in sync with the top-level dependencies?
I'm having this trouble, too.
In theory you can do:
[dependencies.ordered-float]
and it'll use whichever one is already in use. Except that just broke for me on Travis, so instead I have to pin very, very carefully. We just need to rely on compiler errors if we get this wrong.
| Ok(DB::new(partition_map, schema)) | ||
| } | ||
| use itertools; |
Is it common to have these use and type statements in the middle of a file? Also, in general, this is getting pretty long, maybe break it out into a new file?
I moved the type definitions to types and lifted the use statements to the top of the file. These were just oversights.
Also, re: file length -- yes, this is long. I have Grand Plans for a split between the DB layer and the transaction processing layer but I don't want to make the split until more of the transactor is in place. So eventually I expect two files of similar size, but right now it's just one big file. Sorry!
| .chain(once(to_bool_ref(index_avet) as &ToSql) | ||
| .chain(once(to_bool_ref(index_vaet) as &ToSql) | ||
| .chain(once(to_bool_ref(index_fulltext) as &ToSql) | ||
| .chain(once(to_bool_ref(unique_value) as &ToSql)))))))))) |
I've filed #261 to do better. Please comment there, take, or mentor that ticket!
| fn search(&self, conn: &rusqlite::Connection) -> Result<()> { | ||
| // First is fast, only one table walk: lookup by exact eav. | ||
| // Second is slower, but still only one table walk: lookup old value by ea. | ||
| let s = r#" |
Would it help with clarity/size to pull these SQL strings out into their own module as well?
Generally I have found not, since it adds a level of indirection. But Firefox does this for some modules, including Places IIRC. I'd like to leave them more or less inline until they're more concrete, or truly unmanageable.
Most of Firefox on all platforms keeps queries inline in code. Reasons:
- They're almost never reused. If they are, it makes the code hard to change later — the "beginners copy and paste; professionals refactor; veterans copy and paste" thing.
- Keeping them elsewhere makes it harder to understand the code, losing context.
- It makes it harder to delete unused queries if they're divorced from their calling code.
| vec![(":db.part/db", 0, (1 + V2_IDENTS.len()) as i64), | ||
| (":db.part/user", 0x10000, 0x10000), | ||
| (":db.part/tx", 0x10000000, 0x10000000), | ||
| (":db.part/user", TX0, TX0), |
Oops! :db.part/user should remain untouched. Fixed locally.
rnewman left a comment
Incomplete review. More tomorrow.
| lazy_static = "0.2.2" | ||
| # TODO: don't depend on num and ordered-float; expose helpers in edn abstracting necessary constructors. | ||
| num = "0.1.35" | ||
| ordered-float = "0.3.0" |
I just bumped this to 0.4.0 in a6659ae. You might want to do the same if this lands after.
| use rusqlite::types::{ToSql, ToSqlOutput}; | ||
| use time; | ||
| use {repeat_values, to_namespaced_keyword}; |
What does this use syntax mean?!
It's equivalent to use ::{SYMBOL}, which I have replaced it with.
| pub fn resolve_avs<'a>(&self, conn: &rusqlite::Connection, avs: &'a [AVPair]) -> Result<AVMap<'a>> { | ||
| // Start search_id's at some identifiable number. | ||
| let initial_search_id = 2000; | ||
| let values_per_statement = 4; |
| // produce the map [a v] -> e. | ||
| // | ||
| // TODO: `collect` into a HashSet so that any (a, v) is resolved at most once. | ||
| let chunks: itertools::IntoChunks<_> = avs.into_iter().enumerate().chunks(::SQLITE_MAX_VARIABLE_NUMBER / 4); |
| }).collect(); | ||
| // TODO: cache these statements for selected values of `count`. | ||
| let values: String = repeat_values(values_per_statement, count); |
Before this line:
assert!((values_per_statement * count) < ::SQLITE_MAX_VARIABLE_NUMBER);
This is asserted in repeat_values, so it's in place for all consumers.
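For reference, here is a minimal std-only sketch of a repeat_values-style helper that carries the bound check itself, so callers need no separate assertion. It is not the code under review (which uses itertools); the value 999 comes from the constant quoted in this thread, and the extra non-zero checks are an assumption added for illustration.

```rust
/// Assumed to mirror the constant defined in this PR.
const SQLITE_MAX_VARIABLE_NUMBER: usize = 999;

/// Build a placeholder list like "(?, ?, ?), (?, ?, ?)".
/// Panics if the statement would need too many bound variables.
fn repeat_values(values_per_tuple: usize, tuples: usize) -> String {
    // Illustrative sanity checks (assumed, not taken from the PR).
    assert!(values_per_tuple >= 1);
    assert!(tuples >= 1);
    // The bound discussed above, checked once for every consumer.
    assert!(values_per_tuple * tuples < SQLITE_MAX_VARIABLE_NUMBER);

    // Like "(?, ?, ?)".
    let inner = format!("({})", vec!["?"; values_per_tuple].join(", "));
    // Like "(?, ?, ?), (?, ?, ?)".
    vec![inner; tuples].join(", ")
}
```

Calling repeat_values(3, 2) yields "(?, ?, ?), (?, ?, ?)", while a request such as repeat_values(4, 250) trips the assertion because it would need 1000 variables.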
| let values: String = repeat_values(values_per_statement, count); | ||
| let s: String = format!("WITH t(search_id, a, v, value_type_tag) AS (VALUES {}) SELECT t.search_id, d.e \ | ||
| FROM t, all_datoms AS d \ | ||
| WHERE d.index_avet IS NOT 0 AND d.a = t.a AND d.value_type_tag = t.value_type_tag AND d.v = t.v", values); |
Put values on a line by itself. It's wider than GitHub can show, and it's an important thing!
Add a TODO about using something other than all_datoms. We know all the a values, so we can turn this into two queries against the individual datoms tables, connected with a UNION ALL. In the common case, where most unique attributes will not be fulltext-indexed, we'll be querying just datoms.
| let results: Vec<(i64, Entid)> = results?.as_slice().concat(); | ||
| // Create map [a v] -> e. | ||
| let m: HashMap<&'a AVPair, Entid> = results.into_iter().map(|(search_id, entid)| { |
Can we not simply accumulate directly into a mutable HashMap as we walk the chunk iter? Seems silly to collect results into nested vecs, then flatten them, then walk them to produce a map…
Yeah, there's a bunch of ways to phrase this. I want it this way to demonstrate the chunking technique with higher order functions (partly for myself!). I'm going to keep it as it is for now, but I've filed #262 to do better eventually.
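To make the alternative concrete, here is a rough sketch of the "accumulate directly into a mutable HashMap" phrasing, assuming a simplified AVPair type and a hypothetical `lookup` closure standing in for the SQL query. Neither is the PR's actual definition, and the search ids here are plain slice indexes rather than starting at 2000.

```rust
use std::collections::HashMap;

const SQLITE_MAX_VARIABLE_NUMBER: usize = 999;
const VALUES_PER_TUPLE: usize = 4;

type Entid = i64;
// Hypothetical stand-in for the PR's AVPair type.
type AVPair = (Entid, String);

/// Resolve (a, v) pairs chunk by chunk, inserting results into one map
/// as each chunk comes back. `lookup` is a hypothetical stand-in for the
/// query: it receives (search_id, pair) tuples and returns
/// (search_id, entid) rows for the pairs that matched.
fn resolve_avs<'a, F>(avs: &'a [AVPair], mut lookup: F) -> Result<HashMap<&'a AVPair, Entid>, String>
where
    F: FnMut(&[(usize, &'a AVPair)]) -> Result<Vec<(usize, Entid)>, String>,
{
    let max_tuples = SQLITE_MAX_VARIABLE_NUMBER / VALUES_PER_TUPLE;
    let mut map = HashMap::new();

    for (chunk_index, chunk) in avs.chunks(max_tuples).enumerate() {
        // Tag each pair with its index in `avs` so rows can be mapped back.
        let base = chunk_index * max_tuples;
        let tagged: Vec<(usize, &'a AVPair)> = chunk
            .iter()
            .enumerate()
            .map(|(i, av)| (base + i, av))
            .collect();

        // Any failing chunk aborts the whole resolution via `?`.
        for (search_id, entid) in lookup(&tagged)? {
            if let Some(av) = avs.get(search_id) {
                map.insert(av, entid);
            }
        }
    }
    Ok(map)
}
```

This trades the nested-Vec-then-flatten pipeline for a plain loop; whether that reads better than the higher-order version is the matter of taste discussed above.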
| let x: Result<Vec<()>> = r.into_iter().collect(); | ||
| x.map(|_| ()) | ||
| let r: Result<Vec<()>> = r.into_iter().collect(); | ||
| r?; |
Sadly, the "simplification" is the turbofish (::<>) which is even harder to read.
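For comparison, a small sketch of the spellings in play, using String as a stand-in error type (an assumption; the PR has its own Result alias). The last variant relies on `()` implementing `FromIterator<()>`, which is available in current Rust and may postdate this review.

```rust
/// Annotated binding, as in the diff above.
fn check_all<I>(steps: I) -> Result<(), String>
where
    I: IntoIterator<Item = Result<(), String>>,
{
    let collected: Result<Vec<()>, String> = steps.into_iter().collect();
    collected?;
    Ok(())
}

/// The turbofish spelling: no binding, but a noisier expression.
fn check_all_turbofish<I>(steps: I) -> Result<(), String>
where
    I: IntoIterator<Item = Result<(), String>>,
{
    steps.into_iter().collect::<Result<Vec<()>, String>>()?;
    Ok(())
}

/// Collecting straight into Result<(), _>: the return type drives
/// inference, so neither a binding nor a turbofish is needed.
fn check_all_unit<I>(steps: I) -> Result<(), String>
where
    I: IntoIterator<Item = Result<(), String>>,
{
    steps.into_iter().collect()
}
```

All three stop at the first Err and return it.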
| /// | ||
| /// The datom set returned does not include any datoms of the form [... :db/txInstant ...]. | ||
| pub fn datoms_after(conn: &rusqlite::Connection, db: &DB, tx: i64) -> Result<Datoms> { | ||
| let mut stmt: rusqlite::Statement = conn.prepare("SELECT e, a, v, value_type_tag, tx FROM datoms WHERE tx > ? ORDER BY e ASC, a ASC, v ASC, tx ASC")?; |
WHERE tx > ? AND a IS NOT ?…", &[&tx, entids::DB_TX_INSTANT]? Might as well avoid all of those rows…
I thought about this, and decided to do it as post-processing. There's other rewriting happening and I think the final form of these test functions might look a lot like a pattern matcher on edn::Values, in which case I don't want to filter at all.
This is really just a matter of taste, and I want to be as close to the actual DB contents in memory as I can be -- for now.
| let a: i64 = row.get_checked(1)?; | ||
| if a == entids::DB_TX_INSTANT { | ||
| return Ok(None); |
… so you don't need to do this.
It also occurs to me: we had better make sure that users never assign a new entity ID to any of the builtins!
Yes, although if you footgun yourself you footgun yourself :) I think the route to only letting the transactor allocate entids is via #190.
| Ok(Some(Datom { | ||
| e: to_entid(db, e), | ||
| a: to_entid(db, a), | ||
| v: value, |
I wonder if it's worth keeping the TypedValue wrapper here for v. The compiler should ensure that it's efficient, and without it we're strictly losing some convenient info…
This is true, but the real goal of these debugging routines is to compare to EDN, cf. #188. With that in mind, I'm not going to expose TypedValue just yet. (Now, there's TypedValue::to_edn_value_pair already, which could be made Into<edn::Value> pretty easily, so this might not be much work, but let's see if there's frustration mapping a to Attribute before we do that work.)
| })) | ||
| })?.collect(); | ||
| Ok(Datoms(r?.into_iter().filter_map(|x| x).collect())) |
I feel there must be a simpler alternative to filter_map(|x| x) — perhaps don't collect the failures? Use filter_map earlier? Dunno.
There might be a simpler way, but I don't know it. Something has to collect() the inner Result instances into an outer Result before we can filter_map. It might be that for loops are better in some situations -- but I reach for the higher order functions first.
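One possible alternative, sketched under the assumption that each row maps to a Result<Option<T>, E>: `Result::transpose` (in current Rust; it may not have existed when this was written) turns each item into an Option<Result<T, E>>, so a single filter_map drops the skipped rows before the collect. The function name and types below are illustrative, not from the PR.

```rust
/// Keep the Ok(Some(_)) rows, drop the Ok(None) rows, and stop at the
/// first Err, without an intermediate Vec<Option<T>>.
fn keep_some<T, E, I>(rows: I) -> Result<Vec<T>, E>
where
    I: IntoIterator<Item = Result<Option<T>, E>>,
{
    rows.into_iter()
        // Ok(None) becomes None and is filtered out;
        // Ok(Some(x)) becomes Some(Ok(x)); Err(e) becomes Some(Err(e)).
        .filter_map(|row| row.transpose())
        .collect()
}
```

The collect into an outer Result is still there, as noted above; only the second pass with filter_map(|x| x) goes away.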
| a: to_entid(a), | ||
| e: to_entid(db, e), | ||
| a: to_entid(db, a), | ||
| v: value, |
Same comment about TypedValue.
| use edn::symbols; | ||
| pub const SQLITE_MAX_VARIABLE_NUMBER: usize = 999; |
This is a compile-time option in SQLite. We should upstream a patch to rusqlite.
Note that SQLITE_MAX_VARIABLE_NUMBER is 500,000 on Mac OS!
We can get this value at runtime by calling sqlite3_limit(db, 9, -1), where 9 is SQLITE_LIMIT_VARIABLE_NUMBER. I don't see that LIMIT constant exported by rusqlite.
This is the correct thing to do. Using S_M_V_N is expedient but won't detect a lower runtime limit imposed by sqlite3_limit.
Filed https://github.com/jgallagher/rusqlite/issues/220 to expose S_M_V_N.
Alternatively, we could consider looping over values and firing off a prepared statement over and over. I don't know which is slower: constructing the concatenated string and running one query with 999 variables, or running a small prepared statement 250 times within a transaction!
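To put rough numbers on that trade-off, here is a small arithmetic sketch of how the variable limit determines tuples per statement and statement count. The helper names are hypothetical; the limits 999 and 500,000 are the figures quoted in this thread.

```rust
/// How many tuples fit in one statement under a given variable limit.
fn tuples_per_statement(variable_limit: usize, values_per_tuple: usize) -> usize {
    assert!(values_per_tuple > 0);
    variable_limit / values_per_tuple
}

/// How many statements are needed to cover `tuples` tuples (rounded up).
fn statements_needed(tuples: usize, variable_limit: usize, values_per_tuple: usize) -> usize {
    let per_statement = tuples_per_statement(variable_limit, values_per_tuple);
    assert!(per_statement > 0);
    (tuples + per_statement - 1) / per_statement
}
```

With 4 values per tuple, a limit of 999 allows 249 tuples per statement, so 1000 tuples take 5 statements; with a limit of 500,000 the same 1000 tuples fit in one.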
| // Like "(?, ?, ?)". | ||
| let inner = format!("({})", repeat("?").take(values_per_tuple).join(", ")); | ||
| // Like "(?, ?, ?), (?, ?, ?)". | ||
| let values: String = repeat(inner).take(tuples).join(", "); |
I know it's a huge pain in the ass, but I'm really interested to see the tradeoffs between using a fixed static string and a single prepared statement called a few times, perhaps with a mutable values array — i.e., no allocations for each write — versus doing all of this allocation in repeat_values and doing a single SQLite library call.
Can you informally measure?
There's a ton involved in doing this, and it's not time yet. I've filed #263 to track this for real. This really shows when you try to import a Places database, which will be ... a while.
This is follow-up to earlier work. Turn TypedValue::Keyword into edn::Value::NamespacedKeyword. Don't take a reference to value_type_tag.
Requires itertools, so this commit is not stand-alone.
This is handy for testing.
This will eventually grow to implement https://github.com/mozilla/mentat/wiki/Transacting:-entity-to-SQL-translation.