Skip to content
This repository was archived by the owner on Sep 12, 2018. It is now read-only.

[tx] Start implementing bulk SQL insertion algorithms#214

Merged
ncalexan merged 10 commits into
mozilla:rustfrom
ncalexan:db-tests
Feb 8, 2017
Merged

[tx] Start implementing bulk SQL insertion algorithms#214
ncalexan merged 10 commits into
mozilla:rustfrom
ncalexan:db-tests

Conversation

@ncalexan
Copy link
Copy Markdown
Member

@ncalexan ncalexan force-pushed the db-tests branch 3 times, most recently from 1cf9ccf to ec06fdc Compare February 1, 2017 18:48
ncalexan added a commit to ncalexan/mentat that referenced this pull request Feb 1, 2017
This is slightly simpler re-expression of the existing Clojure
implementation.
@ncalexan
Copy link
Copy Markdown
Member Author

ncalexan commented Feb 1, 2017

This is ready for review. As written it doesn't quite match the Wiki, but the spirit is identical. I'll try to expand on the approach and patch the Wiki as I go along.

@ncalexan
Copy link
Copy Markdown
Member Author

ncalexan commented Feb 1, 2017

I've flagged the world for review, so lay it on me, folks.

Copy link
Copy Markdown
Contributor

@jsantell jsantell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments, I cannot speak for correctness however

Comment thread db/Cargo.toml
# TODO: don't depend on num and ordered-float; expose helpers in edn abstracting necessary constructors.
num = "0.1.35"
ordered-float = "0.3.0"
time = "0.1.35"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anyway to keep these deps (and other modules) in sync with the top-level dependencies?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having this trouble, too.

In theory you can do:

[dependencies.ordered-float]

and it'll use whichever one is already in use. Except that just broke for me on Travis, so instead I have to pin very, very carefully. We just need to rely on compiler errors if we get this wrong.

Comment thread db/src/db.rs Outdated
Ok(DB::new(partition_map, schema))
}

use itertools;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it common to have these use and type statements in the middle of a file? Also, in general, this is getting pretty long, maybe break it out into a new file?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type more so than use.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the type definitions to types and lifted the use statements to the top of the file. These were just oversights.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, re: file length -- yes, this is long. I have Grand Plans for a split between the DB layer and the transaction processing layer but I don't want to make the split until more of the transactor is in place. So eventually I expect two files of similar size, but right now it's just one big file. Sorry!

Comment thread db/src/db.rs
.chain(once(to_bool_ref(index_avet) as &ToSql)
.chain(once(to_bool_ref(index_vaet) as &ToSql)
.chain(once(to_bool_ref(index_fulltext) as &ToSql)
.chain(once(to_bool_ref(unique_value) as &ToSql))))))))))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • good
    • golly
      • miss
        • molly

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've filed #261 to do better. Please comment there, take, or mentor that ticket!

Comment thread db/src/db.rs
fn search(&self, conn: &rusqlite::Connection) -> Result<()> {
// First is fast, only one table walk: lookup by exact eav.
// Second is slower, but still only one table walk: lookup old value by ea.
let s = r#"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it help with clarity/size to pull out these SQL strings into it's own module as well?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I have found not, since it adds a level of indirection. But Firefox does this for some modules, including Places IIRC. I'd like to leave them more or unless inline until they're more concrete, or truly unmanageable.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of Firefox on all platforms keeps queries inline in code. Reasons:

  • They're almost never reused. If they are, it makes the code hard to change later — the "beginners copy and paste; professionals refactor; veterans copy and paste" thing.
  • Keeping them elsewhere makes it harder to understand the code, losing context.
  • It makes it harder to delete unused queries if they're divorced from their calling code.

Comment thread db/src/bootstrap.rs Outdated
vec![(":db.part/db", 0, (1 + V2_IDENTS.len()) as i64),
(":db.part/user", 0x10000, 0x10000),
(":db.part/tx", 0x10000000, 0x10000000),
(":db.part/user", TX0, TX0),
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! :db.part/user should remain untouched. Fixed locally.

ncalexan added a commit to ncalexan/mentat that referenced this pull request Feb 2, 2017
This is slightly simpler re-expression of the existing Clojure
implementation.
Copy link
Copy Markdown
Collaborator

@rnewman rnewman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incomplete review. More tomorrow.

Comment thread db/Cargo.toml Outdated
lazy_static = "0.2.2"
# TODO: don't depend on num and ordered-float; expose helpers in edn abstracting necessary constructors.
num = "0.1.35"
ordered-float = "0.3.0"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just bumped this to 0.4.0 in a6659ae. You might want to do the same if this lands after.

Comment thread db/src/db.rs Outdated
use rusqlite::types::{ToSql, ToSqlOutput};
use time;

use {repeat_values, to_namespaced_keyword};
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this use syntax mean?!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's equivalent to use ::{SYMBOL}, which I have replaced it with.

Copy link
Copy Markdown
Collaborator

@rnewman rnewman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More progress!

Comment thread db/src/db.rs Outdated
pub fn resolve_avs<'a>(&self, conn: &rusqlite::Connection, avs: &'a [AVPair]) -> Result<AVMap<'a>> {
// Start search_id's at some identifiable number.
let initial_search_id = 2000;
let values_per_statement = 4;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bindings_per_statement

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread db/src/db.rs Outdated
// produce the map [a v] -> e.
//
// TODO: `collect` into a HashSet so that any (a, v) is resolved at most once.
let chunks: itertools::IntoChunks<_> = avs.into_iter().enumerate().chunks(::SQLITE_MAX_VARIABLE_NUMBER / 4);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ bindings_per_statement

Comment thread db/src/db.rs Outdated
}).collect();

// TODO: cache these statements for selected values of `count`.
let values: String = repeat_values(values_per_statement, count);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before this line:

assert!((values_per_statement * count) < ::SQLITE_MAX_VARIABLE_NUMBER);

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is asserted in repeat_values, so it's in place for all consumers.

Comment thread db/src/db.rs Outdated
let values: String = repeat_values(values_per_statement, count);
let s: String = format!("WITH t(search_id, a, v, value_type_tag) AS (VALUES {}) SELECT t.search_id, d.e \
FROM t, all_datoms AS d \
WHERE d.index_avet IS NOT 0 AND d.a = t.a AND d.value_type_tag = t.value_type_tag AND d.v = t.v", values);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put values on a line by itself. It's wider than GitHub can show, and it's an important thing!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a TODO about using something other than all_datoms. We know all the as, so we can turn this into two queries against the individual datoms tables, connected with a UNION ALL. In the common case, where most unique attributes will not be fulltext-indexed, we'll be querying just datoms.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done both.

Comment thread db/src/db.rs Outdated
let results: Vec<(i64, Entid)> = results?.as_slice().concat();

// Create map [a v] -> e.
let m: HashMap<&'a AVPair, Entid> = results.into_iter().map(|(search_id, entid)| {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not simply accumulate directly into a mutable HashMap as we walk the chunk iter? Seems silly to collect results into nested vecs, then flatten them, then walk them to produce a map…

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there's a bunch of ways to phrase this. I want it this way to demonstrate the chunking technique with higher order functions (partly for myself!). I'm going to keep it as it is for now, but I've filed #262 to do better eventually.

Comment thread db/src/db.rs
let x: Result<Vec<()>> = r.into_iter().collect();
x.map(|_| ())
let r: Result<Vec<()>> = r.into_iter().collect();
r?;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplify?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly, the "simplification" is the turbofish (::<>) which is even harder to read.

Comment thread db/src/debug.rs
///
/// The datom set returned does not include any datoms of the form [... :db/txInstant ...].
pub fn datoms_after(conn: &rusqlite::Connection, db: &DB, tx: i64) -> Result<Datoms> {
let mut stmt: rusqlite::Statement = conn.prepare("SELECT e, a, v, value_type_tag, tx FROM datoms WHERE tx > ? ORDER BY e ASC, a ASC, v ASC, tx ASC")?;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WHERE tx > ? AND a IS NOT ?…", &[&tx, entids::DB_TX_INSTANT]? Might as well avoid all of those rows…

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this, and decided to do it as post-processing. There's other rewriting happening and I think the final form of these test functions might look a lot like a pattern matcher on edn::Values, in which case I don't want to filter at all.

This is really just a matter of taste, and I want to be as close to the actual DB contents in memory as I can be -- for now.

Comment thread db/src/debug.rs
let a: i64 = row.get_checked(1)?;

if a == entids::DB_TX_INSTANT {
return Ok(None);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

… so you don't need to do this.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also occurs to me: we had better make sure that users never assign a new entity ID to any of the builtins!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, although if you footgun yourself you footgun yourself :) I think the route to only letting the transactor allocate entids is via #190.

Comment thread db/src/debug.rs
Ok(Some(Datom {
e: to_entid(db, e),
a: to_entid(db, a),
v: value,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it's worth keeping the TypedValue wrapper here for v. The compiler should ensure that it's efficient, and without it we're strictly losing some convenient info…

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is true, but the real goal of these debugging routines is to compare to EDN, c.f. #188. With that in mind, I'm not going to expose TypedValue just yet. (Now, there's TypedValue::to_edn_value_pair already, which could be made Into<edn::Value> pretty easily, so this might not be much work, but let's see if there's frustration mapping a to Attribute before we do that work.)

Comment thread db/src/debug.rs
}))
})?.collect();

Ok(Datoms(r?.into_iter().filter_map(|x| x).collect()))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel there must be a simpler alternative to filter_map(|x| x) — perhaps don't collect the failures? Use filter_map earlier? Dunno.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be a simpler way, but I don't know it. Something has to collect() the inner Result instances into an outer Result before we can filter_map. It might be that for loops are better in some situations -- but I reach for the higher order functions first.

Comment thread db/src/debug.rs
let a: i64 = row.get_checked(1)?;

if a == entids::DB_TX_INSTANT {
return Ok(None);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also occurs to me: we had better make sure that users never assign a new entity ID to any of the builtins!

Comment thread db/src/debug.rs
a: to_entid(a),
e: to_entid(db, e),
a: to_entid(db, a),
v: value,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about TypedValue.

Comment thread db/src/lib.rs

use edn::symbols;

pub const SQLITE_MAX_VARIABLE_NUMBER: usize = 999;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a compile-time option in SQLite. We should upstream a patch to rusqlite.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that SQLITE_MAX_VARIABLE_NUMBER is 500,000 on Mac OS!

We can get this value at runtime by calling sqlite3_limit(db, 9, -1), where 9 is SQLITE_LIMIT_VARIABLE_NUMBER. I don't see that LIMIT constant exported by rusqlite.

This is the correct thing to do. Using S_M_V_N is expedient but won't detect a lower runtime limit imposed by sqlite3_limit.

Filed https://github.com/jgallagher/rusqlite/issues/220 to expose S_M_V_N.

Alternatively, we could consider looping over values and firing off a prepared statement over and over. I don't know which is slower: constructing the concatenated string and running one query with 999 variables, or running a small prepared statement 250 times within a transaction!

Comment thread db/src/lib.rs
// Like "(?, ?, ?)".
let inner = format!("({})", repeat("?").take(values_per_tuple).join(", "));
// Like "(?, ?, ?), (?, ?, ?)".
let values: String = repeat(inner).take(tuples).join(", ");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's a huge pain in the ass, but I'm really interested to see the tradeoffs between using a fixed static string and a single prepared statement called a few times, perhaps with a mutable values array — i.e., no allocations for each write — versus doing all of this allocation in repeat_values and doing a single SQLite library call.

Can you informally measure?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a ton involved in doing this, and it's not time yet. I've filed #263 to track this for real. This really shows when you try to import a Places database, which will be ... a while.

This is follow-up to earlier work.  Turn TypedValue::Keyword into
edn::Value::NamespacedKeyword.  Don't take a reference to
value_type_tag.
Requires itertools, so this commit is not stand-alone.
This is slightly simpler re-expression of the existing Clojure
implementation.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants