ENH: add expression evaluation functionality via eval #4164

cpcloud · 2013-07-08T20:32:22Z

closes #3393
closes #2560

cc @jreback

`eval` Roadmap

PyTables

allows natural syntax for queries, with in-line variables allowed

some examples
"ts>=Timestamp('2012-02-01') & users=['a','b','c']"

['A > 0']

'(index>df.index[3] & index<=df.index[6]) | string="bar"'

dates=pd.date_range('20130101',periods=5)
'index>dates'

Todos:

Release Notes
v0.13.txt examples (and link to docs)
update docs / need new examples
API changes (disallow dict, show deprecations for separated expressions)
tests for | and ~ operators
tests for invalid filter expressions
code clean up, maybe create new Expr base class to opt in/out of allowed operations
(e.g. Numexpr doesn't want to allow certain operations, but PyTables needs them,
so maybe define in the base class for simplicity, and just not allow them (checked in visit),
which right now doesn't allow a generic_visitor)
can ops.Value be equiv of ops.Constant?

`query` method

a few more doc examples
add a section in query docstring to show use of __getitem__ and then add to indexing.rst
query docstring

`DataFrame.eval` method

implement query in terms of this
tests

Documentation

RELEASE NOTES
v0.13.txt examples & link to docs
document in and not in
enhancing performance section
eval docstring
Warn users that this is not a magic bullet
- E.g., x = 3; y = 2; pd.eval('x // y', engine='python') is many times slower than the same operation in actual Python.

Engines

python engine for testing (basically a restricted eval, should make sure nothing dangerous a la eval('os.system("sudo rm -rf /")')) is possible here
numexpr engine

Parsers

python parser is currrently not failing on complex bool ops involving subscripting like it should (potentially could eval the compiled AST instead of dispatching to function calls)
preparse should handle subscripting
pandas parser should handle subscripting, i.e., df.query(...) == pd.eval('df[...]')
pandas parser
python parser
pytables engine and parser (internal)

Should python be the default engine for small frames?

Functions

toplevel isexpr
move eval to top-level pandas

Error Handling

syntax error handling
name error handling
ugly stack overflow when an expression is not in frame...issue with variable resolution in expr.py. should fail fast.

Alignment

Algorithm
1. flatten the tree of terms
2. resolve the variables, special casing only numpy arrays/scalars/single unary operations (probably room for more abstraction here)
3. align by joining on the indices using the current broadcasting behavior
4. update the original namespace with the newly aligned version of the underlying numpy array
5. reconstruct the final object with the aforementioned joined indices and type
unary alignment
binary alignment

Scope

add @a syntax for referencing local variables with the same name as a column
deal with global scoping issues (not sure if this is tested well enough...which means probably not)
allow Expr objects to be passed to eval (Expr will pull in the local and global variables from the calling stack frame)

Miscellaneous

Operators

%
**
unary ops don't play well when they are operands of binary ops

Tests

Syntax (Easier)

handle in, not in ( ~~was thinking~~ implemented as a in b -> b.isin(a) and a not in b -> not a in b -> ~b.isin(a))

Syntax (More difficult)

reductions (sum and prod)
math ops (sin, cos, tan, sqrt, etc.), i.e., whatever numexpr supports
~~function calls~~ (we're not building a Python interpreter here 😉)

Optimizations

There's a quite a bit of recursion going on here, where maybe something stack-based might be more efficient. For example, an expression like df['f in g and a < b < (c ** 2 + b ** 2 - a ** 3) and d in b not in e'] will make 3 recursive calls to eval (not including the top level call). One for the first in expression and two for the in/not in chain. A stack-based implementation of the parser (instead of recursively visiting nodes) could possible shave off a few milliseconds. Not sure if this is worth it though.

Others

~~alignment for NDFrame~~ Generic alignment function and new module core.align #4336
~~replace current evaluate function~~

jreback · 2013-07-08T20:37:48Z

ok....was able to pull/push just fine now....

maybe the original rebase caused me to look at something different?

cpcloud · 2013-07-08T20:38:47Z

ok so no rebasing until later...agreed?

jreback · 2013-07-08T20:39:11Z

[sheep-jreback-~/pandas] git remote show origin
* remote origin
  Fetch URL: git://github.com/pydata/pandas.git
  Push  URL: git://github.com/pydata/pandas.git
  HEAD branch: master
  Remote branches:
    PR_json-normalize-proto                    tracked
    add-msgpack                                tracked
    eval-3393                                  tracked
    index-not-ndarray                          tracked
    master                                     tracked
    range-index                                tracked
    refs/remotes/origin/PR-test-3503           stale (use 'git remote prune' to remove)
    refs/remotes/origin/applymap               stale (use 'git remote prune' to remove)
    refs/remotes/origin/c-parser               stale (use 'git remote prune' to remove)
    refs/remotes/origin/cython-refactor        stale (use 'git remote prune' to remove)
    refs/remotes/origin/datetime64-inference   stale (use 'git remote prune' to remove)
    refs/remotes/origin/eval3                  stale (use 'git remote prune' to remove)
    refs/remotes/origin/eval4                  stale (use 'git remote prune' to remove)
    refs/remotes/origin/groupby-apply-refactor stale (use 'git remote prune' to remove)
    refs/remotes/origin/large-excel-files      stale (use 'git remote prune' to remove)
    refs/remotes/origin/maintenance/0.7.x      stale (use 'git remote prune' to remove)
    refs/remotes/origin/maintenance/0.9.x      stale (use 'git remote prune' to remove)
    refs/remotes/origin/meta                   stale (use 'git remote prune' to remove)
    refs/remotes/origin/modes                  stale (use 'git remote prune' to remove)
    refs/remotes/origin/nanoseconds            stale (use 'git remote prune' to remove)
    refs/remotes/origin/period-unit-change     stale (use 'git remote prune' to remove)
    refs/remotes/origin/string-methods         stale (use 'git remote prune' to remove)
    refs/remotes/origin/timezone-fixes         stale (use 'git remote prune' to remove)
    refs/remotes/origin/tseries-plotting       stale (use 'git remote prune' to remove)
    refs/remotes/origin/unicode-fix            stale (use 'git remote prune' to remove)
    refs/remotes/origin/unsafe-assign-dtype    stale (use 'git remote prune' to remove)
  Local branches configured for 'git pull':
    eval-3393 rebases onto remote eval-3393
    master    rebases onto remote master
  Local refs configured for 'git push':
    eval-3393 pushes to eval-3393 (up to date)
    master    pushes to master    (up to date)

this eval-3393 push to eval-3393 is good (was not there previously).

maybe having a different local name was a problem somehow

jreback · 2013-07-08T20:39:30Z

why don't you try rebasing those last 2 commits into one?

jreback · 2013-07-08T20:39:41Z

see what happens

cpcloud · 2013-07-08T20:40:05Z

ok let's see...

jreback · 2013-07-08T20:43:04Z

i picked up the change.....

so I guess as we don't rebase at the same time or while one is working, maybe ok

cpcloud · 2013-07-08T20:43:31Z

going to be hard to coordinate that though

jreback · 2013-07-08T20:43:38Z

in any event....let's just push for now...can fix commits later

you are welcome to create BaseExprVisitor or somesuch if you want

jreback · 2013-07-08T20:43:51Z

not as long as we push/pull.....just don't rebase

cpcloud · 2013-07-08T20:44:12Z

right yeah that's what i meant

cpcloud · 2013-07-08T21:31:42Z

should the numexpr engine allow the = syntax for equality? i would tend toward making the smallest possible superset of syntax allowed preferable there's nothing extra, except later maybe and or not and we change the precedence of bitwise operators...so i would only allow = to replace == in pytables expressions

jreback · 2013-07-08T21:49:25Z

i would make = raise in Numexpr as maybe someday can implement it; for now should be an error I think

cpcloud · 2013-07-09T00:08:28Z

got a query method going and df['b + c < a + b']-style indexing going + refactor the base visitor class

throwing a NameError in some tests in the expression in getitem, need to figure out why rather than just catching and then passing to the usual functionality (which works, but i need to know why)

cpcloud · 2013-07-09T00:14:47Z

hm maybe query for now and indexing later. seems to break a ton of stuff i don't have it down just yet what the rules should be for indexing with something that could be an expression

cpcloud · 2013-07-09T00:22:40Z

actually only a few tests were broken by indexing, my refactor seems to have broken the other things

jreback · 2013-07-09T00:39:27Z

ok.....lmk when I should take a look....I am pretty much done for now....(maybe will start on docs soon)

cpcloud · 2013-07-09T01:03:50Z

thanks for the pytables stuff!

jreback · 2013-07-09T01:12:20Z

it was built on a nice base!

cpcloud · 2013-07-09T01:13:56Z

ah shucks, thanks :)

cpcloud · 2013-07-09T02:12:45Z

i will support chained comparisons--they aren't a big deal to add...although things will be evaluated twice by numexpr.

e.g., a < b < c -> a < b & b < c

more generally

x1 < x2 < ... < xN -> x1 < x2 & x2 < x3 & ... & xN-2 < xN-1 & xN-1 < xN

so inner terms will be evaluated 2 * (N - 2) times

something like

reducer = lambda x, y: BinOp(op=BitAnd, operands=[Compare(op=node.op, comparators=[x, y])])
reduce(reducer, node.comparators)

cpcloud · 2013-07-09T02:13:55Z

i wish github had latex

jreback · 2013-07-09T02:17:41Z

are u doing this transform directly?

I don't think me will mind the extras prob optimize them away anyhow

cpcloud · 2013-07-09T02:30:02Z

yes i would do it in the base class

essentially build a new node and then visit it similar idea to assignment.

compare nodes will be passed over N + 2 * (N - 2) + 2 times i believe (where N is the number of operands in the chain), since there will be a loop over each comparator in the "real" chained comparison parse, that's the first N then when compares are parsed again into a BinOp tree the first and last operand are evaluated once that's 2 and each of the N - 2 "inner" operands is evaluated twice. i think that's right...

that shouldn't be a big deal for parsing unless you have a ridiculously large chained comparison like 1000 compares or something.

jreback · 2013-07-09T02:33:56Z

if its easy ok

otherwise u can just detect it for now and disallow

cpcloud · 2013-07-09T02:52:33Z

@jreback just added the query method, could probably use some more tests

i think indexing should be off limits for now b/c of date strings, since df['1/1/2001'] could be division or a date string index, unless u can think of a way to get around that...

jreback · 2013-07-09T02:54:05Z

I agree

and I like query / eval better

cpcloud · 2013-07-09T03:51:27Z

@jreback resolution is now there. i opted for a dict-like object to be passed around and have its get method called at name resolution time. that way any object implementing a dict-like interface can be used, which is consistent with how Python itself does name resolution

cpcloud · 2013-07-09T16:35:22Z

@jreback can any of ur list above be checked off?

cpcloud · 2013-09-15T03:44:28Z

both good ideas!

dragoljub · 2013-09-15T19:42:49Z

@cpcloud

How would you even filter on any other column than c if you just pas a list of values?

I don't understand what you're asking.

I'm saying that df['[1,2] in c'] must filter on col c since there is no other column passed. However, it does not follow the pattern of df['a in b'] which filters on col a, the first object before the 'in'.

wesm · 2013-09-15T23:13:45Z

Hey guys,

Sorry it's taken me so long to get to reviewing this. This seems like a really good start on adding this kind of functionality to the library. I feel like we should treat the actual query syntax as experimental in 0.13.x while we solicit some more feedback and see how it plays out in actual use.

I have to say though, I'm kind of 👎 on pushing expression evaluation into DataFrame.__getitem__. It seems like a bit too much magic, but also something that can be added but not removed. I'd maybe relegate this functionality to an eval or query method for the time being and see how things go. You're saving yourself so much work in many cases by writing the expression that doing a method call doesn't seem that onerous.

I'm pretty stretched thin on taking this out for a serious spin but I'm fine with this being merged in and roped off as experimental until we have some time to see what the implications of various design decisions are. Feel free to disagree of course but I think starting a dialogue with the community about this sort of thing in general is a good idea (we've been talking about it for ages...glad you got around to actually doing it)

which is syntactically equivalent of '='

Also added tests for nan in and not in and disallowed ops like pd.eval('1 or 2') since that should be performed in regular Python

cpcloud · 2013-09-16T18:15:11Z

@dragoljub

I'm saying that df['[1,2] in c'] must filter on col c since there is no other column passed. However, it does not follow the pattern of df['a in b'] which filters on col a, the first object before the 'in'.

Yes. That's correct. I don't see why this is an issue. Why would df.query('[1, 2] in c') filter on anything but column c?

dragoljub · 2013-09-16T18:35:37Z

@cpcloud

Yes. That's correct. I don't see why this is an issue.

Not an issue; possibly confusing when comparing what col is filtered on with different usages of in.

Ex:
df['[1, 2] in c'] ==> Filter on col c (the object after 'in').
vs.
df['a in b'] ==> Filter on col a (the object before 'in').
vs.
df['c in [1, 2]'] ==> Still have to filter on col c? (the object before 'in').

cpcloud · 2013-09-16T18:44:29Z

@dragoljub

Just an FYI, as per @wesm's suggestion I've removed the df[expression] syntax.

Right now df.query('[1, 2] in c') will do the same thing as df.query('c in [1, 2]') (tuples too). These semantics are subject to change and of course I'm open to suggestions.

Otherwise, things work as if you had called left_side_of_in.isin(right_side_of_in).

dragoljub · 2013-09-16T18:46:43Z

@cpcloud

Sounds good. I think the query method will make for much easier reading. 👍

ENH: add expression evaluation functionality via eval

jreback · 2013-09-16T19:17:56Z

thanks to @cpcloud for implemented much of this. When it was started, this was a tiny little function!

jreback · 2013-09-16T19:18:41Z

475 comments! new record! who-hoo

jtratner · 2013-09-17T03:29:44Z

Awesome - great work @cpcloud :)

ghost assigned cpcloud Jul 9, 2013

cpcloud and others added 14 commits September 16, 2013 10:38

ENH: add new computation module and toplevel eval function

9042339

ENH: rewrite assignment operator as equal comparison for PyTables

e10a550

ENH: add Expr based terms for pytables

0b1b2fc

ENH: add automatic query via frame.__getitem__

b35406a

ENH: add boolean operators

9012e5e

ENH: add chained comparisons

4093cd9

ENH: add subscripting to eval expressions

4cbfdc2

ENH: added interpretation of 'in' in pytables Term syntax

3ad3c16

which is syntactically equivalent of '='

ENH: add local scoping with @lcl_variable syntax

0160225

ENH: add DataFrame.eval method

cca0173

ENH: add in, not in, and string/list query support

0d8997a

BUG: parsing of timedelta selection syntax needed correction

0469fe4

CLN: correct in and not in

b8a3ba3

Also added tests for nan in and not in and disallowed ops like pd.eval('1 or 2') since that should be performed in regular Python

CLN: remove __getitem__ queries

e55fa07

cpcloud added 2 commits September 16, 2013 14:23

CLN/DOC: remove __getitem__ from docs

50502fb

CLN: remove expr maybe_expression

ab60f4b

cpcloud added a commit that referenced this pull request Sep 16, 2013

Merge pull request #4164 from pydata/eval-3393

bf1b8ca

ENH: add expression evaluation functionality via eval

cpcloud merged commit bf1b8ca into master Sep 16, 2013

jreback mentioned this pull request Dec 16, 2013

Need syntax sugar for dataframe row selection #2035

Closed

jreback mentioned this pull request Dec 12, 2015

Option to suppress automatic conversion of tuples to MultiIndex #11799

Closed

Uh oh!

ENH: add expression evaluation functionality via eval #4164

ENH: add expression evaluation functionality via eval #4164

Uh oh!

Conversation

cpcloud commented Jul 8, 2013

eval Roadmap

PyTables

query method

DataFrame.eval method

Documentation

Engines

Parsers

Functions

Error Handling

Alignment

Scope

Miscellaneous

Operators

Tests

Syntax (Easier)

Syntax (More difficult)

Optimizations

Others

Uh oh!

jreback commented Jul 8, 2013

Uh oh!

cpcloud commented Jul 8, 2013

Uh oh!

jreback commented Jul 8, 2013

Uh oh!

jreback commented Jul 8, 2013

Uh oh!

jreback commented Jul 8, 2013

Uh oh!

cpcloud commented Jul 8, 2013

Uh oh!

jreback commented Jul 8, 2013

Uh oh!

cpcloud commented Jul 8, 2013

Uh oh!

jreback commented Jul 8, 2013

Uh oh!

jreback commented Jul 8, 2013

Uh oh!

cpcloud commented Jul 8, 2013

Uh oh!

cpcloud commented Jul 8, 2013

Uh oh!

jreback commented Jul 8, 2013

Uh oh!

cpcloud commented Jul 9, 2013

Uh oh!

cpcloud commented Jul 9, 2013

Uh oh!

cpcloud commented Jul 9, 2013

Uh oh!

jreback commented Jul 9, 2013

Uh oh!

cpcloud commented Jul 9, 2013

Uh oh!

jreback commented Jul 9, 2013

Uh oh!

cpcloud commented Jul 9, 2013

Uh oh!

cpcloud commented Jul 9, 2013

Uh oh!

cpcloud commented Jul 9, 2013

Uh oh!

jreback commented Jul 9, 2013

Uh oh!

cpcloud commented Jul 9, 2013

Uh oh!

jreback commented Jul 9, 2013

Uh oh!

cpcloud commented Jul 9, 2013

Uh oh!

jreback commented Jul 9, 2013

Uh oh!

cpcloud commented Jul 9, 2013

`eval` Roadmap

`query` method

`DataFrame.eval` method