-
-
Notifications
You must be signed in to change notification settings - Fork 19.4k
ENH: add expression evaluation functionality via eval #4164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
ok....was able to pull/push just fine now.... maybe the original rebase caused me to look at something different? |
|
ok so no rebasing until later...agreed? |
this eval-3393 push to eval-3393 is good (was not there previously). maybe having a different local name was a problem somehow |
|
why don't you try rebasing those last 2 commits into one? |
|
see what happens |
|
ok let's see... |
|
i picked up the change..... so I guess as we don't rebase at the same time or while one is working, maybe ok |
|
going to be hard to coordinate that though |
|
in any event....let's just push for now...can fix commits later you are welcome to create |
|
not as long as we push/pull.....just don't rebase |
|
right yeah that's what i meant |
|
should the numexpr engine allow the |
|
i would make |
|
got a throwing a |
|
hm maybe query for now and indexing later. seems to break a ton of stuff i don't have it down just yet what the rules should be for indexing with something that could be an expression |
|
actually only a few tests were broken by indexing, my refactor seems to have broken the other things |
|
ok.....lmk when I should take a look....I am pretty much done for now....(maybe will start on docs soon) |
|
thanks for the pytables stuff! |
|
it was built on a nice base! |
|
ah shucks, thanks :) |
|
i will support chained comparisons--they aren't a big deal to add...although things will be evaluated twice by e.g., more generally
so inner terms will be evaluated 2 * (N - 2) times something like reducer = lambda x, y: BinOp(op=BitAnd, operands=[Compare(op=node.op, comparators=[x, y])])
reduce(reducer, node.comparators) |
|
i wish github had latex |
|
are u doing this transform directly? I don't think me will mind the extras prob optimize them away anyhow |
|
yes i would do it in the base class essentially build a new node and then visit it similar idea to assignment. compare nodes will be passed over that shouldn't be a big deal for parsing unless you have a ridiculously large chained comparison like 1000 compares or something. |
|
if its easy ok otherwise u can just detect it for now and disallow |
|
@jreback just added the i think indexing should be off limits for now b/c of date strings, since |
|
I agree and I like query / eval better |
|
@jreback resolution is now there. i opted for a |
|
@jreback can any of ur list above be checked off? |
|
both good ideas! |
I'm saying that df['[1,2] in c'] must filter on col c since there is no other column passed. However, it does not follow the pattern of df['a in b'] which filters on col a, the first object before the 'in'. |
|
Hey guys, Sorry it's taken me so long to get to reviewing this. This seems like a really good start on adding this kind of functionality to the library. I feel like we should treat the actual query syntax as experimental in 0.13.x while we solicit some more feedback and see how it plays out in actual use. I have to say though, I'm kind of 👎 on pushing expression evaluation into I'm pretty stretched thin on taking this out for a serious spin but I'm fine with this being merged in and roped off as experimental until we have some time to see what the implications of various design decisions are. Feel free to disagree of course but I think starting a dialogue with the community about this sort of thing in general is a good idea (we've been talking about it for ages...glad you got around to actually doing it) |
which is syntactically equivalent of '='
Also added tests for nan in and not in and disallowed ops like
pd.eval('1 or 2')
since that should be performed in regular Python
Yes. That's correct. I don't see why this is an issue. Why would |
Not an issue; possibly confusing when comparing what col is filtered on with different usages of in. Ex: |
|
Just an FYI, as per @wesm's suggestion I've removed the Right now Otherwise, things work as if you had called |
|
Sounds good. I think the query method will make for much easier reading. 👍 |
ENH: add expression evaluation functionality via eval
|
thanks to @cpcloud for implemented much of this. When it was started, this was a tiny little function! |
|
475 comments! new record! who-hoo |
|
Awesome - great work @cpcloud :) |
closes #3393
closes #2560
cc @jreback
evalRoadmapPyTables
allows natural syntax for queries, with in-line variables allowed
some examples
"ts>=Timestamp('2012-02-01') & users=['a','b','c']"['A > 0']'(index>df.index[3] & index<=df.index[6]) | string="bar"'Todos:
|and~operatorsExprbase class to opt in/out of allowed operations(e.g.
Numexprdoesn't want to allow certain operations, butPyTablesneeds them,so maybe define in the base class for simplicity, and just not allow them (checked in
visit),which right now doesn't allow a
generic_visitor)ops.Valuebe equiv ofops.Constant?querymethodquerydocstring to show use of__getitem__and then add toindexing.rstquerydocstringDataFrame.evalmethodDocumentation
inandnot inevaldocstringx = 3; y = 2; pd.eval('x // y', engine='python')is many times slower than the same operation in actual Python.Engines
pythonengine for testing (basically a restricted eval, should make sure nothing dangerous a laeval('os.system("sudo rm -rf /")')) is possible herenumexprengineParsers
pythonparser is currrently not failing on complex bool ops involving subscripting like it should (potentially couldevalthe compiled AST instead of dispatching to function calls)pandasparser should handle subscripting, i.e.,df.query(...) == pd.eval('df[...]')pandasparserpythonparserpytablesengine and parser (internal)Should python be the default engine for small frames?
Functions
isexprevalto top-level pandasError Handling
expr.py. should fail fast.Alignment
Scope
@asyntax for referencing local variables with the same name as a columnExprobjects to be passed toeval(Exprwill pull in the local and global variables from the calling stack frame)Miscellaneous
MultiIndexqueriesstrops (implemented usingisin)Timestampanddatetimefunctionalitypd.eval('1')orpd.eval('df')df + df2should return aDataFrameifdfanddf2are bothDataFrameobjects.ExpressiontoExprtruedivkeyword toExpr, default toTrueand,orandnotaliases for the operators&,|and~, respectivelydf.AOperators
%**Tests
PeriodIndexobjects are untested at large (a few scattered tests for basic things)df['@awesome_domain'],df['df["$10"]']align.pyonce BUG: fix period index object instantiation when joining with self #4379 is mergedDataFramewith aPeriodIndexin the columnsDatetimeIndexobjects//operator works as it should (it works, but there may be edge cases that I haven't though of yet)//numexpronly supports scalar evaluation here: variables are not allowed so this will always eval in python space (half-assed it a bit in the tests here need to fix that)**associativity issue in testingSyntax (Easier)
in,not in(was thinkingimplemented asa in b->b.isin(a)anda not in b->not a in b->~b.isin(a))Syntax (More difficult)
sumandprod)numexprsupportsfunction calls(we're not building a Python interpreter here 😉)Optimizations
There's a quite a bit of recursion going on here, where maybe something stack-based might be more efficient. For example, an expression like
df['f in g and a < b < (c ** 2 + b ** 2 - a ** 3) and d in b not in e']will make 3 recursive calls to eval (not including the top level call). One for the firstinexpression and two for thein/not inchain. A stack-based implementation of the parser (instead of recursively visiting nodes) could possible shave off a few milliseconds. Not sure if this is worth it though.Others
alignment forGeneric alignment function and new module core.align #4336NDFramereplace currentevaluatefunction