Skip to content

Improve RegEx for "IF in SELECT statement" in translations.php#416

Merged
patrickebates merged 1 commit intoProjectNami:masterfrom
srutzky:patch-1
Dec 2, 2020
Merged

Improve RegEx for "IF in SELECT statement" in translations.php#416
patrickebates merged 1 commit intoProjectNami:masterfrom
srutzky:patch-1

Conversation

@srutzky
Copy link
Contributor

@srutzky srutzky commented Nov 6, 2020

This fixes #366
This fixes #415
This might also fix some others as I found additional problems with the RegEx that were not mentioned in either of those issues.

There are two major problems with the current code:

  1. the RegEx pattern used to find instances of MySQL's IF() function (not statement) is overly broad in several areas, allowing for both unexpected captures and capturing too much, and ...
  2. Both preg_match and preg_replace are being used incorrectly such that in the case of multiple IF() functions in the query, the values from the first match would be used to re-write all of the matches.

(There is a link at the bottom, in the "Solution" section, to a code testing side that has examples for most everything noted below)

Problem Uno

The current pattern is: (IF\s*\(*((.*),(.*),(.*))\)\s*(AS\s*\w*))

The intention is to capture something like: IF (condition, return_if_true, return_if_false) AS alias
and then convert it to: CASE WHEN condition THEN return_if_true ELSE return_if_false END AS alias

However, the pattern specifically looks for the following (case-insensitive):

  1. The letters IF (regardless of what precedes these two letters)
  2. zero or more whitespace
  3. zero or more left/open parenthesis
  4. zero or more anything
  5. a single comma
  6. zero or more anything
  7. a single comma
  8. zero or more anything
  9. a single right/closing parenthesis
  10. zero or more whitespace
  11. The letters AS
  12. zero or more whitespace
  13. zero or more word characters (A-Za-z0-9_)

Meaning, the only requirements are:

  1. The letters IF (regardless of what precedes these two letters)
  2. a single comma
  3. a single comma
  4. a single right/closing parenthesis
  5. The letters AS

Critical (functional) issues

  1. The "\(" is followed by "*" making it optional when it is in fact required ( https://dev.mysql.com/doc/refman/8.0/en/flow-control-functions.html#function_if ). The "\)" correctly has no quantifier (hence required).
  2. The beginning of the pattern starts with "IF" which does not restrict what comes before that, which is why both "notified" (Help needed modifying SQL-Parser-Engine #415) and "DATEDIFF" (Project Nami 2.2.2 - Weird SQL error... #366) matched (well, this AND not requiring the left paren). The beginning of this pattern needs to require a word-boundary via "\b" that won't match on word characters ( https://www.regular-expressions.info/refwordboundaries.html — selecting "PCRE" shows that this is available in ASCII mode, which is currently the case).
  3. The 3 operands of the "IF", separated by commas, currently allow for:
    1. any combination of the operands to not exist / be empty (e.g. "if(,,) as"), yet all are required (see previously linked documentation). The "+" quantifier needs to be used instead of "\*".
    2. grabbing too much text in cases where multiple valid matches exist, or even when there is another ") AS" following any two commas, which can be a function call, an expression, etc.

Non-critical (performance) issues

  1. Outer-most parens grouping entire pattern (currently group # 1) is an extraneous capture as index 0 of returned matches array is the entire match ( http://docs.php.net/manual/en/function.preg-match.php )
  2. The next left-paren grouping all three "(.*)" together (currently group # 2) is also extraneous as it is never used.

Problem Dos

The current RegEx implementation is as follows:

$pattern = '/(IF\s*\(*((.*),(.*),(.*))\)\s*(AS\s*\w*))/is';
preg_match($pattern, $query, $limit_matches);
if (count($limit_matches) == 7) {
	$case_stmt = ' CASE WHEN ' . $limit_matches[3] . ' THEN ' . $limit_matches[4] . ' ELSE ' . $limit_matches[5] . ' END ' . $limit_matches[6];
	$query = preg_replace($pattern, $case_stmt, $query);
}

This is an improper pattern for using RegEx functions:

  1. preg_match only returns the first match (from the given starting point, and since no alternate starting point is passed in, it starts at the beginning). If there is at least one match, the code will enter the IF block and concatenate the desired translation (i.e. $case_stmt) based on the values from that first match. In cases where there are more than one match, the subsequent matches are never even searched for. To get multiple matches you need to use preg_match_all which returns an array of matches that can be looped through to deal with individually. HOWEVER, using preg_match_all here won't fix anything because of issue # 2 ...
  2. preg_replace does not only deal with the first match: it replaces all matches. Meaning, even if one did loop over all matches using preg_match_all, the first call to preg_replace would still replace all matches with the string concatenated with values from the first match. The proper way to use a RegEx replace (this is not specific to PHP) is to use backreferences to directly references capture groups in the replacement text. Meaning, rather than concatenating values from preg_match into a string to use a static replacement string, you can skip the entire preg_match step and build what should be the translated value in the replacement text, and that will be applied per match. In this case, the replacement string would be: ' CASE WHEN $3 THEN $4 ELSE $5 END $6'. So, the end result is that we get rid of the preg_match and the $case_stmt variable + concatenation and are left with only 2 lines: the setting of $pattern, and calling preg_replace.

Solution

Part A

I simplified the RegEx pattern to:

  1. capture only IF statements (not something like "datedIFf")
  2. capture just the IF statement (not something like "IF (a, b, c) AS alias, d, function(e) as f")
  3. capture multiple IF statements in the query

I made the following changes to the RegEx pattern:

  1. Removed the outer-most set of parens as it was an extraneous full-match capture that is always present as first / 0 index value of any returned match (and it was also unused).
  2. Added \b to require a word-boundary before the IF
  3. Added a capture around the I in IF to re-use that character in the replacement string (to match the case, which might be a bit over-the-top, but some people are super picky about the SQL 😉 )
  4. Removed the * just after the \(, thus requiring the left/opening parenthesis
  5. Removed the capture around the set of three operands within the parenthesis (was not being used)
  6. Removed the capture around each of the individual operands within the parenthesis (they are no longer being used individually)
  7. Changed quantifier from * to + for each of the three operands within the parenthesis, thus requiring each one.
  8. Added the ? after the + quantifier for each of the three operands within the parenthesis, so that they are not "greedy" and now should only take the minimum amount of characters needed to form a valid match.
  9. Added ?: to the final capture group to make it non-capturing (better for memory / performance) as it won't be reference or used (and it could probably not even be a capture group in the first place, but will leave as-is for now).

Part B

I made the following changes to the code:

  1. Removed call to preg_match
  2. Removed setting of $case_stmt
  3. Added setting of $replacement
  4. Removed if (count($limit_matches) == 7)
  5. Swapped out $case_stmt for $replacement in preg_replace

Part C

Also, SQL Server 2012 (the lowest version supported by Project Nami) introduced the "IIF()" function, which is identical to MySQL's "IF()" function, so rather than chopping up the IF function to reconstitute as a CASE WHEN, simply prefix the "IF()" function with an extra "I". Easy peasy. 😸

Examples

running in PHP 7.3.5 can be found at: https://ideone.com/bhisge

Misc.

Just to have this documented in case it comes up later:

  1. Also, whitespace was not required after the "AS", yet it is required when the column alias is not delimited. However, changing the quantifier after the "AS" from "*" to "+" doesn't entirely fix the issue as the two possible optional delimiters — single quotes and double quotes — are not accounted for when using "\w". Unfortunately, accounting for all of the possible variations of how column aliases can be specified would complicate the RegEx pattern, so if the current pattern correctly captures everything it needs to, then no need to complicate the pattern as that might have unintended consequences and would need to be more thoroughly tested. The reason why this has not shown itself to be a problem thus far is most likely due to the "\w" also using the "*" quantifier, making the alias itself entirely optional (well, outside of the "AS"), which is good in terms of the current pattern considering that an aliases being delimited would invalidate the entire pattern if the "\w" was using the "+" quantifier. Of course, if the "AS" exists, then there should be an alias to go along with it. Unfortunately, using "\w" won't work given that delimited aliases can contain most, if not all, characters (not sure what the restrictions are in MySQL, but SQL Server allows all characters except U+0000 and U+FFFF).
  2. What might be a bigger problem is that the pattern currently requires the "AS", but neither MySQL nor MS SQL Server require the "AS" for a column alias, so there might be IF() functions in SELECT statements that are not matching due to not using "AS" for a column alias.

Some examples of the acceptable variations (all have been tested on MySQL 5.6)

select if(1 = 1, 'a', 'b')as"bob1"
select if(1 = 1, 'a', 'b')"bob2"
select if(1 = 1, 'a', 'b')'bob3'
select if(1 = 1, 'a', 'b')bob4
select if(1 = 1, 'a', 'b')as4bob5

ERROR:
select if(,'a', "b")as'g'


Take care,
Solomon...
https://SqlQuantumLift.com/
https://SqlQuantumLeap.com/
https://SQLsharp.com/

(Full explanation will be in the PR)

RegEx pattern for "IF in SELECT statement" is overly broad in several areas, allowing for both unexpected captures and capturing too much.
I simplified the RegEx pattern to restrict it to:
1. capturing only IF statements (not something like "datedIFf")
2. capturing just the IF statement (not something like "IF (a, b, c) AS alias, d, function(e) as f")
3. capturing multiple IF statements in the query

Also, SQL Server 2012 (the lowest version supported by Project Nami) introduced the "IIF()" function, which is identical to MySQL's "IF()" function, so rather than chopping up the IF function to reconstitute as a CASE WHEN, simply prefix the "IF()" function with an extra "I". Easy peasy. 😸
@srutzky srutzky changed the title Improve RegEx pattern for "IF in SELECT statement" in translations.php Improve RegEx for "IF in SELECT statement" in translations.php Nov 6, 2020
@patrickebates
Copy link
Member

Thanks for the work on this. As you may have noticed from the comments at the top of the file, the bulk of this code was written for use with a previous attempt to simply extend wpdb. It was written for SQL 2008 R2 or earlier, and attempted to translate all queries as they were executed against the DB. We only included it in Project Nami in an attempt to provide support for plugins not using the WP query objects as intended.

We have made some modifications over the years to support specific use cases, primarily for plugins we encountered ourselves or which had obviously solutions. We've also seen a handful of submissions such as this, and welcome them when we get them.

For our records here, can you tell me if these updates are to support a specific plugin or theme you are using? Or did you encounter these issues with some in-house code you needed to use?

@srutzky
Copy link
Contributor Author

srutzky commented Nov 6, 2020

@patrickebates Ah, ok. So in that case, just for the record, there's a slight change to what I've committed so far in order to work for 2008 R2 and older:

  1. $pattern should be: '\bIF\s*\((.+?),(.+?),(.+?)\)\s*(AS\s*\w*)'
  2. $replacement should be: ' CASE WHEN $1 THEN $2 ELSE $3 END $4'

No, this update is not to support any particular plugin, nor have I encountered issues related to this specific code. I was actually trying to hunt down the source of a problem that I am having on a site that uses Project Nami, had looked through the issues and PRs first to make sure that my issue wasn't already documented, and then came across this code and remembered seeing 2 issues dealing with "if" rewrites. Since I could see what the problem was, I figured I could help a little before moving on to hopefully finding the cause of the issue I'm running into.

I will open a new issue for what might be a simple question, or at least to point me in the right direction.

@patrickebates
Copy link
Member

No need to make this compatible with 2008 R2. As you mentioned. PN only supports 2012 and forward. I just believe that there will be similar areas of this file which probably could be improved like you have done if they get looked at with an eye for newer versions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Help needed modifying SQL-Parser-Engine Project Nami 2.2.2 - Weird SQL error...

2 participants