adding new post aggregators for test statistics to druid-stats extension#4532
adding new post aggregators for test statistics to druid-stats extension#4532jon-wei merged 9 commits intoapache:masterfrom
Conversation
jihoonson
left a comment
There was a problem hiding this comment.
Hi @chunghochen, thanks for the nice work!
I left a couple of comments. Also, please merge the master branch and check the build failure due to invalid formats.
| { | ||
| "type": "zscore2sample", | ||
| "name": "<output_name>", | ||
| "fields": [<count 1 (post_aggregator1)>, <sample size 1 (post_aggregator2)>, <count 2 (post_aggregator3)>, <sample size 2 (post_aggregator4)>] |
There was a problem hiding this comment.
It looks that the number of arguments is always 4 and their order matters. If so, I think it would be better to receive 4 individual arguments instead of a fields list because it is less error-prone and intuitive which arguments are needed. What do you think?
There was a problem hiding this comment.
Yes, fields in this case required 4 arguments and the order matters.
However, I don't think it is more error prone or less intuitive, than using individual arguments.
I think at this juncture, fields as input parameters fit well since
- It is aligned and consistent with the way fields are used in Druid. Pl see http://druid.io/docs/latest/querying/aggregations.html and http://druid.io/docs/latest/querying/post-aggregations.html
- As we might fine-tune, or evolve the input parameters so that the stat functions can be most useful and powerful, keep them in fields at this juncture should be good, so it can be continuously evolved / fine-tuned without dramatically changing the input signature.
There was a problem hiding this comment.
However, I don't think it is more error prone or less intuitive, than using individual arguments.
With the current design, users should remember the meaning of arguments according to their appearance order in the fields list or see the document every time. What's worse, users are difficult to figure out even though they input the arguments in wrong order (e.g., [<count 2 (post_aggregator3)>, <sample size 2 (post_aggregator4)>, <count 1 (post_aggregator1)>, <sample size 1 (post_aggregator2)>]).
If they should be specified individually, users, at least, cannot input arguments of wrong number or in wrong order.
It is aligned and consistent with the way fields are used in Druid.
In this case, I think the usability (stated above) matters rather than consistency. There is no point for arbitrary number of arguments.
As we might fine-tune, or evolve the input parameters so that the stat functions can be most useful and powerful, keep them in fields at this juncture should be good, so it can be continuously evolved / fine-tuned without dramatically changing the input signature.
If you have any plan, please let us know. I guess even changing input signature is not a big problem because arguments can be easily added and removed (thanks to json).
There was a problem hiding this comment.
I agree with @jihoonson here -- a fields array makes sense if the fields are all treated basically the same, and if arbitrary numbers of fields are allowed, but in this case neither is true. The fields mean different things semantically, and the number is fixed, so the API would be nicer if the four arguments are given meaningful names.
It would still be easy to evolve the API with named parameters, and in fact I think it's easier, since the parameters are named rather than positional.
| <dependency> | ||
| <groupId>org.apache.commons</groupId> | ||
| <artifactId>commons-math3</artifactId> | ||
| <version>3.6.1</version> |
There was a problem hiding this comment.
Version is not needed here because the parent module specifies it.
There was a problem hiding this comment.
Great catch! I overlooked it when porting to the latest version of Druid.
|
Also, please add a doc for new post aggregators. |
gianm
left a comment
There was a problem hiding this comment.
Thank you for the contribution @chunghochen. I left some comments on the code.
I didn't review the math, just reviewed the Druid aspects of the code.
| { | ||
| "type": "zscore2sample", | ||
| "name": "<output_name>", | ||
| "fields": [<count 1 (post_aggregator1)>, <sample size 1 (post_aggregator2)>, <count 2 (post_aggregator3)>, <sample size 2 (post_aggregator4)>] |
There was a problem hiding this comment.
I agree with @jihoonson here -- a fields array makes sense if the fields are all treated basically the same, and if arbitrary numbers of fields are allowed, but in this case neither is true. The fields mean different things semantically, and the number is fixed, so the API would be nicer if the four arguments are given meaningful names.
It would still be easy to evolve the API with named parameters, and in fact I think it's easier, since the parameters are named rather than positional.
| import java.util.Set; | ||
|
|
||
| /** | ||
| * Created by chunchen on 4/5/17. |
There was a problem hiding this comment.
Please don't include author information in the code. Git will remember it.
| public PvaluefromZscorePostAggregator( | ||
| @JsonProperty("name") String name, | ||
| @JsonProperty("field") PostAggregator field | ||
| ) { |
There was a problem hiding this comment.
Brace style isn't quite right here (and elsewhere in this file) so please apply Druid code style linked off https://github.com/druid-io/druid/blob/master/CONTRIBUTING.md.
| } | ||
|
|
||
| @Override | ||
| public String toString() { |
There was a problem hiding this comment.
Consider overriding equals and hashCode as well.
|
|
||
| @Override | ||
| public PostAggregator decorate(Map<String, AggregatorFactory> aggregators) { | ||
| return this; |
There was a problem hiding this comment.
This should recursively decorate the PostAggregator field, for example:
return new PvaluefromZscorePostAggregator(name, Iterables.getOnlyElement(Queries.decoratePostAggregator(Collections.singletonList(field), aggregators));That way, any nested post-aggregators that need decoration will receive it.
| * @param the success count of population 2 | ||
| * @param sample size of population 2 | ||
| */ | ||
| private double zScoreTwoSamples(Double s1count, Double p1count, Double s2count, Double p2count) { |
There was a problem hiding this comment.
These might as well be double rather than Double as they are assumed to be non-null.
| return (convertRate1 - convertRate2) / | ||
| Math.sqrt((convertRate1 * (1 - convertRate1) / p1count) + | ||
| (convertRate2 * (1 - convertRate2) / p2count)); | ||
| } catch (Exception ex) { |
There was a problem hiding this comment.
What kind of Exception can be thrown here? Could this be made more specific?
Also, is it possible to use some other mechanism of detecting whatever problem this exception is detecting? Exceptions are quite slow compared to other forms of flow control and so I'm concerned about using them in a post-aggregator compute function.
| import java.util.Map; | ||
|
|
||
| /** | ||
| * Created by chunchen on 4/23/17. |
There was a problem hiding this comment.
Please don't include author tags.
| try { | ||
| NormalDistribution normDist = new NormalDistribution(); | ||
| return normDist.cumulativeProbability(x); | ||
| } catch (Exception ex) { |
There was a problem hiding this comment.
What kind of Exception can be thrown here? Could this be made more specific?
Also, is it possible to use some other mechanism of detecting whatever problem this exception is detecting? Exceptions are quite slow compared to other forms of flow control and so I'm concerned about using them in a post-aggregator compute function.
| } | ||
|
|
||
| @Override | ||
| public String toString() { |
There was a problem hiding this comment.
Consider overriding equals and hashCode as well.
|
Thank you @gianm and @jihoonson for the reviews and comments!!
|
update with latest changes from upstream
…erge upstream/master
|
I can't see the errors when clicking the Details link of
|
|
@chunghochen, hmm i'm not sure why the job was canceled. Maybe there was a problem in travis. I restarted the job. |
| p1 = (successCount1) / (sample size 1) | ||
|
|
||
| For example, | ||
| p2 = (successCount2) / (sample size 2) |
There was a problem hiding this comment.
It would be better to change sample size 1 and sample size 2 to camel case for consistency and better readability.
There was a problem hiding this comment.
As we are doing two sample z-test to calculate the zscore, from sample 1 and sample 2.
Therefore, sample 1 size and sample 2 size are more meaningful in this context.
Will change sample2size to sample2Size for consistency.
| "name": "<output_name>", | ||
| "fields": [<count 1 (post_aggregator1)>, <sample size 1 (post_aggregator2)>, <count 2 (post_aggregator3)>, <sample size 2 (post_aggregator4)>] | ||
| "successCount1": <post_aggregator> success count of sample 1, | ||
| "sample1Size": <post_aggregaror> sample 1 size, |
There was a problem hiding this comment.
Is sampleSize1 better? Because successCount is postfixed by a number.
There was a problem hiding this comment.
As described
successCount1: success count of sample 1
sample1size: sample 1 size
In short, we are doing two sample z-test to calculating the zscore, from sample 1 and sample2.
It appears sampleSize1 deviates from its original statistical meaning.
There was a problem hiding this comment.
I think people won't be confused if the sampleSize is postfixed by the number and we give a proper description. Two different types of numbering for arguments make people confused and being prone to mistakes.
| # Test Stats Aggregators | ||
|
|
||
| Incorporates test statistics related aggregators, including z-score and p-value. Please refer to [https://www.paypal-engineering.com/2017/06/29/democratizing-experimentation-data-for-product-innovations/](https://www.paypal-engineering.com/2017/06/29/democratizing-experimentation-data-for-product-innovations/) for background and details. | ||
| Incorporates test statistics related aggregators, including z-score and p-value. Please refer to [https://www.paypal-engineering.com/2017/06/29/democratizing-experimentation-data-for-product-innovations/](https://www.paypal-engineering.com/2017/06/29/democratizing-experimentation-data-for-product-innovations/) for math background and details, although its input spec and example are out of date. |
There was a problem hiding this comment.
although its input spec and example are out of date.
I'm not sure this is a necessary statement because if the linked document is really out of date and irrelevant to the description here, we usually change it to more proper link.
Also, I think it's better to add links describing the definitions of z-score and p-value like wikipedia links because that blog post describes a sort of bigger concept how these new post aggregators are used in your case. Even the blog post links wikipedia for the definitions of z-score and p-value.
There was a problem hiding this comment.
Hi Jihoon,
The purpose of the blog post is to explain and connect all the related concepts to derive how post aggregators do the calculations. IOW, it is to bridge the gap between what wikipedia states and what the post aggregators do.
Not sure if you're asking to remove "although its input spec and example are out of date" in test-stats.md?
Please advise. Many thanks!
There was a problem hiding this comment.
The purpose of the blog post is to explain and connect all the related concepts to derive how post aggregators do the calculations. IOW, it is to bridge the gap between what wikipedia states and what the post aggregators do.
If so, I think it's better to remove "although its input spec and example are out of date." statement.
There was a problem hiding this comment.
okay, if you insist.
| "successCount1" : | ||
| { "type" : "constant", | ||
| "name" : "successCountPopulation1", | ||
| "name" : "successCountFromPopulation1Sample", |
There was a problem hiding this comment.
Same here. Looks better to put the number at last rather than in the middle.
There was a problem hiding this comment.
We are calculating two-tailed p-value from zscore.
There was a problem hiding this comment.
I think it is important to have the naming represents its statistical meaning.
In this case, zscore2sample really means we are doing two sample z-test to calculate the zscore.
Further, we are calculating two-tailed P-value from the z-score.
Naming calls for compressing a few statistical concepts together. It is more than mechanically placing the number in the middle or at the end of the name. We need to first describe the name in statistical terms and then try to compress it to a name which makes sense.
There was a problem hiding this comment.
Ok, thanks for explanation.
| */ | ||
| @JsonTypeName("zscore2sample") | ||
| public class ZtestPostAggregator implements PostAggregator { | ||
| public class ZtestPostAggregator implements PostAggregator |
There was a problem hiding this comment.
Just ouf of curiosity, why is the class name different from its type name?
There was a problem hiding this comment.
The class name follows the naming convention of all Druid PostAggregators.
The Json Type Name just tried to be brief and meaningful.
There was a problem hiding this comment.
The class name tries to be brief while following the naming convention of all Druid postAggregators.
The Type Name just tries to be meaningful in the context of test statistics.
There was a problem hiding this comment.
I mean, why the class name can't be simply ZScore2SamplePostAggregator? This is maybe a silly question because I'm not familiar with stats.
There was a problem hiding this comment.
There are various test stats can be calculated based on z-test. It' not clear whether we should have one class for every variation of z-test based stats. For example, it might make more sense to have 1-sample and 2-sample zscore in one class, rather than in two different classes. In other words, I'm not thinking in terms of naming convention, but rather more along the line of extensibility and componentization.
There was a problem hiding this comment.
It sounds good to me, but if so, the json type name should also be an extendable name, right?
There was a problem hiding this comment.
The JsonTypeName (zscore2sample) is intended to be brief and meaningful to the user, so it is different from the class name ZtestPostAggregator. Not sure what you mean by "the json type name should also be an extendable name"? But if you mean we should make the type name in a way so that we can easily add a new type in a consistent manner; then yes. JsonTypeName is designed from the user perspective and is exposed to the user, while the class name is to do with implementations. The rationales behind these namings are different.
| private final PostAggregator successCount1; | ||
| private final PostAggregator sample1Size; | ||
| private final PostAggregator successCount2; | ||
| private final PostAggregator sample2Size; |
There was a problem hiding this comment.
get() method for these methods are missing.
There was a problem hiding this comment.
Also please add a json serde test.
There was a problem hiding this comment.
These two postAggregators implement PostAggregator Interface. Get() is only needed for Aggregator Interface. I failed ti see any postAggregator implementing get() method.
There was a problem hiding this comment.
Not sure why json serde is needed in this case? These two postAggregators didn’t persist into Druid by per se. And there is no custom Serde either.
One can think of them of glorified arithmetic post aggregators to calculate some test statistics.
There was a problem hiding this comment.
These two postAggregators implement PostAggregator Interface. Get() is only needed for Aggregator Interface. I failed to see any postAggregator implementing get() method.
There was a problem hiding this comment.
Not sure why json serde is needed in this case? These two postAggregators didn’t persist into Druid by per se. And there is no custom Serde either.
One can think of them of glorified arithmetic post aggregators to calculate some test statistics.
There was a problem hiding this comment.
Looks that my comments were confused. Sorry for lack of details. I meant, the series of methods of getSuccessCount1(), getSample1Size(), getSuccessCount2(), and getSample2Size() should be added and annotated with @JsonProperty because they are needed for JSON serialization. Unit tests are needed to avoid such mistakes.
There was a problem hiding this comment.
Thanks for clarifying this! will do as requested.
| @@ -46,79 +48,92 @@ | |||
| http://facweb.cs.depaul.edu/sjost/csc423/documents/test-descriptions/indep-z.pdf | |||
There was a problem hiding this comment.
This comment should be javadoc.
| @JsonProperty("sample1Size") PostAggregator sample1Size, | ||
| @JsonProperty("successCount2") PostAggregator successCount2, | ||
| @JsonProperty("sample2Size") PostAggregator sample2Size | ||
|
|
There was a problem hiding this comment.
Do you mean to remove line 67?
I'm using Druid 11 checkstyle.xml and IntelliJ to do reformatting. And it passes the stylecheck.
According to IntelliJ's interpretation continuation indent is 4.
Just like to make sure I understand what you mean. :-)
There was a problem hiding this comment.
Do you mean line#67? Or if there are more unnecessary whitespace you're referring to?
Just like to make sure we are on the same page. :-)
There was a problem hiding this comment.
Yes, please remove this line.
| + fields | ||
| + "}"; | ||
| "name='" + name + '\'' + | ||
| ", successCount1" + successCount1 + |
There was a problem hiding this comment.
Please insert '=' between string and value.
| ) | ||
| { | ||
| Preconditions.checkNotNull(name, "Must have a valid, non-null post-aggregator"); | ||
| Preconditions.checkNotNull(zScore, "Must have a valid, non-null post-aggregator"); |
There was a problem hiding this comment.
Also it's better to check zScore is an intance of ZtestPostAggregator.
There was a problem hiding this comment.
One can calculate the z-score oneself without using ZtestPostAggregator. Didn't check whether it is an instance of ZtestPostAggregator as we should keep it flexible and customizable if the user opts for it.
There was a problem hiding this comment.
Actually, we want to keep it flexible and open.
For example, one can calculate it's own z-score for example if needed, and just feed its zscore to calculate p-value.
| } | ||
|
|
||
| @Override | ||
| public int hashCode() |
There was a problem hiding this comment.
Please implement equals() method either. This will remove the potential problem when working with collections.
| } | ||
|
|
||
| @Override | ||
| public int hashCode() |
There was a problem hiding this comment.
Please implement equals() method either. This will remove the potential problem when working with collections.
|
Not sure why the checks failed?
Also I can't see ihoonson requested changes, as I clicked "See review" link but failed to see new review comment. Please help/advise. |
|
We've been having issues with Travis CI recently, the third test there often fails with a timeout, I've reset that test. To see the TeamCity report, you can click on "Login as Guest". I checked the error report there and they don't seem related to this patch. |
|
Moving this to 0.11.1, as 0.11.0 is feature frozen now |
|
@chunghochen Can you try merging master to your branch? The TeamCity failures shown in the report were fixed by #4809 |
|
@jon-wei
Not sure what git command I should do to merge master to my branch? |
You can do: |
|
@jon-wei |
|
Hopefully this pull request can be merged in when the codeline is open? :-) |
|
@jihoonson Not sure where we are on this pull request? |
jihoonson
left a comment
There was a problem hiding this comment.
@chunghochen I think it's almost done. Left some trivial comments. In addition, please comment on #4532 (comment).
|
|
||
| pvaluefromZscorePostAggregator = new PvaluefromZscorePostAggregator("pvalue", zscore); | ||
|
|
||
| System.out.print("zscore = " + zscore + "\n"); |
There was a problem hiding this comment.
Please remove debug code.
| -1783.8762354220219 | ||
| ))).doubleValue(); | ||
|
|
||
| System.out.print("pvalue = " + pvalue + "\n"); |
There was a problem hiding this comment.
Please remove debug code.
| /* Assert P-value is positive and very small */ | ||
| Assert.assertTrue(pvalue >= 0 && pvalue < 0.00001); | ||
|
|
||
| System.out.print(pvaluefromZscorePostAggregator.toString()); |
There was a problem hiding this comment.
Please remove debug code.
| zscore, 0.0001 | ||
| ); | ||
|
|
||
| System.out.print(ztestPostAggregator.toString()); |
There was a problem hiding this comment.
Please remove debug code.
| @Override | ||
| public int hashCode() | ||
| { | ||
| int result = name != null ? name.hashCode() : 0; |
There was a problem hiding this comment.
Please remove unnecessary null check for name. It's already tested in the constructor.
| @Override | ||
| public int hashCode() | ||
| { | ||
| int result = name != null ? name.hashCode() : 0; |
There was a problem hiding this comment.
Please remove unnecessary null check for name. It's already checked in the constructor.
|
@jihoonson and @jon-wei, can you guys please help? Not sure why it fails on both checks? continuous-integration/travis-ci/pr — The Travis CI build failed Please help!! |
|
The CI test failed again because of an unrelated test timeout (happens to a lot of PRs), I've restarted the build for that test suite. I believe the TeamCity build is failing because some PRs that introduced failures were committed without fixing the TeamCity issues, future builds for new PRs are still picking up the unaddressed inspection issues, your PR is fine and we can ignore them since none of the files are related to your changes. |
|
Thanks so much for looking into this! And confirming my PR is fine. :-) |
Great! Hope my latest commits address your concerns? I also commented on #4532 (comment). |
|
@jihoonson @jon-wei |
Please refer to https://www.paypal-engineering.com/2017/06/29/democratizing-experimentation-data-for-product-innovations/ for more context and details.