Skip to content

use the latest datasketches-java-4.0.0#14334

Merged
AlexanderSaydakov merged 4 commits intomasterfrom
datasketches-4.0.0
May 28, 2023
Merged

use the latest datasketches-java-4.0.0#14334
AlexanderSaydakov merged 4 commits intomasterfrom
datasketches-4.0.0

Conversation

@AlexanderSaydakov
Copy link
Copy Markdown
Contributor

Description

This is to update the datasketches-java dependency to the latest 4.0.0 release with some API changes, and also to datasketches-memory-2.2.0

Some expectations in the unit tests were changed because in this version of datasketches-java all quantile sketches have "inclusive" mode by default. It is possible, if necessary, to use sketches in the "exclusive" mode explicitly to get the same functionality as before, but in my view it is not necessary, and most users won't notice any difference. It might happen to be important for someone in some special circumstances, but I doubt that. Perhaps even better would be to introduce a parameter, so that the mode could be chosen by the user, but that is a lot more work.

@gianm
Copy link
Copy Markdown
Contributor

gianm commented May 23, 2023

Looks pretty clean. Thanks for the patch!

Reading https://github.com/apache/datasketches-java/blob/master/src/main/java/org/apache/datasketches/quantilescommon/QuantileSearchCriteria.java, I don't immediately see situations where the difference between inclusive and exclusive would be meaningful. Do you know of situations where the difference is in fact important? That'd help understand the potential impact of the change.

I also wonder, if the old behavior was exclusive, then why change it? I mean, why not have things always be exclusive post this patch as well? Is inclusive fundamentally better in some way?

@AlexanderSaydakov
Copy link
Copy Markdown
Contributor Author

We hesitated for some time, but finally decided that inclusive mode is a bit better. This is a major version change with some API incompatibility, so, if ever, this is the right time for the change.
The difference is in the definition of rank. Suppose we are analyzing a distribution of some items exactly. The only thing required is a comparator of items ("less than" operator). We sort the items and define the rank of an item as the proportion of the whole distribution strictly less than that item in the exclusive mode or less than or equal to that item in the inclusive mode. It seems that the inclusive mode is more common in the literature and is slightly more well-behaved in some edge cases.
To illustrate the difference, suppose we have just one item. Its rank in inclusive mode is 1, but 0 in exclusive mode. But with millions of items the difference in rank will be tiny, and, most probably, negligible. If we do a histogram or partitioning, some items on the edges can fall into the bin or partition on the right or on the left depending on the mode.

@AlexanderSaydakov
Copy link
Copy Markdown
Contributor Author

It seems that some test failed, but I cannot reproduce that in my environment. I guess some expectations of rank should be adjusted, but that particular case is not exercised when I run "mvn test" for some reason.

@clintropolis
Copy link
Copy Markdown
Member

It seems that some test failed, but I cannot reproduce that in my environment. I guess some expectations of rank should be adjusted, but that particular case is not exercised when I run "mvn test" for some reason.

Looks like this is probably related to SQL compatible null handling mode, you can run locally by adding -Ddruid.generic.useDefautlValueForNull=false, it does not run by default but we do run both modes in CI.

The other thing that needs updated is the licenses.yaml file entries for the dependencies which have been updated https://github.com/apache/druid/blob/master/licenses.yaml#L3787

@AlexanderSaydakov
Copy link
Copy Markdown
Contributor Author

@clintropolis are you sure about the parameter? Tests pass with the parameter you suggested.

@AlexanderSaydakov
Copy link
Copy Markdown
Contributor Author

What is the right way to run "mvn test" in that mode manually?
Thank you.

@AlexanderSaydakov
Copy link
Copy Markdown
Contributor Author

AlexanderSaydakov commented May 26, 2023

@clintropolis oh, I see! you misspelled the parameter. I copied and did not notice.

@clintropolis
Copy link
Copy Markdown
Member

clintropolis commented May 28, 2023

@clintropolis oh, I see! you misspelled the parameter. I copied and did not notice.

oof, my mistake, sorry about that!

@AlexanderSaydakov AlexanderSaydakov merged commit 4131c0d into master May 28, 2023
@AlexanderSaydakov AlexanderSaydakov deleted the datasketches-4.0.0 branch May 28, 2023 05:19
@gianm gianm added this to the 27.0 milestone Jul 3, 2023
@cryptoe
Copy link
Copy Markdown
Contributor

cryptoe commented Oct 23, 2023

The datasketch version 4.xx spitting out weird splits for an ItemSketch<Long> when the function sketch.getPartitionBoundaries(numSplits)
When an item sketch with monotonically increasing number from 0 to 10 billion is chosen the distribution for 40 splits

0
250025928
500108450
750197955
1000017365
1250101575
1500191183
1750009306
2000098020
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
2147424894
9999999999

Similar item sketch in the 3.2.0 generates the correct boundaries when the function sketch.getQuantiles(40) is used.

 0
256423887
512799757
769177971
1025548880
1281929124
1538304632
1794673747
2051315187
2307669277
2564042244
2820409807
3076787762
3333187311
3589563105
3845931306
4102575213
4359042434
4615417711
4871794533
5128171055
5384552162
5640921934
5897299914
6153942953
6410313978
6666657948
6923038164
7179411194
7435787409
7692168813
7948547259
8205184939
8461560834
8717940855
8974345598
9230745974
9487159248
9743589767
9999999999

Pseudo code used for experiments

ItemsSketch<Long> sketch = ItemsSketch.getInstance( 32768, Comparator.naturalOrder());

    for(long i=0;i<10_000_000_000L;i++){
      if(i%100_000_000==0){
        System.out.println("reached "+i);
      }
      sketch.update(i);
    }

for(Long val:sketch.getQuantiles(40)){
      System.out.println(val);
    }
    

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants