[ENHANCEMENT] Convert dedup pushdown to composite + top_hits #4797

@LantaoJin

Is your feature request related to a problem?
#3972 converts the dedup command to a collapse search when pushdown is enabled. Benchmarking shows that collapse search performs worse than composite + top_hits or terms + top_hits. Here are the benchmark results:

  1. terms + top_hits: 2000ms
  2. composite + top_hits (missing_bucket false/true): 3500ms/4500ms
  3. collapse: 13000ms

Additionally, (1) and (2) can support dedup on bool, date, and other field types as well as script expressions, whereas (3) only works on keyword/numeric fields. (2) can also support search_after in the future.

What solution would you like?
Refactor dedup pushdown to approach (2), composite + top_hits.
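As a rough illustration of approach (2), the request body for a dedup on one or more fields could be assembled as below, with one row per bucket read back from the top_hits sub-aggregation. This is only a sketch in Python, not the plugin's implementation: the function names `build_dedup_body` and `rows_from_response` and the aggregation names `dedup` and `first_hit` are hypothetical, and it assumes dedup keeps a single document per key.

```python
def build_dedup_body(fields, page_size=10000, missing_bucket=False, after=None):
    """Build a composite + top_hits search body (hypothetical helper)."""
    # One composite "terms" source per dedup field, named after the field.
    sources = [
        {f: {"terms": {"field": f, "missing_bucket": missing_bucket}}}
        for f in fields
    ]
    composite = {"size": page_size, "sources": sources}
    if after is not None:
        # Resume from the key returned by a previous page.
        composite["after"] = after
    return {
        "size": 0,  # only aggregation results are needed
        "aggs": {
            "dedup": {
                "composite": composite,
                "aggs": {
                    "first_hit": {"top_hits": {"size": 1, "_source": True}}
                },
            }
        },
    }


def rows_from_response(resp):
    """Return the _source of the single kept document in each bucket."""
    buckets = resp["aggregations"]["dedup"]["buckets"]
    return [
        hit["_source"]
        for bucket in buckets
        for hit in bucket["first_hit"]["hits"]["hits"]
    ]
```

Calling `build_dedup_body(["cloud.region"], missing_bucket=True)` yields a body of the same shape as the composite request shown under the additional context, apart from the aggregation names.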

Do you have any additional context?

curl -XPOST "http://localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "_source": false,
  "aggs": {
    "packets": {
      "terms": {
        "field": "cloud.region",
        "size": 10000
      },
      "aggs": {
        "topN": {
          "top_hits": {
            "size": 1,
            "_source": true
          }
        }
      }
    }
  }
}'

2000ms

curl -XPOST "http://localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "timeout": "1m",
  "_source": false,
  "aggs": {
    "packets": {
      "composite": {
        "size": 10000,
        "sources": [
          {
            "name": {
              "terms": {
                "field": "cloud.region",
                "missing_bucket": true
              }
            }
          }
        ]
      },
      "aggs": {
        "topN": {
          "top_hits": {
            "size": 1,
            "_source": true
          }
        }
      }
    }
  }
}'

missing_bucket false 3500ms
missing_bucket true 4500ms
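If the "search_after" support mentioned earlier refers to paging a composite aggregation through its `after` parameter and the `after_key` returned in each response, a client-side loop might look like the sketch below. The `search` argument is a hypothetical callable that sends a body to the cluster and returns the parsed JSON response; `iter_composite_buckets` is likewise an illustrative name, not an existing API.

```python
import copy


def iter_composite_buckets(search, body, agg_name):
    """Yield every bucket of a composite aggregation, page by page.

    `search` is a hypothetical callable: request body (dict) -> response (dict).
    """
    body = copy.deepcopy(body)  # do not mutate the caller's request
    while True:
        agg = search(body)["aggregations"][agg_name]
        buckets = agg.get("buckets", [])
        if not buckets:
            return  # an empty page means all keys have been consumed
        yield from buckets
        after_key = agg.get("after_key")
        if after_key is None:
            return
        # Ask the next page to start after the last key of this page.
        body["aggs"][agg_name]["composite"]["after"] = after_key
```

With the composite request above, this would be invoked as `iter_composite_buckets(search, body, "packets")`, using a page size smaller than the number of distinct `cloud.region` values to exercise more than one page.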

curl -XPOST "http://localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{
  "from": 0,
  "size": 10000,
  "timeout": "1m",
  "_source": true,
  "collapse": {
    "field": "cloud.region"
  }
}'

13000ms

Labels: PPL, enhancement, pushdown