Improve queries tool
#624
base: main
Conversation
elshize
left a comment
This is not entirely what I had in mind. I think the aggregate function should either be applied to everything or we should just not aggregate at all. Otherwise, it's just too complex to keep track of what is what.
I was also thinking we either summarize or extract. But I think it's fine to just use different streams, but then I would remove the option to summarize only and just always summarize.
Let's discuss this a little more.
@JMMackenzie do you think it makes sense to extract results after aggregation? Say, return the min for each query instead of R results, where R is the number of runs?
If not, then maybe it's best to just always extract everything and always print the summary to stderr, and maybe that summary will always be (a) no aggregate, (b) min aggregate, and (c) mean aggregate? I frankly see no use for max or median. What are your thoughts?
If it makes sense to aggregate the actual output data, then I think there should always be 1 aggregate applied to both data and summary, or no aggregate at all.
tools/queries.cpp
Outdated
```cpp
    case AggregationType::Median: return "median";
    case AggregationType::Max: return "max";
}
return "unknown";
```
This should not return a value. Let's throw an exception instead. Maybe `std::logic_error`?
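For illustration, a minimal sketch of what that could look like; the enum values are assumed to match the PR's aggregation types (none/min/mean/median/max):

```cpp
#include <stdexcept>
#include <string_view>

// Assumed to mirror the PR's AggregationType; not the actual definition.
enum class AggregationType { None, Min, Mean, Median, Max };

std::string_view to_string(AggregationType type) {
    switch (type) {
        case AggregationType::None: return "none";
        case AggregationType::Min: return "min";
        case AggregationType::Mean: return "mean";
        case AggregationType::Median: return "median";
        case AggregationType::Max: return "max";
    }
    // Unreachable for valid enum values; throwing flags the bug loudly
    // instead of silently returning a sentinel string.
    throw std::logic_error("invalid AggregationType");
}
```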
I think this could be reasonable, and I do think this is what I had initially envisaged. But I can see the benefit of extracting everything to stderr, and then also allowing a separate "aggregated" stream to either a file or to stdout. I agree that if there is an aggregated stream being output, then the summary should also use that same aggregation. Maybe we can make a new "results" format where we dump both?
I would actually try to avoid introducing new formats; I'd like to keep this as simple as possible, and as unsurprising as possible as well. I think the most important thing is to give the user the raw data, which they can process however they want. I think we all agree on this part. Then, I would lean towards simplicity: we will not support all types of outputs from this tool, but that's ok, because this is why we give the user raw output.
You have a few choices for implementing this. One is having results as "rows", i.e., a vector of structs describing everything you need, including the query ID. The other would be to have results as nested vectors; after aggregation you still get nested vectors, only each inner vector has one element.
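For illustration, a minimal sketch of the two representations (the type names are hypothetical, not from the PR):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Option 1 ("rows"): one self-describing record per query/run, so the
// query ID travels with each measurement through any aggregation step.
struct TimingRow {
    std::string qid;
    std::size_t run;
    std::uint64_t usecs;
};
using Rows = std::vector<TimingRow>;

// Option 2 (nested vectors): the outer index identifies the query, the
// inner vector holds one time per run; aggregation shrinks each inner
// vector to a single element while keeping the outer shape.
using NestedTimes = std::vector<std::vector<std::uint64_t>>;
```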
I think that, in this case, the median could provide more robustness than the mean, for example by suppressing atypical cases or noise (mostly related to the maximum values across the runs). As for the max aggregation, I don't know if it is really useful (maybe to capture the worst cases?), but it may ultimately not be representative; I only included it because its implementation required no additional effort. If it has no real usefulness, I think it should be removed. Regarding the methodology for printing data or a summary, I think it is useful to show the summary when extracting. In this case, if some of the metrics satisfy the user's needs, there is no need to run an external script (for that reason I think it is useful to print all defined metrics when no aggregation/transformation is specified). Also, given that the query times are printed to the output, they can simply be exported using redirection (e.g., `>`).
That's fair. My main concern is making it too complex. How about we always extract all queries (the user can process that data themselves) and always print all summaries? I really don't want to go the route of defining an aggregate function that only applies to one or the other. Just note that if you use stderr for summaries, we can't pipe it to another tool for transformation, because redirecting would capture the rest of the logs, so it would be purely informative.
The behavior in which all runs (together with all summaries) are extracted occurs when aggregation is set to `none`. However, another experiment could be: "I just want to know what happens when all values are the minimum." Although I can obtain this value from the summary output, if I specify aggregation by min, it makes sense that the summary and the output adapt to that scenario, so I don't need to implement a specific script to reprocess the output data (even though I understand such a script would be simple). This can be useful for quick experiments: if I want to understand the causes behind this value, I can quickly analyze the file that already contains all the minimum values. In any case, I understand that this may introduce unnecessary complexity from an SRP perspective, and an intermediate option would be to remove the aggregation of the extracted output and keep it only for the summary. What do you think @elshize, @JMMackenzie?
```diff
 auto usecs = run_with_timer<std::chrono::microseconds>([&]() {
-    uint64_t result = query_func(query, thresholds[idx]);
+    uint64_t result = query_func(query, thresholds[query_idx]);
     if (safe && result < k) {
```
@elshize, do you think it is worth defining another code path to avoid performing this check on each query run? I'm referring to the cost of evaluating it (which, in any case, is a "constant cost" added to all runs).
Not sure I understand what you mean. Do you mean the cost of evaluating `safe && result < k`? If so, this (a) should be negligible, and (b) is actually part of the query and should be included in the reported time.
Great! That's what I was referring to. The point is that this check wasn't present in the previous version of `--extract`, and I wasn't sure whether there was a specific reason for that.
I personally think it's unnecessary, but if you really want to aggregate only the summary, this needs to be explicitly named in a way that leaves no doubt as to what it does.
I personally think using
@elshize, got it!
Because we want to print multiple summaries (for different agg functions), we'll need to name it somehow to print in JSON:

```json
{"agg_per_query": "none", ...}
{"agg_per_query": "min", ...}
```

Not sure if there's a better name for that; I'm certainly open to suggestions. Summary is a different thing for me: the printed statistics are the summary, and if we always print it, then we don't need to label it, but we may need to use that term in code, docs, or CLI help.
Hi guys, ready with the changes. One thing that hadn't been taken into account is that more than one algorithm (query type) can be specified. Therefore, the changes now support specifying more than one output file (one for each query type specified). Let me know if this is OK or if any changes are needed.
Why not just have an "algorithm" column in the output file? I would rather avoid multiple output files. First, I would say it's no more convenient, if not less convenient, than having one; it's so easy to filter with your dataframe framework of choice, or whatever one uses for crunching data. Furthermore, now you have to worry about ensuring that the number of algorithms is the same as the number of output files, which is just a headache. I would simply print a column header (are we printing it now or not?) and then values. We can keep it TSV.

Regarding summaries, I think it's better to have something like this:

```json
"times": [
    {"query_aggregation": "none", "mean": 6499.02, "q50": 1630, "q90": 20539, "q95": 31282, "q99": 46491},
    {"query_aggregation": "min", "mean": 4257.8, "q50": 1111, "q90": 14786, "q95": 18986, "q99": 28139},
    {"query_aggregation": "mean", "mean": 6498.68, "q50": 1703, "q90": 21923, "q95": 30309, "q99": 40328},
    {"query_aggregation": "median", "mean": 6898.12, "q50": 1768, "q90": 22582, "q95": 33420, "q99": 46509},
    {"query_aggregation": "max", "mean": 8341.13, "q50": 2181, "q90": 27297, "q95": 39730, "q99": 51367}
]
```
```cpp
} else {
    std::sort(query_times.begin(), query_times.end());
    double avg =
    // Print JSON summary
```
We should avoid formatting JSON by hand; we already have a library for that in our deps (`#include <nlohmann/json.hpp>`), which allows you to define JSON similar to a map and then print it.
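For reference, a minimal sketch of that approach (field names and values borrowed from the example summary above):

```cpp
#include <iostream>

#include <nlohmann/json.hpp>

int main() {
    nlohmann::json summary;  // behaves like a map: operator[] inserts keys
    summary["query_aggregation"] = "min";
    summary["mean"] = 4257.8;
    summary["q50"] = 1111;
    std::cout << summary.dump() << '\n';   // compact, single-line output
    std::cout << summary.dump(2) << '\n';  // pretty-printed, 2-space indent
}
```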
I left another comment on the code, but I'll need to come back to this later; just letting you know I have not gone through all of the code yet.
@elshize, ready with the changes. One observation is that the JSON is now printed in an unordered way. Although newer versions of `nlohmann/json.hpp` include an `ordered_json` type that preserves insertion order, our current dependency version does not. Below is an example of the current JSON output:
It's a little unfortunate that we can't control the order, but on the other hand, I don't think it's that crucial, especially with pretty-printing. Also, not sure if the API guarantees it, but it looks like it's not so much unordered as lexicographically ordered. We might want to work on upgrading the dependency anyway, but I don't think it's necessary as part of this work. I think the JSON output you provided above is fine.

One other thing we have to consider is that we are now printing potentially multiple JSON objects, each spanning multiple lines, so it's no longer in JSONL format. This may limit what out-of-the-box tools one can use to parse the output. I typically use `jq`. Ultimately, I don't think this is a big issue; you can always work around it with a bit of scripting. That said, we could address this as well. I see two options off the top of my head. One is to have a flag that switches between pretty-printed and JSONL output. The other approach would be to print a single JSON and put all summaries in an array:

```json
{
    "summaries": [
        {
            "algorithm": "or",
            ...
        },
        {
            "algorithm": "and",
            ...
        }
    ]
}
```

But to be clear, I'm ok with leaving the output as is now.
elshize
left a comment
Leaving some more comments, but still haven't gone through the entire PR.
```diff
@@ -0,0 +1,82 @@
+#include "spdlog/spdlog.h"
```
Let's add the license comment at the top of the file:

```cpp
// Copyright 2025 PISA Developers
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
```
This is something we should be doing, even if we haven't done it for all older files yet. We should probably run a script for that or something... Not an issue for this PR though.
Solved for queries.cpp.
```diff
@@ -0,0 +1,82 @@
+#include "spdlog/spdlog.h"
+
+#include "index_types.hpp"
```
We try to order our includes as follows:
- C++ standard library
- Third party libraries
- PISA includes
Each group should be separated by a blank line (precisely as you did, just in a different order).
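So, using the includes from this file, the order would look something like this (the standard-library headers are placeholders for illustration):

```cpp
#include <chrono>   // C++ standard library first
#include <vector>

#include "spdlog/spdlog.h"   // then third-party libraries

#include "index_types.hpp"   // then PISA includes
```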
```cpp
using pisa::do_not_optimize_away;
using pisa::get_time_usecs;
```
Although this is not a big issue, I think we should use `using` sparingly, and only if it really improves readability. Here, we only use these functions several times and they don't elide much. I think there's value in being explicit by qualifying each use with its namespace:

```cpp
auto start = pisa::get_time_usecs();
```
Oops! perftest_decoding is an internal script that I'm using to test some experiments I'm working on. I'll delete it, but if it is useful, I can include it; I think it would be better in a separate PR, though, as it's outside the scope of this one ("Improve queries tool"). Alternatively, we could adjust the scope of this PR.
```cpp
// Puts pointer on next posting.
plist.next();

// Loads docid and freq (if required) of current posting. On the other hand,
```
I don't think "on the other hand" makes much sense here, and also I don't think we need to explain what do_not_optimize_away does, I think it's pretty self-explanatory, and we use it everywhere else without explanation.
If anything, we can add a doc comment to the function definition explaining what it does but we do not need to explain when we use it.
The reason for these comments is explained above (old internal script, sorry!), but if we decide to include it, I can make the changes.
```cpp
// the compiler may decide not to execute the following sentences because the
// result isn't used.
do_not_optimize_away(plist.docid());
if (decode_freqs) {
```
I think what you wanted is:

```cpp
if constexpr (decode_freqs) {
```

This will simply remove this branch from the version of the function with `decode_freqs = false`, so we won't be measuring the branch decision itself.
I don't expect this branch to affect the results in any noticeable way (after all, the loop itself has conditions as well and we're ok with it), but the only reason I see for passing `decode_freqs` as a template parameter is to evaluate it in a constexpr context. Otherwise it could simply be a function argument.
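For illustration, a hedged sketch of the template-parameter approach (the function name `decode_all` and the posting-list interface are assumptions, not the PR's code):

```cpp
#include <cstdint>
#include <utility>

template <typename T>
void do_not_optimize_away(T&&);  // assumed provided by PISA's benchmark utils

template <bool decode_freqs, typename PostingList>
void decode_all(PostingList& plist, std::uint64_t count) {
    for (std::uint64_t i = 0; i < count; ++i) {
        do_not_optimize_away(plist.docid());
        if constexpr (decode_freqs) {
            // Compiled out entirely when decode_freqs == false, so the
            // measured loop contains no runtime check for it.
            do_not_optimize_away(plist.freq());
        }
        plist.next();
    }
}
```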
```diff
-    Functor query_func,
+template <typename Fn>
+void extract_times(
+    Fn query_func,
```
I think we need to refactor this a little. It would be cleaner if we actually extract all times first, then summarize and/or print. Something like:

```cpp
extracted_times = extract_times(...);
summarize(extracted_times, ...);
if (should_print) {
    print_times(extracted_times);
}
```

`extracted_times` will likely have to be some kind of struct, or vector of structs, or something like that, because you'll need the times and the corresponding run count.
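A sketch of what the separated pieces might look like; all types and names here are hypothetical, and the summary assumes non-empty inputs:

```cpp
#include <numeric>
#include <ostream>
#include <string>
#include <vector>

// Hypothetical result type: times for one query across all runs.
struct QueryTimes {
    std::string qid;
    std::vector<double> run_usecs;
};

// Summarizing is pure: it consumes the extracted data and returns a value.
double mean_of_query_means(std::vector<QueryTimes> const& times) {
    double total = 0.0;
    for (auto const& qt: times) {
        total += std::accumulate(qt.run_usecs.begin(), qt.run_usecs.end(), 0.0)
            / qt.run_usecs.size();
    }
    return total / times.size();
}

// Printing is a separate concern; extraction never sees the stream.
void print_times(std::vector<QueryTimes> const& times, std::ostream& os) {
    for (auto const& qt: times) {
        os << qt.qid;
        for (auto t: qt.run_usecs) {
            os << '\t' << t;
        }
        os << '\n';
    }
}
```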
elshize
left a comment
Ok, finished going through it, left some comments.
A general note:
I would discourage the nesting-doll style where you keep passing slightly modified parameters down and each next function moves the entire logic slightly forward. It's usually much clearer if we break down our programs into sub-programs and stick to separation of concerns.
For example, extracting times should typically have nothing to do with printing or summarizing them, so the extracting function should not receive the output stream at all.
If we break things down, they are typically easier to reason about. There are many reasons for that, including: the functions take fewer parameters, we are forced to return some meaningful types, understanding a function in isolation is much easier than understanding it globally, etc. One particular thing we should strive for is to keep mutable state contained, and as much as we can, try to have pure functions doing the complex logic, so we can deterministically predict what happens based on parameters. Of course, benchmarks are not deterministic in the values they produce, but I'm talking about all the rest.
Note that this nested style of function is quite common in this code base, especially in legacy code, but we should fight it and break away from it as much as possible.
```cpp
stats_line()("type", index_type)("query", query_type)("avg", avg)("q50", q50)("q90", q90)(
    "q95", q95)("q99", q99);
// Save times per query (if required)
if (os != nullptr) {
```
As mentioned above, I'd like to isolate this from extracting the data itself.
Once that's done, it should look something like this:

```cpp
if (should_print) {
    auto os = open_file();
    // write results
}
```

We do away with a potentially-null pointer altogether.
| spdlog::info("K: {}", k); | ||
| std::ofstream output_file; | ||
| std::ostream* output = nullptr; | ||
| if (output_path) { |
Despite what I outlined in the other comment, I think it's not a bad idea to open the file first and then extract, so that we can fail quickly if there's an I/O issue when opening the stream.
That said, I would instead use an optional rather than a pointer. For example:

```cpp
std::optional<std::ofstream> out = [&output_path]() -> std::optional<std::ofstream> {
    if (output_path.has_value()) {
        auto out = std::ofstream(*output_path);
        // Do necessary checks...
        return std::move(out);
    }
    return std::nullopt;
}();
```

Or:

```cpp
std::optional<std::ofstream> out = std::nullopt;
if (output_path) {
    out = std::ofstream(*output_path);
    // Do necessary checks
}
```

The second one may be cleaner if you're inlining it here. The first one could be a better option if you decide to extract it to a separate function. I think it's ok to leave it here and use the second approach.
```cpp
output_file.open(*output_path);
if (!output_file.is_open()) {
    const auto err_msg = fmt::format("Failed to open output file: {}.", *output_path);
    spdlog::error(err_msg);
```
I think rather than printing this error log here, we should make sure that the `main()` function catches any exception and prints the error message.
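A hedged sketch of that pattern, using a function-try-block on `main()`; `run_queries` is a hypothetical stand-in for the tool's actual entry logic:

```cpp
#include <exception>

#include "spdlog/spdlog.h"

void run_queries(int argc, char** argv);  // hypothetical: the tool's logic

int main(int argc, char** argv) try {
    run_queries(argc, argv);
    return 0;
} catch (std::exception const& e) {
    // Any exception thrown below main (e.g., a failed file open) lands here,
    // so error reporting lives in exactly one place.
    spdlog::error("{}", e.what());
    return 1;
}
```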
| output_file << "algorithm\tqid"; | ||
| for (size_t i = 1; i <= runs; ++i) { | ||
| output_file << fmt::format("\tusec{}", i); | ||
| } | ||
| output_file << "\n"; | ||
|
|
||
| spdlog::info("Per-run query output will be saved to '{}'.", *output_path); |
Although we can do this when resolving the file stream, I think it's better to do it below:

```cpp
if (output) {
    spdlog::info("Per-run query output will be saved to '{}'.", *output_path);
    // print header
}
```
Thank you for your review @elshize, I'll work on that.
@elshize, I tested this with `jq`.
The thing is that if we use JSONL, then readability is worse because everything is crammed into one line. But then again, if someone requires pretty-printing, they can also just pipe it to `jq`. I'll leave it up to you; I'm ok with pretty-printed JSONs, JSONL, or a single JSON with a list, and I have no strong feelings towards any of these.
Key changes in this pull request:
--extractoption by--output, which now requires an explicit output file. Since more than one algorithm (query type) could be specified, the algorithm is now also printed in the TSV.op_perftest()behavior remains available when--outputis not specified.--runsoption to specify the number of runs to measure the query set (by default: 3). Note that this parameter excludes warmup.none,min,mean,medianandmaxas aggregation types.op_perftest(), so the set of queries is evaluated independently in each run.