
Conversation

@gustingonzalez (Collaborator) commented Nov 19, 2025

Key changes in this pull request:

  1. Replaces the --extract option with --output, which now requires an explicit output file. Since more than one algorithm (query type) can be specified, the algorithm is now also printed in the TSV (an example invocation is shown after this list).
  2. The original op_perftest() behavior remains available when --output is not specified.
  3. Adds a new --runs option to specify the number of runs used to measure the query set (default: 3). Note that this count excludes the warmup run.
  4. The summary now includes per-run query timing aggregation using none, min, mean, median and max as aggregation types.
  5. Measures each query once per run as with the former op_perftest(), so the set of queries is evaluated independently in each run.
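
For illustration, an invocation using the new options might look like this (paths, encoding, and algorithm values are placeholders taken from the example later in this thread; the exact interface is discussed further down):

./build/bin/queries --encoding block_interpolative --index /path/to/index.block_interpolative -k 10 --algorithm or --scorer bm25 -q /path/to/queries.txt --runs 3 --output times.tsv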

@elshize (Member) left a comment

This is not entirely what I had in mind. I think the aggregate function should either be applied to everything or we should just not aggregate at all. Otherwise, it's just too complex to keep track of what is what.

I was also thinking we would either summarize or extract. But I think it's fine to just use different streams; in that case, I would remove the option to summarize only and just always summarize.

Let's discuss this a little more.

@JMMackenzie do you think it makes sense to extract results after aggregation? Say, return min for each query instead of R results where R is num of runs?

If not, then maybe it's best to just always extract everything and always print out summary to stderr and maybe that summary will always be (a) no aggregate, (b) min aggregate, and (c) mean aggregate? I frankly see no use for max or median. What are your thoughts?

If it makes sense to aggregate the actual output data, then I think there should always be 1 aggregate applied to both data and summary, or no aggregate at all.

case AggregationType::Median: return "median";
case AggregationType::Max: return "max";
}
return "unknown";
Member

This should not return a value. Let's throw an exception instead. Maybe std::logic_error?
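
A minimal sketch of what that could look like (the function and enumerator names here mirror the snippet above but are only illustrative, not the final code):

#include <stdexcept>
#include <string>

[[nodiscard]] auto aggregation_name(AggregationType type) -> std::string {
    switch (type) {
        case AggregationType::None: return "none";
        case AggregationType::Min: return "min";
        case AggregationType::Mean: return "mean";
        case AggregationType::Median: return "median";
        case AggregationType::Max: return "max";
    }
    // Reached only if an out-of-range value is cast to the enum.
    throw std::logic_error("unhandled AggregationType value");
}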

@JMMackenzie (Member)

@JMMackenzie do you think it makes sense to extract results after aggregation? Say, return min for each query instead of R results where R is num of runs?

I think this could be reasonable, and I do think this was what I had initially envisaged. But I can see the benefit of extracting everything to stderr, and then also allowing a separate "aggregated" stream to either a file or to stdout? I agree that if there is an aggregated stream being output, then a summary should also use that same aggregation. Maybe we can make a new "results" format where we dump both file.agg.summary and file.agg.stream?

@elshize (Member) commented Nov 20, 2025

I would actually try to avoid introducing new formats, I'd like to keep it as simple as possible, and as unsurprising as possible as well.

I think the most important is to give the user the raw data, which they can process however they want. I think we all agree on this part.

Then, I would lean towards simplicity:

  • I think having a single type of output (data vs summary) is much less ambiguous, so I lean towards having either one or the other. I'm thinking of the summary as a convenience during prototyping rather than actual serious data gathering.
  • Having the ability to aggregate for a query may be crucial for summary, and maybe useful for extracting data.
  • I think we should think hard what agg functions make sense, and not include those that don't. I'm struggling to find use for max or median. I think min and mean makes sense.

We will not support all types outputs from this tool but that's ok. This is why we give the user raw output.

Given the above, I would suggest the following algorithm:

results = run_benchmark(...)
if (aggregate) {
  results = aggregate_per_query(results, agg_function)
}
if (summarize) print_summary(results)
else print_data(results)

You have a few choices for implementing this.

One is having results as "rows", i.e., a vector of structs describing everything you need, including the query ID, and then aggregate_per_query can internally do group-by (could just use a hash map) and aggregate.

The other would be to have results as nested vectors and then after aggregation you still get nested vectors, only each inner vector has one element.

agg_function can take multiple results and output multiple results as well; this way you can treat no aggregate as a type of aggregate (identity) -- maybe "transform" is a better name? I would prefer creating that transform function first (using CLI options) and then passing it and applying it for each query group, as opposed to passing a name and having all the if-else conditions nested. It would be much cleaner that way, but it requires some design considerations.

print_data and print_summary don't need to know anything about how results were produced or transformed, only how to print them (or calculate stats).
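
For illustration, a rough sketch of the first (row-based) option; all names here are hypothetical and not actual PR code:

#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

struct TimingRow {
    std::string qid;
    std::uint64_t usecs;
};

// A per-query transform: takes all measurements of one query and returns any number
// of values (identity for "no aggregation", a single element for min/mean/etc.).
using Transform = std::function<std::vector<std::uint64_t>(std::vector<std::uint64_t>)>;

auto aggregate_per_query(std::vector<TimingRow> const& rows, Transform const& transform)
    -> std::vector<TimingRow> {
    // Group times by query ID.
    std::unordered_map<std::string, std::vector<std::uint64_t>> grouped;
    for (auto const& row: rows) {
        grouped[row.qid].push_back(row.usecs);
    }
    // Apply the transform to each group and flatten back into rows.
    std::vector<TimingRow> result;
    for (auto& [qid, times]: grouped) {
        for (auto usecs: transform(std::move(times))) {
            result.push_back({qid, usecs});
        }
    }
    return result;
}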

@gustingonzalez (Collaborator Author) commented Nov 20, 2025

I think that, in this case, the median could provide more robustness than the mean; for example, by suppressing atypical cases or noise (mostly related to the maximum values across the runs).

As for the max aggregation, I don't know if it is really useful (maybe to capture the worst cases?), but it may ultimately not be representative; I just included it because its implementation required no additional effort. If it has no real usefulness, I think it should be removed.

Regarding the methodology for printing data or a summary, I think it is useful to show the summary when extracting. In this case, if some of the metrics satisfy the user's needs, there is no need to run an external script (for that reason, I think it is useful to print all defined metrics when no aggregation/transformation is specified).

Also, given that the query times are printed to the output, they can simply be exported using redirection (>). If that is confusing, a file path could be used instead, but I think redirection keeps it simple. In the case of printing just the summary (for example, for a quick check of a prototype), I agree that it does not make sense to print the query times.

@gustingonzalez gustingonzalez marked this pull request as draft November 20, 2025 03:14
@elshize (Member) commented Nov 21, 2025

I think that, in this case, the median could provide more robustness than the mean; for example, by suppressing atypical cases or noise (mostly related to the maximum values across the runs).

That's fair.


My main concern is making it too complex. How about we always extract all queries (user can process that data themselves) and always print all summaries?

I really don't want to go the route of defining aggregate function that will only apply to one or the other.

Just note that if you use stderr for summaries, we can't pipe it to another tool for transformation, because redirecting stderr will also capture the rest of the logs, so it will be purely informative.

@gustingonzalez (Collaborator Author)

The behavior in which all runs (together with all summaries) are extracted occurs when aggregation is set to none (or not specified): in this case, the tool shows the summary with the runs aggregated by min, mean, etc.

However, another experiment could be: "I just want to know what happens when all values are the minimum". Although I can obtain this value from the summary output, if I specify aggregation by min, it makes sense that the summary and the output adapt to that scenario, so I don't need to implement a specific script to reprocess the output data (even though I understand such a script would be simple). This can be useful for quick experiments: if I want to understand the causes behind this value, I can quickly analyze the file that already contains all the minimum values.

In any case, I understand that this may introduce unnecessary complexity from an SRP perspective, and an intermediate option would be to remove the --aggregate-by parameter and always show all the data with the summary for the different values, as mentioned.

What do you think @elshize, @JMMackenzie?

auto usecs = run_with_timer<std::chrono::microseconds>([&]() {
uint64_t result = query_func(query, thresholds[idx]);
uint64_t result = query_func(query, thresholds[query_idx]);
if (safe && result < k) {
Collaborator Author

@elshize, do you think it is worth defining another path to avoid performing this check for each query run? I'm referring to the cost of evaluating it (which, in any case, is a "constant cost" added to all runs).

Member

Not sure if I understand what you mean. Do you mean the cost of evaluating safe && result < k? If so, then this (a) should be negligible, and (b) is actually part of the query and should be included in the reported time.

Collaborator Author

Great! That's what I was referring to. The point is that this check wasn't present in the previous version of --extract, and I wasn't sure whether there was a specific reason for that.

@elshize (Member) commented Nov 22, 2025

I personally think it's unnecessary, but if you really want to only aggregate the summary, this needs to be named explicitly, in a way that leaves no doubt as to what it does. --aggregate-by is not good enough. --summarize-by, maybe? Not sure, but it must be explicit about what we're aggregating.

@elshize (Member) commented Dec 26, 2025

I personally think using --output is fine because it's the main output and because it's all going to be documented under --help. I think --query-times-output is overly verbose. If we want to be more precise, perhaps --data-output/-o?

@gustingonzalez (Collaborator Author)

@elshize, got it!
One more question: do you think it makes sense to keep the aggregation concept, or should we switch to summary?

@elshize (Member) commented Dec 26, 2025

Because we want to print multiple summaries (for different agg functions), we'll need to name it somehow to print in JSON:

{"agg_per_query": "none", ...}
{"agg_per_query": "min", ...}

Not sure if there's maybe a better name for that; I'm certainly open to suggestions.

Summary is a different thing for me: the printed statistics are the summary, and if we always print it, then we don't need to label it, but we may need to use that term in code, docs, or CLI help.

@gustingonzalez (Collaborator Author)

Hi, guys, ready with the changes.

One thing that hadn't been taken into account is that more than one algorithm (query type) can be specified. Therefore, the changes now support specifying more than one output file (one for each query type specified).

The following is an example of execution and its output:

./build/bin/queries --encoding block_interpolative --index /path/to/index.block_interpolative -k 10 --algorithm or:and --scorer bm25 -q /path/to/queries.txt -o data-or.csv:data-and.csv
[2025-12-26 18:28:03.714] [stderr] [info] Warming up posting lists...
[2025-12-26 18:28:03.809] [stderr] [info] Per-run query output will be saved to 'data-or.csv'.
[2025-12-26 18:28:03.809] [stderr] [info] Performing 3 runs for 'or' queries...
{
  "encoding": "block_interpolative",
  "algorithm": "or",
  "runs": 3,
  "k": 10,
  "safe": false,
  "corrective_reruns": 0,
  "query_aggregation": {
    "none": {"mean": 17183.4, "q50": 8942, "q90": 41413, "q95": 46643, "q99": 62398},
    "min": {"mean": 14845.6, "q50": 7693, "q90": 39044, "q95": 45186, "q99": 58988},
    "mean": {"mean": 17183.1, "q50": 9266, "q90": 39809, "q95": 46401, "q99": 63660},
    "median": {"mean": 18283.7, "q50": 9471, "q90": 42155, "q95": 46711, "q99": 63830},
    "max": {"mean": 18421, "q50": 9565, "q90": 42421, "q95": 46960, "q99": 63889}
  }
}
[2025-12-26 18:29:13.680] [stderr] [info] Per-run query output will be saved to 'data-and.csv'.
[2025-12-26 18:29:13.680] [stderr] [info] Performing 3 runs for 'and' queries...
{
  "encoding": "block_interpolative",
  "algorithm": "and",
  "runs": 3,
  "k": 10,
  "safe": false,
  "corrective_reruns": 0,
  "query_aggregation": {
    "none": {"mean": 6499.02, "q50": 1630, "q90": 20539, "q95": 31282, "q99": 46491},
    "min": {"mean": 4257.8, "q50": 1111, "q90": 14786, "q95": 18986, "q99": 28139},
    "mean": {"mean": 6498.68, "q50": 1703, "q90": 21923, "q95": 30309, "q99": 40328},
    "median": {"mean": 6898.12, "q50": 1768, "q90": 22582, "q95": 33420, "q99": 46509},
    "max": {"mean": 8341.13, "q50": 2181, "q90": 27297, "q95": 39730, "q99": 51367}
  }
}

Let me know if this is OK or if any changes are needed.

@elshize (Member) commented Dec 26, 2025

One thing that hadn't been taken into account is that more than one algorithm (query type) can be specified. Therefore, the changes now support specifying more than one output file (one for each query type specified).

Why not just have an "algorithm" column in the output file? I would rather avoid multiple output files. First, I would say it's no more convenient, if not less convenient than having one. It's so easy to filter out with your dataframe framework of choice, or whatever one uses for crunching data. Furthermore, now you have to worry about ensuring that the number of algorithms is the same as the number of output files, which is just a headache.

I would simply print a column header (are we printing it now or no?) and then values. We can keep it TSV.

Regarding summaries, I think it's better to have something like this:

  "times": [
    {"query_aggregation": "none", "mean": 6499.02, "q50": 1630, "q90": 20539, "q95": 31282, "q99": 46491},
    {"query_aggregation": "min", "mean": 4257.8, "q50": 1111, "q90": 14786, "q95": 18986, "q99": 28139},
    {"query_aggregation": "mean", "mean": 6498.68, "q50": 1703, "q90": 21923, "q95": 30309, "q99": 40328},
    {"query_aggregation": "median", "mean": 6898.12, "q50": 1768, "q90": 22582, "q95": 33420, "q99": 46509},
    {"query_aggregation": "max", "mean": 8341.13, "q50": 2181, "q90": 27297, "q95": 39730, "q99": 51367}
  ]

I think the mapping query_aggregation->min->{} is not clear on what these objects actually contain, while times->min->{} is not clear about what "min" is.

} else {
std::sort(query_times.begin(), query_times.end());
double avg =
// Print JSON summary
Member

We should avoid formatting JSON by hand, we have a library for that in our deps already (#include <nlohmann/json.hpp>), which allows you to define JSON similar to a map, and then print it.
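
For example, roughly (field values here are just taken from the example output above, not the actual code):

#include <iostream>

#include <nlohmann/json.hpp>

int main() {
    // Build the summary as a JSON value; the library takes care of escaping and formatting.
    nlohmann::json summary = {
        {"encoding", "block_interpolative"},
        {"algorithm", "or"},
        {"runs", 3},
        {"k", 10},
        {"times", nlohmann::json::array()},
    };
    std::cout << summary.dump(2) << '\n';  // pretty-printed with a 2-space indent
}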

@elshize (Member) commented Dec 26, 2025

I left another comment for the code, but I'll need to come back to this later, just letting you know I have not gone through all of the code yet.

@gustingonzalez (Collaborator Author)

@elshize, ready with the changes.

One observation is that the JSON is now printed in an unordered way. Although newer versions of <nlohmann/json.hpp> include an ordered_json object, I can't "easily" update the library because it is included via the https://github.com/pisa-engine/warcpp submodule. If you think it's necessary, we could consider updating it.

Below is an example of the current JSON output:

{
  "algorithm": "or",
  "corrective_reruns": 0,
  "encoding": "block_interpolative",
  "k": 10,
  "runs": 1,
  "safe": false,
  "times": [
    {
      "mean": 8665.901,
      "q50": 4529.0,
      "q90": 19855.0,
      "q95": 22181.0,
      "q99": 30635.0,
      "query_aggregation": "none"
    },
    {
      "mean": 8665.901,
      "q50": 4529.0,
      "q90": 19855.0,
      "q95": 22181.0,
      "q99": 30635.0,
      "query_aggregation": "min"
    },
    {
      "mean": 8665.901,
      "q50": 4529.0,
      "q90": 19855.0,
      "q95": 22181.0,
      "q99": 30635.0,
      "query_aggregation": "mean"
    },
    {
      "mean": 8665.901,
      "q50": 4529.0,
      "q90": 19855.0,
      "q95": 22181.0,
      "q99": 30635.0,
      "query_aggregation": "median"
    },
    {
      "mean": 8665.901,
      "q50": 4529.0,
      "q90": 19855.0,
      "q95": 22181.0,
      "q99": 30635.0,
      "query_aggregation": "max"
    }
  ]
}

@elshize (Member) commented Dec 27, 2025

It's a little unfortunate that we can't control the order but on the other hand, I don't think it's that crucial, especially with pretty-printing. Also, not sure if APIs guarantee it, but it looks like it's not so much unordered as lexicographically ordered.

We might want to work on upgrading the dependency anyway, but I don't think it's necessary as part of this work. I think the JSON output you provided above is fine.


One other thing we have to consider is that now we are printing potentially multiple JSON objects, each in multiple lines, so it's no longer in JSONL format. This may limit what out-of-the-box tools one can use to parse the output. I typically use jq for parsing my JSON outputs, which supports multiple multi-line JSONs but I'm not sure if, say, pandas.read_json supports it (I didn't check but the docs make it sound like it needs to be line-by-line).

Ultimately, I don't think this is a big issue. You can always use queries ... | jq -c > summaries.jsonl, and I really would expect people to use the raw results if they intend to do any non-trivial processing.

That said, we could address this as well. I see two options off the top of my head.

One is to have a flag --pretty-print (or --no-pretty-print) but it's complicated by the fact we have two outputs, and I would like to avoid some very long complex names if possible. Plus, as I mention above, this can be easily achieved by piping it to jq -c so it feels unnecessary. I know jq is a third party tool, but ubiquitous and available in any Linux package manager I know and on brew for mac.

The other approach would be to print a single JSON and put all summaries in an array:

{
  "summaries": [
    {
      "algorithm": "or",
      ...
    },
    {
      "algorithm": "and",
      ...
    },
  ]
}

But to be clear, I'm ok with leaving the output as is now.

@elshize (Member) left a comment

Leaving some more comments, but still haven't gone through the entire PR.

@@ -0,0 +1,82 @@
#include "spdlog/spdlog.h"
Member

Let's add the license comment at the top of the file:

// Copyright 2025 PISA Developers
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

This is something we should be doing, even if we haven't done it for all older files yet. We should probably run a script for that or something... Not an issue for this PR though.

Collaborator Author

Solved for queries.cpp.

@@ -0,0 +1,82 @@
#include "spdlog/spdlog.h"

#include "index_types.hpp"
Member

We try to order our includes as follows:

  1. C++ standard library
  2. Third party libraries
  3. PISA includes

Each group should be separated by a new line (precisely as you did, just in different order).
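
For example (the standard-library headers here are only illustrative):

#include <chrono>
#include <vector>

#include "spdlog/spdlog.h"

#include "index_types.hpp"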

Comment on lines +9 to +10
using pisa::do_not_optimize_away;
using pisa::get_time_usecs;
Member

Although this is not a big issue, I think we should use using sparingly, and only if it really improves readability. Here, we only use these functions several times and they don't elide much. I think there's value in being explicit by qualifying each use with its namespace:

auto start = pisa::get_time_usecs();

@gustingonzalez (Collaborator Author) commented Jan 2, 2026

Oops! perftest_decoding is an internal script that I'm using to test some experiments I'm doing. I'll delete it, but if it is useful, I can include it; I think it would be better included in a separate MR, though, as it's outside the scope of this one ("Improve queries tool"). Alternatively, we could adjust the scope of this MR.

// Puts pointer on next posting.
plist.next();

// Loads docid and freq (if required) of current posting. On the other hand,
Member

I don't think "on the other hand" makes much sense here, and also I don't think we need to explain what do_not_optimize_away does, I think it's pretty self-explanatory, and we use it everywhere else without explanation.

If anything, we can add a doc comment to the function definition explaining what it does but we do not need to explain when we use it.
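
For example, something like this on the definition itself (the wording and signature here are only a guess, not the actual PISA declaration):

/// Forces the compiler to treat `datum` as used, so that benchmarked computations
/// whose results are otherwise discarded are not optimized away.
template <typename T>
void do_not_optimize_away(T&& datum);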

@gustingonzalez (Collaborator Author) commented Jan 2, 2026

The reason for these comments is explained above (old internal script, sorry!), but if we decide to include it, I can make the changes.

// the compiler may decide not to execute the following sentences because the
// result isn't used.
do_not_optimize_away(plist.docid());
if (decode_freqs) {
Member

I think what you wanted is:

if constexpr (decode_freqs) {

This will simply remove this branch for the version of the function with decode_freqs=false, so we won't be measuring the branch decision itself.

I don't expect this branch to affect the results in any noticeable way (after all, the loop itself has conditions as well and we're ok with it), but the only reason I see for passing decode_freqs as a template parameter is if you want to evaluate it in constexpr context. Otherwise this could simply be a function argument.
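
For example, roughly (the function name is a placeholder and the cursor interface just mirrors the snippet above; this is only a sketch):

#include <cstddef>

template <bool decode_freqs, typename PostingList>
void decode_postings(PostingList& plist, std::size_t postings_count) {
    for (std::size_t i = 0; i < postings_count; ++i) {
        plist.next();
        pisa::do_not_optimize_away(plist.docid());
        if constexpr (decode_freqs) {
            // Compiled out entirely when decode_freqs == false,
            // so the branch decision itself is never measured.
            pisa::do_not_optimize_away(plist.freq());
        }
    }
}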

Functor query_func,
template <typename Fn>
void extract_times(
Fn query_func,
@elshize (Member) commented Dec 28, 2025

I think we need to refactor this a little. It would be cleaner if we actually extract all times first, then summarize and/or print. Something like:

extracted_times = extract_times(...);
summarize(extracted_times, ...);
if (should_print) {
    print_times(extracted_times);
}

extracted_times will likely have to be some kind of struct or a vector of structs, or something like that, because you'll need both the times and the corrective run count.
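
For example, something along these lines (hypothetical names, just to show the shape):

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct ExtractedTimes {
    // One row per (query, run) pair: query ID, run index, and time in microseconds.
    struct Row {
        std::string qid;
        std::size_t run;
        std::uint64_t usecs;
    };
    std::vector<Row> rows;
    std::size_t corrective_reruns = 0;
};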

@elshize (Member) left a comment

Ok, finished going through it, left some comments.


A general note:

I would discourage the nesting doll style where you keep passing slightly modified parameters down and the next function slightly moves forward the entire logic. It's usually much clearer if we break down our programs into sub-programs and stick to separation of concerns.

For example, extracting times should typically have nothing to do with printing or summarizing them, so the extracting function should not get the output stream at all.

If we break things down, they are typically easier to reason about. There are many reasons for that, including: the functions take fewer parameters, we are forced to return some meaningful types, understanding a function in isolation is much easier than understanding it globally, etc. One particular thing we should strive for is to keep mutable state contained and, as much as we can, to have pure functions doing the complex logic, so we can deterministically predict what happens based on parameters. Of course, benchmarks are not deterministic in what values they produce, but I'm talking about all the rest.

Note that this nested style of function is quite common in this code base, especially in legacy code, but we should fight it and break away from it as much as possible.

stats_line()("type", index_type)("query", query_type)("avg", avg)("q50", q50)("q90", q90)(
"q95", q95)("q99", q99);
// Save times per query (if required)
if (os != nullptr) {
Member

As mentioned above, I'd like to isolate this from extracting the data itself.

Once that's done, it should look something like this:

if (should_print) {
    auto os = open_file()
    // write results
}

We do away with a potentially-null pointer altogether.

spdlog::info("K: {}", k);
std::ofstream output_file;
std::ostream* output = nullptr;
if (output_path) {
Member

Despite what I outlined in the other comment, I think it's not a bad idea to open the file first and then extract, so that we can fail quickly if there's an I/O issue when opening the stream.

That said, I would instead use an optional rather than a pointer. For example:

std::optional<std::ofstream> out = [&output_path]() -> std::optional<std::ofstream> {
    if (output_path.has_value()) {
        auto out = std::ofstream(*output_path);
        // Do necessary checks...
        return std::move(out);
    }
    return std::nullopt;
}();

Or:

std::optional<std::ofstream> out = std::nullopt;
if (output_path) {
    out = std::ofstream(*output_path);
    // Do necessary checks
}

The second one may be cleaner if you're inlining it here. The first one could be a better option if you decide to extract it to a separate function. I think it's ok to leave it here and use the second approach.

output_file.open(*output_path);
if (!output_file.is_open()) {
const auto err_msg = fmt::format("Failed to open output file: {}.", *output_path);
spdlog::error(err_msg);
Member

I think that rather than printing this error log here, we should make sure that the main() function catches any exception and prints the error message.
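
For example (sketch only):

#include <exception>

#include "spdlog/spdlog.h"

int main(int argc, char** argv) try {
    // ... parse CLI options and run the benchmark ...
    return 0;
} catch (std::exception const& e) {
    spdlog::error("{}", e.what());
    return 1;
}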

Comment on lines +257 to +263
output_file << "algorithm\tqid";
for (size_t i = 1; i <= runs; ++i) {
output_file << fmt::format("\tusec{}", i);
}
output_file << "\n";

spdlog::info("Per-run query output will be saved to '{}'.", *output_path);
Member

Although we can do this when resolving the file stream, I think it's better to do it below:

if (output) {
    spdlog::info("Per-run query output will be saved to '{}'.", *output_path);
    // print header
}

@gustingonzalez (Collaborator Author)

Thank you for your review @elshize, I'll work on that.

@gustingonzalez (Collaborator Author)


@elshize, I tested reading the multi-line JSON output with pandas (data = pd.read_json('example.json', typ='series')) and it works fine (unless there is another issue I'm not seeing). However, if the project is using JSONL, we could switch to that to keep consistency. In fact, one advantage of JSONL is that it allows putting all runs for different algorithms into the same file, which could be a good reason to keep JSONL.

@elshize (Member) commented Jan 2, 2026

However, if the project is using JSONL, we could switch to that to keep consistency.

The thing is that if we use JSONL, then readability is worse because everything is crammed into a single line. But then again, if someone requires pretty-printing, they can also just pipe it to jq with no arguments.

I'll leave it up to you, I'm ok with pretty printed JSONs, JSONL, or with a single JSON with a list, and I have no strong feelings towards any of these.
