Improve queries tool
#624
base: main
Conversation
elshize
left a comment
This is not entirely what I had in mind. I think the aggregate function should either be applied to everything or we should just not aggregate at all. Otherwise, it's just too complex to keep track of what is what.
I was also thinking we either summarize or extract. But I think it's fine to just use different streams, but then I would remove the option to summarize only and just always summarize.
Let's discuss this a little more.
@JMMackenzie do you think it makes sense to extract results after aggregation? Say, return the min for each query instead of R results, where R is the number of runs?
If not, then maybe it's best to just always extract everything and always print the summary to stderr, and maybe that summary will always be (a) no aggregate, (b) min aggregate, and (c) mean aggregate? I frankly see no use for max or median. What are your thoughts?
If it makes sense to aggregate the actual output data, then I think there should always be 1 aggregate applied to both data and summary, or no aggregate at all.
tools/queries.cpp
Outdated
```cpp
    case AggregationType::Median: return "median";
    case AggregationType::Max: return "max";
}
return "unknown";
```
This should not return a value. Let's throw an exception instead. Maybe `std::logic_error`?
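For illustration, a minimal sketch of what that could look like; the enum values are assumed to match the PR's aggregation types (none/min/mean/median/max):

```cpp
#include <stdexcept>
#include <string_view>

// Assumed to mirror the PR's AggregationType; not the actual definition.
enum class AggregationType { None, Min, Mean, Median, Max };

std::string_view to_string(AggregationType type) {
    switch (type) {
        case AggregationType::None: return "none";
        case AggregationType::Min: return "min";
        case AggregationType::Mean: return "mean";
        case AggregationType::Median: return "median";
        case AggregationType::Max: return "max";
    }
    // Unreachable for valid enum values; throwing flags the bug loudly
    // instead of silently returning a sentinel string.
    throw std::logic_error("invalid AggregationType");
}
```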
I think this could be reasonable, and I do think this is what I had initially envisaged. But I can see the benefit of extracting everything to stderr, and then also allowing a separate "aggregated" stream to either a file or to stdout. I agree that if there is an aggregated stream being output, then the summary should also use that same aggregation. Maybe we can make a new "results" format where we dump both?
I would actually try to avoid introducing new formats; I'd like to keep this as simple as possible, and as unsurprising as possible as well. I think the most important thing is to give the user the raw data, which they can process however they want. I think we all agree on this part. Then, I would lean towards simplicity: we will not support all types of outputs from this tool, but that's ok, because this is why we give the user raw output.
You have a few choices for implementing this. One is having results as "rows", i.e., a vector of structs describing everything you need, including the query ID. The other would be to have results as nested vectors; after aggregation you still get nested vectors, only each inner vector has one element.
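For illustration, a minimal sketch of the two representations (the type names are hypothetical, not from the PR):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Option 1 ("rows"): one self-describing record per query/run, so the
// query ID travels with each measurement through any aggregation step.
struct TimingRow {
    std::string qid;
    std::size_t run;
    std::uint64_t usecs;
};
using Rows = std::vector<TimingRow>;

// Option 2 (nested vectors): the outer index identifies the query, the
// inner vector holds one time per run; aggregation shrinks each inner
// vector to a single element while keeping the outer shape.
using NestedTimes = std::vector<std::vector<std::uint64_t>>;
```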
I think that, in this case, the median could provide more robustness than the mean, for example by suppressing atypical cases or noise (mostly related to the maximum values across the runs). As for the max aggregation, I don't know if it is really useful (maybe to capture the worst cases?), but it may ultimately not be representative; I only included it because its implementation required no additional effort. If it has no real usefulness, I think it should be removed. Regarding the methodology for printing data or a summary, I think it is useful to show the summary when extracting. In this case, if some of the metrics satisfy the user's needs, there is no need to run an external script (for that reason I think it is useful to print all defined metrics when no aggregation/transformation is specified). Also, given that the query times are printed to the output, they can simply be exported using redirection (e.g., `>`).
That's fair. My main concern is making it too complex. How about we always extract all queries (the user can process that data themselves) and always print all summaries? I really don't want to go the route of defining an aggregate function that only applies to one or the other. Just note that if you use stderr for summaries, we can't pipe it to another tool for transformation, because redirecting would capture the rest of the logs, so it would be purely informative.
The behavior in which all runs (together with all summaries) are extracted occurs when aggregation is set to `none`. However, another experiment could be: "I just want to know what happens when all values are the minimum." Although I can obtain this value from the summary output, if I specify aggregation by min, it makes sense that the summary and the output adapt to that scenario, so I don't need to implement a specific script to reprocess the output data (even though I understand such a script would be simple). This can be useful for quick experiments: if I want to understand the causes behind this value, I can quickly analyze the file that already contains all the minimum values. In any case, I understand that this may introduce unnecessary complexity from an SRP perspective, and an intermediate option would be to remove the aggregation of the extracted output and keep it only for the summary. What do you think @elshize, @JMMackenzie?
```diff
 auto usecs = run_with_timer<std::chrono::microseconds>([&]() {
-    uint64_t result = query_func(query, thresholds[idx]);
+    uint64_t result = query_func(query, thresholds[query_idx]);
     if (safe && result < k) {
```
@elshize, do you think it is worth defining another code path to avoid performing this check on each query run? I'm referring to the cost of evaluating it (which, in any case, is a "constant cost" added to all runs).
Not sure I understand what you mean. Do you mean the cost of evaluating `safe && result < k`? If so, this (a) should be negligible, and (b) is actually part of the query and should be included in the reported time.
Great! That's what I was referring to. The point is that this check wasn't present in the previous version of `--extract`, and I wasn't sure whether there was a specific reason for that.
I personally think it's unnecessary, but if you really want to aggregate only the summary, this needs to be explicitly named in a way that leaves no doubt as to what it does.
I personally think using
@elshize, got it!
Because we want to print multiple summaries (for different agg functions), we'll need to name it somehow to print in JSON:

```json
{"agg_per_query": "none", ...}
{"agg_per_query": "min", ...}
```

Not sure if there's a better name for that; I'm certainly open to suggestions. Summary is a different thing for me: the printed statistics are the summary, and if we always print it, then we don't need to label it, but we may need to use that term in code, docs, or CLI help.
Hi guys, ready with the changes. One thing that hadn't been taken into account is that more than one algorithm (query type) can be specified. Therefore, the changes now support specifying more than one output file (one for each query type specified). Let me know if this is OK or if any changes are needed.
Why not just have an "algorithm" column in the output file? I would rather avoid multiple output files. First, I would say it's no more convenient, if not less convenient, than having one; it's so easy to filter with your dataframe framework of choice, or whatever one uses for crunching data. Furthermore, now you have to worry about ensuring that the number of algorithms is the same as the number of output files, which is just a headache. I would simply print a column header (are we printing it now or not?) and then values. We can keep it TSV.

Regarding summaries, I think it's better to have something like this:

```json
"times": [
    {"query_aggregation": "none", "mean": 6499.02, "q50": 1630, "q90": 20539, "q95": 31282, "q99": 46491},
    {"query_aggregation": "min", "mean": 4257.8, "q50": 1111, "q90": 14786, "q95": 18986, "q99": 28139},
    {"query_aggregation": "mean", "mean": 6498.68, "q50": 1703, "q90": 21923, "q95": 30309, "q99": 40328},
    {"query_aggregation": "median", "mean": 6898.12, "q50": 1768, "q90": 22582, "q95": 33420, "q99": 46509},
    {"query_aggregation": "max", "mean": 8341.13, "q50": 2181, "q90": 27297, "q95": 39730, "q99": 51367}
]
```
```cpp
} else {
    std::sort(query_times.begin(), query_times.end());
    double avg =
    // Print JSON summary
```
We should avoid formatting JSON by hand; we already have a library for that in our deps (`#include <nlohmann/json.hpp>`), which allows you to define JSON similar to a map and then print it.
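For reference, a minimal sketch of that approach (field names and values borrowed from the example summary above):

```cpp
#include <iostream>

#include <nlohmann/json.hpp>

int main() {
    nlohmann::json summary;  // behaves like a map: operator[] inserts keys
    summary["query_aggregation"] = "min";
    summary["mean"] = 4257.8;
    summary["q50"] = 1111;
    std::cout << summary.dump() << '\n';   // compact, single-line output
    std::cout << summary.dump(2) << '\n';  // pretty-printed, 2-space indent
}
```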
I left another comment on the code, but I'll need to come back to this later; just letting you know I have not gone through all of the code yet.
@elshize, ready with the changes. One observation is that the JSON is now printed in an unordered way. Although newer versions of `nlohmann/json.hpp` include an `ordered_json` type that preserves insertion order, our current dependency version does not. Below is an example of the current JSON output:
It's a little unfortunate that we can't control the order, but on the other hand, I don't think it's that crucial, especially with pretty-printing. Also, not sure if the API guarantees it, but it looks like it's not so much unordered as lexicographically ordered. We might want to work on upgrading the dependency anyway, but I don't think it's necessary as part of this work. I think the JSON output you provided above is fine.

One other thing we have to consider is that we are now printing potentially multiple JSON objects, each spanning multiple lines, so it's no longer in JSONL format. This may limit what out-of-the-box tools one can use to parse the output. I typically use `jq`. Ultimately, I don't think this is a big issue; you can always work around it with a bit of scripting. That said, we could address this as well. I see two options off the top of my head. One is to have a flag that switches between pretty-printed and JSONL output. The other approach would be to print a single JSON and put all summaries in an array:

```json
{
    "summaries": [
        {
            "algorithm": "or",
            ...
        },
        {
            "algorithm": "and",
            ...
        }
    ]
}
```

But to be clear, I'm ok with leaving the output as is now.
elshize
left a comment
Leaving some more comments, but still haven't gone through the entire PR.
```diff
@@ -0,0 +1,82 @@
+#include "spdlog/spdlog.h"
```
Let's add the license comment at the top of the file:

```cpp
// Copyright 2025 PISA Developers
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
```
This is something we should be doing, even if we haven't done it for all older files yet. We should probably run a script for that or something... Not an issue for this PR though.
Solved for queries.cpp.
```diff
@@ -0,0 +1,82 @@
+#include "spdlog/spdlog.h"
+
+#include "index_types.hpp"
```
We try to order our includes as follows:
- C++ standard library
- Third party libraries
- PISA includes
Each group should be separated by a blank line (precisely as you did, just in a different order).
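So, using the includes from this file, the order would look something like this (the standard-library headers are placeholders for illustration):

```cpp
#include <chrono>   // C++ standard library first
#include <vector>

#include "spdlog/spdlog.h"   // then third-party libraries

#include "index_types.hpp"   // then PISA includes
```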
```cpp
using pisa::do_not_optimize_away;
using pisa::get_time_usecs;
```
Although this is not a big issue, I think we should use `using` sparingly, and only if it really improves readability. Here, we only use these functions several times and they don't elide much. I think there's value in being explicit by qualifying each use with its namespace:

```cpp
auto start = pisa::get_time_usecs();
```
Oops! perftest_decoding is an internal script that I'm using to test some experiments I'm working on. I'll delete it, but if it is useful, I can include it; I think it would be better in a separate PR, though, as it's outside the scope of this one ("Improve queries tool"). Alternatively, we could adjust the scope of this PR.
```cpp
// Puts pointer on next posting.
plist.next();

// Loads docid and freq (if required) of current posting. On the other hand,
```
I don't think "on the other hand" makes much sense here, and also I don't think we need to explain what do_not_optimize_away does, I think it's pretty self-explanatory, and we use it everywhere else without explanation.
If anything, we can add a doc comment to the function definition explaining what it does but we do not need to explain when we use it.
The reason for these comments is explained above (old internal script, sorry!), but if we decide to include it, I can make the changes.
```cpp
// the compiler may decide not to execute the following sentences because the
// result isn't used.
do_not_optimize_away(plist.docid());
if (decode_freqs) {
```
I think what you wanted is:

```cpp
if constexpr (decode_freqs) {
```

This will simply remove this branch from the version of the function with `decode_freqs = false`, so we won't be measuring the branch decision itself.
I don't expect this branch to affect the results in any noticeable way (after all, the loop itself has conditions as well and we're ok with it), but the only reason I see for passing `decode_freqs` as a template parameter is to evaluate it in a constexpr context. Otherwise it could simply be a function argument.
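For illustration, a hedged sketch of the template-parameter approach (the function name `decode_all` and the posting-list interface are assumptions, not the PR's code):

```cpp
#include <cstdint>
#include <utility>

template <typename T>
void do_not_optimize_away(T&&);  // assumed provided by PISA's benchmark utils

template <bool decode_freqs, typename PostingList>
void decode_all(PostingList& plist, std::uint64_t count) {
    for (std::uint64_t i = 0; i < count; ++i) {
        do_not_optimize_away(plist.docid());
        if constexpr (decode_freqs) {
            // Compiled out entirely when decode_freqs == false, so the
            // measured loop contains no runtime check for it.
            do_not_optimize_away(plist.freq());
        }
        plist.next();
    }
}
```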
```diff
-    Functor query_func,
+template <typename Fn>
+void extract_times(
+    Fn query_func,
```
I think we need to refactor this a little. It would be cleaner if we actually extract all times first, then summarize and/or print. Something like:

```cpp
extracted_times = extract_times(...);
summarize(extracted_times, ...);
if (should_print) {
    print_times(extracted_times);
}
```

`extracted_times` will likely have to be some kind of struct, or vector of structs, or something like that, because you'll need the times and the corresponding run count.
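A sketch of what the separated pieces might look like; all types and names here are hypothetical, and the summary assumes non-empty inputs:

```cpp
#include <numeric>
#include <ostream>
#include <string>
#include <vector>

// Hypothetical result type: times for one query across all runs.
struct QueryTimes {
    std::string qid;
    std::vector<double> run_usecs;
};

// Summarizing is pure: it consumes the extracted data and returns a value.
double mean_of_query_means(std::vector<QueryTimes> const& times) {
    double total = 0.0;
    for (auto const& qt: times) {
        total += std::accumulate(qt.run_usecs.begin(), qt.run_usecs.end(), 0.0)
            / qt.run_usecs.size();
    }
    return total / times.size();
}

// Printing is a separate concern; extraction never sees the stream.
void print_times(std::vector<QueryTimes> const& times, std::ostream& os) {
    for (auto const& qt: times) {
        os << qt.qid;
        for (auto t: qt.run_usecs) {
            os << '\t' << t;
        }
        os << '\n';
    }
}
```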
elshize
left a comment
Ok, finished going through it, left some comments.
A general note:
I would discourage the nesting-doll style where you keep passing slightly modified parameters down and each next function moves the entire logic slightly forward. It's usually much clearer if we break down our programs into sub-programs and stick to separation of concerns.
For example, extracting times should typically have nothing to do with printing or summarizing them, so the extracting function should not receive the output stream at all.
If we break things down, they are typically easier to reason about. There are many reasons for that, including: the functions take fewer parameters, we are forced to return some meaningful types, understanding a function in isolation is much easier than understanding it globally, etc. One particular thing we should strive for is to keep mutable state contained, and as much as we can, try to have pure functions doing the complex logic, so we can deterministically predict what happens based on parameters. Of course, benchmarks are not deterministic in the values they produce, but I'm talking about all the rest.
Note that this nested style of function is quite common in this code base, especially in legacy code, but we should fight it and break away from it as much as possible.
```cpp
stats_line()("type", index_type)("query", query_type)("avg", avg)("q50", q50)("q90", q90)(
    "q95", q95)("q99", q99);
// Save times per query (if required)
if (os != nullptr) {
```
As mentioned above, I'd like to isolate this from extracting the data itself.
Once that's done, it should look something like this:

```cpp
if (should_print) {
    auto os = open_file();
    // write results
}
```

We do away with a potentially-null pointer altogether.
| spdlog::info("K: {}", k); | ||
| std::ofstream output_file; | ||
| std::ostream* output = nullptr; | ||
| if (output_path) { |
Despite what I outlined in the other comment, I think it's not a bad idea to open the file first and then extract, so that we can fail quickly if there's an I/O issue when opening the stream.
That said, I would instead use an optional rather than a pointer. For example:

```cpp
std::optional<std::ofstream> out = [&output_path]() -> std::optional<std::ofstream> {
    if (output_path.has_value()) {
        auto out = std::ofstream(*output_path);
        // Do necessary checks...
        return std::move(out);
    }
    return std::nullopt;
}();
```

Or:

```cpp
std::optional<std::ofstream> out = std::nullopt;
if (output_path) {
    out = std::ofstream(*output_path);
    // Do necessary checks
}
```

The second one may be cleaner if you're inlining it here. The first one could be a better option if you decide to extract it to a separate function. I think it's ok to leave it here and use the second approach.
```cpp
output_file.open(*output_path);
if (!output_file.is_open()) {
    const auto err_msg = fmt::format("Failed to open output file: {}.", *output_path);
    spdlog::error(err_msg);
```
I think rather than printing this error log here, we should make sure that the `main()` function catches any exception and prints the error message.
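A hedged sketch of that pattern, using a function-try-block on `main()`; `run_queries` is a hypothetical stand-in for the tool's actual entry logic:

```cpp
#include <exception>

#include "spdlog/spdlog.h"

void run_queries(int argc, char** argv);  // hypothetical: the tool's logic

int main(int argc, char** argv) try {
    run_queries(argc, argv);
    return 0;
} catch (std::exception const& e) {
    // Any exception thrown below main (e.g., a failed file open) lands here,
    // so error reporting lives in exactly one place.
    spdlog::error("{}", e.what());
    return 1;
}
```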
| output_file << "algorithm\tqid"; | ||
| for (size_t i = 1; i <= runs; ++i) { | ||
| output_file << fmt::format("\tusec{}", i); | ||
| } | ||
| output_file << "\n"; | ||
|
|
||
| spdlog::info("Per-run query output will be saved to '{}'.", *output_path); |
Although we can do this when resolving the file stream, I think it's better to do it below:

```cpp
if (output) {
    spdlog::info("Per-run query output will be saved to '{}'.", *output_path);
    // print header
}
```
Thank you for your review @elshize, I'll work on that.
@elshize, I tested this with `jq`.
The thing is that if we use JSONL, then readability is worse because everything is crammed into one line. But then again, if someone requires pretty-printing, they can also just pipe it to `jq`. I'll leave it up to you; I'm ok with pretty-printed JSONs, JSONL, or a single JSON with a list, and I have no strong feelings towards any of these.
Key changes in this pull request:
--extractoption by--output, which now requires an explicit output file. Since more than one algorithm (query type) could be specified, the algorithm is now also printed in the TSV.op_perftest()behavior remains available when--outputis not specified.--runsoption to specify the number of runs to measure the query set (by default: 3). Note that this parameter excludes warmup.none,min,mean,medianandmaxas aggregation types.op_perftest(), so the set of queries is evaluated independently in each run.