[store] Unique reports _before_ storing #4152
Merged: bruntib merged 4 commits into Ericsson:master on Apr 2, 2024
We used to zip every result file in the result directory and send it to the server. This is tremendously wasteful, as many report files contained the very same reports (for instance, if the report originated from a header file).
In this patch, I unique all reports, dump them in a tmpfile, and store that to the server. For a clang-tidy --enable-all analysis on xerces, this reduced the size of the zipfile drastically:
BEFORE: Compressing report zip file done (1.6GiB / 62.2MiB).
AFTER: Compressing report zip file done (372.6MiB / 7.9MiB).
While this doesn't speed up CodeChecker parse or CodeChecker diff, it still speeds up the server quite a bit, both for storage and query times.
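The idea of collapsing identical reports before storing can be sketched roughly as follows. This is only an illustration, not the code from this patch: the `Report` class, its fields, and the helper names `unique_reports` and `dump_reports` are hypothetical, and JSON is assumed as the dump format purely for brevity.

```python
import json
import tempfile
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Report:
    # Hypothetical, simplified report; a real report carries more data.
    file_path: str
    line: int
    column: int
    checker_name: str
    message: str


def unique_reports(report_lists):
    """Keep the first occurrence of each report across all result files."""
    seen = set()
    unique = []
    for reports in report_lists:
        for report in reports:
            # Frozen dataclasses compare and hash by their field values,
            # so the same report found via two result files matches here.
            if report not in seen:
                seen.add(report)
                unique.append(report)
    return unique


def dump_reports(reports):
    """Write the reports to a temporary JSON file and return its path."""
    with tempfile.NamedTemporaryFile(
            mode="w", suffix=".json", delete=False,
            encoding="utf-8") as tmp:
        json.dump([asdict(report) for report in reports], tmp)
        return tmp.name
```

A report that originates from a header included by several translation units would appear once per result file in `report_lists`, but only once in the dumped file.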
Force-pushed from 4c3bb47 to 3ba2130
bruntib requested changes on Mar 12, 2024
tools/report-converter/codechecker_report_converter/report/hash.py
analyzer_result_files)):
analyzer_result_files))):
if idx % 10 == 0:
LOG.debug(f"Parsed {idx}/{len(analyzer_result_files)} files...")
Contributor
I think we should leave debug logs in the code only if they are valuable when we ask users for them during a debugging session.
Contributor, Author
I changed the output to print the name of each file as it is parsed. I found this helpful during development, and I don't think it's too verbose.
[DEBUG] store.py:458 assemble_zip() - Processing 3 report files ...
[DEBUG] store.py:412 parse_analyzer_result_files() - [0/3] Parsed '<path>/prune_paths.cpp_clangsa_e17b9269d5755586c9edd407de32f1df.plist' ...
[DEBUG] store.py:412 parse_analyzer_result_files() - [1/3] Parsed '<path>/prune_paths.cpp_clang-tidy_e17b9269d5755586c9edd407de32f1df.plist' ...
[DEBUG] store.py:412 parse_analyzer_result_files() - [2/3] Parsed '<path>/prune_paths.cpp_cppcheck_e17b9269d5755586c9edd407de32f1df.plist' ...
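For reference, per-file progress logging of this shape can be sketched as below. This is a hedged illustration rather than the code in store.py: the logger name, the function `parse_result_files`, and the `parse_one` callback are hypothetical.

```python
import logging

LOG = logging.getLogger("store-sketch")  # hypothetical logger name


def parse_result_files(paths, parse_one):
    """Parse each file with parse_one, logging progress at debug level."""
    total = len(paths)
    LOG.debug("Processing %d report files ...", total)
    results = []
    for idx, path in enumerate(paths):
        results.append(parse_one(path))
        # Zero-based index, matching the "[0/3]" style of the output above.
        LOG.debug("[%d/%d] Parsed '%s' ...", idx, total, path)
    return results
```

Passing the arguments to `LOG.debug` separately, instead of pre-formatting the string, means the message is only built when debug logging is actually enabled.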
bruntib reviewed on Mar 27, 2024
tools/report-converter/codechecker_report_converter/report/parser/plist.py
bruntib approved these changes on Apr 2, 2024
I also did a little cleanup on the debug logs: the one in the report converter spat out a line for literally every report, which feels a little unnecessary. I also added some logs to show that the parsing of result files is in progress, though that could be a little more verbose.
edit:
In addition, I needed to fix plist parsing at one point, and I also had a lot of trouble with AnalysisInfo not being set for each report. The problem was in one nasty loop.
Now, since we support storing multiple report dirs at once, each of those report dirs gets its own directory in the zipped file. However, you won't find a clear implementation of that in the form of "for each input dir, create an output dir". The os.path.dirname part of that loop does all that, and my initial implementation didn't set the dirname right. Since this manifested as errors in seemingly unrelated parts of the code, I was led on a month-long chase to find out that this was the issue.
Also, since we handle all input report dirs at once in a rather cryptic manner, I chose to just brute-force my implementation into the code instead of refactoring it into an easy-to-digest "for each dir in input dir, zip dir". That might be worth a follow-up refactoring patch.
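As a rough idea of what such a follow-up refactoring could look like, here is a minimal sketch of the "for each dir in input dir, zip dir" shape. It is not the existing implementation: the function name `zip_report_dirs` and the `reports/<index>` layout inside the archive are assumptions made for illustration.

```python
import os
import zipfile


def zip_report_dirs(report_dirs, zip_path):
    """Zip each input report dir under its own top-level dir in the archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for idx, report_dir in enumerate(report_dirs):
            # Hypothetical naming scheme: one numbered folder per input dir.
            out_dir = f"reports/{idx}"
            for root, _dirs, files in os.walk(report_dir):
                for name in sorted(files):
                    src = os.path.join(root, name)
                    rel = os.path.relpath(src, report_dir)
                    # Zip member names always use forward slashes.
                    arcname = out_dir + "/" + rel.replace(os.sep, "/")
                    archive.write(src, arcname)
```

With the output directory computed once per input directory, the mapping from input dir to archive dir is explicit rather than derived from each file's dirname.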