Preface

This is an informal review conducted by a lab member. To ensure maximal objectivity, the rOpenSci review template is used. This template also guarantees that this package follows the most up-to-date and strictest standards available in the R community.
The template is released under CC-BY-NC-SA and this review is therefore published under the same license.
The review was finished on 2021-07-27 and concerns the version 0.1.7.2 of scoringutils (commit de45fb7).
Package Review
Documentation
The package includes all the following forms of documentation:
☒ A statement of need clearly stating problems the software is designed to solve and its target audience in README
The scoringutils package provides a collection of metrics and proper scoring rules that make it simple to score forecasts against the true observed values.
Installation instructions: for the development version of package and any non-standard dependencies in README
Vignette(s) demonstrating major functionality that runs successfully locally
Multiple plots are impossible to read in the CRAN-rendered version of the vignette. The plot dimensions should be adjusted.
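If helpful, the default figure size can be raised globally in the vignette's setup chunk (the values below are just an example, not the package's current settings):

```r
# In the vignette's setup chunk: enlarge the default figure dimensions
# (9 x 6 inches is an arbitrary illustration)
knitr::opts_chunk$set(fig.width = 9, fig.height = 6)
```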
I would refrain from using dispensable non-base packages in the package, as they introduce extra cognitive load for users: in addition to learning how to use scoringutils, they now possibly have to understand how to use whichever non-base package you use.
I recommend using the fct() notation when talking about functions. It makes it easier to distinguish functions from other objects, and it enables auto-linking to the function documentation on the pkgdown website. (PR Use fct() to enable auto-linking on pkgdown #119)
Function Documentation: for all exported functions
Good job documenting the default argument values for many functions!
More functions should inherit documentation chunks to reduce the risk of errors / outdated docs when updating.
It seems strange that eval_forecasts_quantile() doesn't get the same treatment as, e.g., eval_forecasts_binary(), and some basic documentation.
There is some inconsistency in formatting of mentions such as 'only required if plot = TRUE'. Sometimes it reads 'only required if plot == TRUE', but not always. I recommend switching all these statements to the phrasing 'if plot = TRUE', which seems more common in the R community.
There is a minor issue with the equation rendering in the pkgdown website (e.g., https://epiforecasts.io/scoringutils/reference/abs_error.html). The solution is probably to pass both a LaTeX/mathjax and an ASCII version of the equation to \deqn{}.
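For reference, \deqn{} takes an optional second argument with a plain-text fallback that is used where LaTeX cannot be rendered. A sketch based on the absolute error (the exact wording of the package's docs may differ):

```r
#' @details
#' The absolute error is computed as
#' \deqn{\mathrm{abs\_error} = |\mathrm{true\_value} - \mathrm{prediction}|}{abs_error = |true_value - prediction|}
```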
The metrics argument should point to the list_of_avail_metrics() function so users know what their choices are.
There might be a typo in the documentation of the summarised argument in eval_forecasts() (R/eval_forecasts.R, lines 109 to 110). Do you mean summarise_by instead of group_by?
#' @param summarised Summarise arguments (i.e. take the mean per group
#' specified in group_by. Default is TRUE.
I do not really understand the documentation of the test_options argument in compare_two_models(). It reads: > list with options to pass down to compare_two_models
But we are already in the documentation of compare_two_models() and I don't see any further documentation of this argument or of the possible values.
I don’t understand the documentation of the rel_skill_metric argument in eval_forecasts(). It reads:
If equal to ‘auto’ (the default), then one of interval score, crps or brier score will be used where appropriate
What do you mean by 'where appropriate'? How do you choose which one of these metrics is used?
Aren't some metrics missing from the @details in eval_forecasts()? For example, where is the coverage?
Examples (that run successfully locally) for all exported functions
Some functions are missing examples:
compare_two_models()
pairwise_comparison_one_group()
quantile_to_long()
quantile_to_wide()
range_to_quantile()
quantile_to_range()
sample_to_range()
merge_pred_and_obs()
hist_PIT()
hist_PIT_quantile()
Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
There is no CONTRIBUTING guide. Adding one could signal more explicitly that external contributions are welcome.
☒ The link to the pkgdown website should be added in DESCRIPTION (alongside the URL to the GitHub repo). This helps users looking for documentation, but it is also necessary to enable automatic linking from other pkgdown websites (done in PR Add pkgdown website to DESCRIPTION #118).
Functionality
Installation: Installation succeeds as documented.
Functionality: Any functional claims of the software have been confirmed.
I did not verify that the package works for R versions as old as 2.10, but upon cursory glance, I could not find use of any of the "newer" base R functions (listed in https://github.com/r-lib/backports).
Performance: Any performance claims of the software have been confirmed.
Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
Regarding continuous integration, the package uses the current best practices in the community: GitHub actions with tests on various OS and R versions.
However, testing is the weakest point of this package. The coverage at the time of writing this review is at 48.95%, when a package of this type should aim for around 90% (excluding plotting functions, which are usually more difficult to integrate in tests). Additionally, this value is somewhat inflated by 'weak' tests that only check that the function produces an output but do not check the output itself. A good practice for a package of this type would be to compare its output to 'known true values' from the literature, ideally from the papers where these indices are defined.
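As a sketch of what such a test could look like (this assumes brier_score() takes true_values and predictions and returns the mean score; the actual signature and return shape should be checked against the package):

```r
library(testthat)

# Hand-computed reference: for observations c(0, 1) and predicted
# probabilities c(0.25, 0.75), the Brier score is
# mean((c(0.25, 0.75) - c(0, 1))^2) = 0.0625
test_that("brier_score() matches a hand-computed value", {
  expect_equal(
    brier_score(true_values = c(0, 1), predictions = c(0.25, 0.75)),
    0.0625
  )
})
```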
Packaging guidelines: The package conforms to the rOpenSci packaging guidelines
In terms of interface and what functionality is exposed to users, I think this package does exactly the right thing by providing access to both the low-level individual scores and to the higher-level eval_forecasts().
There are some inconsistencies in naming. I mostly noticed this while creating the pkgdown reference index (PR Create pkgdown reference index #113). With just the function/object names, it was difficult to infer what each function would return. I would recommend using the prefix example_ for all example data objects; it is present in all names currently, but at a different position in many of them. Similarly, it would be helpful to have the plotting functions start with plot_ or something of the sort.
It would be useful to tag each release on GitHub, as some users might rely on this to get informed of a new release. This is made easier with the usethis::use_github_release() function.

News items do not appear for some versions on https://epiforecasts.io/scoringutils/news/index.html. This is because the version number and the subsection have the same heading level. I submitted a fix in Fix heading levels in NEWS.md #116.
The README (Readme.md) and the CHANGELOG (NEWS.md) should be added to the package tarball so they are available with the package on CRAN. This is done by removing these files from .Rbuildignore (done in Remove NEWS.md and Readme.md from .Rbuildignore #117).
It should be easy to remove the dependency on forcats, and I would be inclined to remove it as it introduces an extra potential source of breakage for a very minimal gain (as opposed to the dependencies on data.table or ggplot2). But since all the recursive dependencies of forcats are also dependencies of ggplot2, I would understand if the lead developer decides it's not worth the trouble.
It is generally recommended to include a section ‘Similar projects’ in the README where you can highlight the differences and strengths of this tool against the existing alternatives. In this specific case, I would be interested in hearing about the differences with scoringRules which is one of the dependencies.
Estimated hours spent reviewing: 13h
Review Comments / Code Review
I noticed some occurrences of bool == TRUE, which doesn't really make sense and makes the code more difficult to read. If the object is already a logical, you can use it as-is in if. For example (R/bias.R, lines 86 to 90):

if (all.equal(as.vector(predictions), as.integer(predictions)) != TRUE) {

Some other occurrences: R/eval_forecasts_continuous_integer.R (line 52) and R/pit.R (line 157).
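This could be simplified with isTRUE(), which also handles the fact that all.equal() returns a character vector rather than FALSE when the objects differ (the stop() message below is illustrative):

```r
predictions <- c(1, 2, 3)  # example input

# isTRUE(x) is TRUE only if x is a single TRUE value, so it copes with
# all.equal() returning a character description of the differences
if (!isTRUE(all.equal(as.vector(predictions), as.integer(predictions)))) {
  stop("predictions must be integer-valued")  # illustrative message
}
```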
Good job on specifying the type of your NA (NA_real_, NA_integer_, etc.)!
I think there is some inconsistency in type / value checking before computation. For example, in brier_score(), there is a check that predictions takes values between 0 and 1, but this check is not present in, e.g., bias(). Would it make sense to have check_truth() and check_predictions() internal functions that you call each time?
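A sketch of what such a shared validator might look like (the function name, bounds, and messages are hypothetical):

```r
# Hypothetical internal validator shared by all scoring functions
check_predictions <- function(predictions, lower = -Inf, upper = Inf) {
  if (!is.numeric(predictions)) {
    stop("`predictions` must be numeric")
  }
  if (any(predictions < lower | predictions > upper, na.rm = TRUE)) {
    stop("`predictions` must lie between ", lower, " and ", upper)
  }
  invisible(predictions)
}

# brier_score() could then open with:
# check_predictions(predictions, lower = 0, upper = 1)
```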
Would it be worth printing a warning when the user requests ‘coverage’ in eval_forecasts_sample() instead of silently dropping it?
As mentioned in another discussion, there is some inconsistency in the use of data.table and modifying in place vs copying. Beyond the stylistic issue, this is a possible source of bugs so I’d recommend sticking to one or the other.
Some functions have a very large number of arguments, which makes them difficult to use. Research in software engineering tends to suggest that the number of arguments should not exceed ~7. As the function is complex, it may be difficult to reduce the number of arguments, but here are some possible ways:
drop the verbose argument. Either the diagnostic messages are unnecessary and should be dropped entirely, or they are useful and should be printed. If users really don't want to see messages/warnings, they can use the base functions suppressMessages() / suppressWarnings(). This could also be controlled by a global switch in options(), as usethis does.
plotting side effects could be removed. The primary goal of eval_forecasts() is to return a data.frame with the scores. Users that want the plot could call another function afterwards. This would allow the removal of the pit_plots argument.
remove the possibility of having either a single data or forecasts, truth_data and merge_by as inputs.

drop summarised and act as if summarised = TRUE if by != summarise_by (is this Check whether we can get rid of the summarised = TRUE argument? #106?)
I understand the wish to provide flexibility but since eval_forecasts() is meant to be a high-level wrapper / one-liner to compute everything, I believe it’s okay to provide a limited interface. Users that strive for more flexibility can always use the low-level individual scoring functions.
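As an illustration of the options() switch mentioned above (the option name scoringutils.quiet and the helper su_inform() are hypothetical):

```r
# Hypothetical options()-based verbosity switch, in the spirit of usethis
su_inform <- function(...) {
  if (!isTRUE(getOption("scoringutils.quiet", default = FALSE))) {
    message(...)
  }
}

# Internal code would call su_inform("dropping rows with NA values"),
# and users could silence all diagnostics once per session with:
options(scoringutils.quiet = TRUE)
```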
I think there is a misplaced closing parenthesis in R/pairwise-comparisons.R at line 165 (and the same issue appears at line 220).

If this is indeed a bug, its presence twice in the same file suggests this code portion should be refactored into a function (isn't that the purpose of delete_columns() already?). By the way, why is it > 1 in one case and > 0 in the other?
I cannot say for sure because, as mentioned previously, I don't understand the documentation of the test_options argument in compare_two_models(), but this selection of the first element ([1]) does not seem super robust (R/pairwise-comparisons.R, line 373).
There are some occurrences where loops (vapply()) are used when you could rely on vectorised functions / linear algebra for much faster computation and more readable code (fixed in Remove unnecessary vapply() #120): R/bias.R (lines 95 to 99 and 106 to 110) and R/pit.R (lines 169 to 173 and 193 to 197).
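As a generic illustration of the pattern (not the package's actual code):

```r
true_values <- c(1, 5, 3)
lower <- c(0, 2, 4)
upper <- c(2, 6, 8)

# Element-wise loop:
covered_loop <- vapply(
  seq_along(true_values),
  function(i) true_values[i] >= lower[i] && true_values[i] <= upper[i],
  logical(1)
)

# Vectorised equivalent: same result, faster and easier to read
covered_vec <- true_values >= lower & true_values <= upper

identical(covered_loop, covered_vec)  # TRUE
```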
In ggplot2 plots with the facet_wrap_or_grid argument, I would change the default value to c("facet_wrap", "facet_grid") and validate it at the start of the function. Currently, a very minor and inconspicuous typo such as "facet_warp" would make it silently switch to facet_grid, and it would be very difficult to notice the mistake.
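The standard idiom for this is match.arg(); a sketch (plot_foo() is a placeholder name, not a package function):

```r
plot_foo <- function(facet_wrap_or_grid = c("facet_wrap", "facet_grid")) {
  # match.arg() selects the first value by default and errors on any
  # value outside the allowed set, so a typo like "facet_warp" fails loudly
  facet_wrap_or_grid <- match.arg(facet_wrap_or_grid)
  facet_wrap_or_grid
}

plot_foo()  # returns "facet_wrap"
# plot_foo("facet_warp") would error: 'arg' should be one of the allowed values
```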
All functions in scoringRules_wrappers.R seem to have the same checks at the beginning. It would be less error-prone to refactor this.
This is not a robust way to get the value of an argument with defaults (R/utils.R, lines 208 and 220; R/utils_data_handling.R, lines 422 to 430).

Deprecated functions from utils_data_handling.R should be categorised as such in the pkgdown reference index.

There is no unit testing at tests/testthat/test-bias.R, line 48, since the call there is not a testthat function.
Conclusion

This is overall a solid package that could become a widely used tool in the forecasting sciences. I could not see any bugs in the code, and the performance looks very good on the examples I ran. The package interface is clever and can surely prove useful to a large array of users thanks to the two levels of functions (low-level scoring functions vs the all-in-one eval_forecasts()).
Two points could slow down or reduce adoption, and these should be fixed for this package to reach its full potential and attract as many users as possible:
the package remains complex to use. This complexity is in part inherent to the task, but it could nonetheless be reduced by following best practices in software engineering, such as reducing the number of parameters and adopting a consistent naming scheme.
there is no strong evidence that this package correctly implements the computed metrics. This is especially important for fields that can have a policy impact. Test coverage should be increased, and comparisons to computations via other tools / methods should be added.