As already mentioned in multiple issues and over email/Slack, we need automated tests that can track performance regressions. This issue is meant to define the scope of such continuous benchmarking (CB).
A related, useful project is planned in conbench. Once it is working, I think we should use it. Unfortunately, that does not seem likely to happen anytime soon, or even in the more distant future.
Anyway, keeping our scope minimal should make it easier to eventually move to conbench later on.
Other related work: my old macrobenchmarking project, and the recent draft PR #4517.
## Scope

Dimensions by which we will track timings:
- environment (allows looking up the hardware configuration)
- R version
- git sha of data.table (allows looking up date and version)
- benchmark script (probably fixed to `benchmark.Rraw`)
- query
- version of a query (in case we modify an existing query for some reason)
- description
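As a sketch, one timing record could carry the dimensions above as columns (all column names and values here are illustrative, not a decided schema):

```r
library(data.table)

# One hypothetical timing record; columns mirror the dimensions listed above
timings = data.table(
  environment   = "bench-machine-01",       # key to look up hardware configuration
  r_version     = as.character(getRversion()),
  git_sha       = "0123abc",                # data.table commit; date/version derivable
  script        = "benchmark.Rraw",
  query         = "DT[, uniqueN(a), by=b]",
  query_version = 1L,                       # bumped if the query is ever modified
  description   = "grouped uniqueN",
  elapsed_s     = NA_real_                  # the measured timing
)
```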
Dimensions that I propose not to include in scope for now:

- `datatable.optimize` option
## Challenges
### Store timings
In the current infrastructure we do not have any process that appends artifacts (timings, in the context of CB). Each CB run has to store its results somewhere so they can be re-used later on.
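A minimal way to append each run's results to a shared artifact could look like this (the file name, location, and helper are assumptions, not a decided design):

```r
library(data.table)

# Append this run's timings to a CSV artifact kept between CB runs;
# the header row is written only when the file is created for the first time
store_timings = function(timings, path = "timings.csv") {
  fwrite(timings, path, append = file.exists(path))
}
```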
### Signalling a regression
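One possible approach (the threshold, helper name, and comparison rule are assumptions for illustration only) is to compare a new timing against the history stored for the same environment, query, and query version:

```r
# Flag a regression when the new timing exceeds the historical median
# by more than `tolerance` (e.g. 1.5 means 50% slower); names are illustrative
is_regression = function(new_elapsed, past_elapsed, tolerance = 1.5) {
  length(past_elapsed) > 0L && new_elapsed > tolerance * median(past_elapsed)
}
```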
### Environment
To reduce the number of false regression signals we need to use private, dedicated infrastructure.
Having a dedicated machine may not be feasible, so we need a mechanism for signalling to Jenkins (or another orchestration process) that a particular machine is in use in exclusive mode.
### Pipeline
In the most likely case of not having a dedicated machine, CB may end up being queued for a long while (up to multiple days). Therefore it makes sense to have it in a separate pipeline rather than in our data.table GLCI. Such a CB pipeline could be scheduled to run daily or weekly instead of running on each commit.
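For illustration, such a separate pipeline could be driven by a GitLab scheduled pipeline, with the benchmark job gated so it only runs on schedule (the job name and script path are hypothetical):

```yaml
# .gitlab-ci.yml fragment for a CB pipeline triggered by a GitLab schedule
benchmark:
  script: ./run-benchmarks.sh   # hypothetical entry point
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
```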
### Versioning

Should CB live in the data.table project, or in a separate project? If inside data.table, it could consist of `inst/tests/benchmark.Rraw`, a `benchmark()` function meant to be used like `test()`, a `benchmark.data.table()` to be used like `test.data.table()`, and `ci/`.

## Example test cases

- `[[` on a list column by group: #4646 (`[[` by group takes forever (24 hours +) with v1.13.0 vs 4 seconds with v1.12.8)
- `DT[10L]`, `DT[, 3L]`: #3735 (selecting from data.table by row is very slow)
- `.SD` for many columns: #3797 (add timing test for many `.SD` cols)
- `setDT` in a loop: #4476 (`setDT` could be much simpler)
- `DT[, uniqueN(a), by=b]`, should stress the new throttle feature: #4484 (throttle threads for iterated small data tasks)
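To illustrate, a case from the list above might be written with a hypothetical `benchmark()` helper, loosely mirroring how `test()` is used in `tests.Rraw` (the signature and numbering scheme are assumptions only):

```r
library(data.table)

# Hypothetical benchmark() in the spirit of test(): run the expression once,
# record its elapsed time together with an id and a description
benchmark = function(num, expr, description = "") {
  elapsed = system.time(eval(expr, envir = parent.frame()))[["elapsed"]]
  data.table(num = num, description = description, elapsed_s = elapsed)
}

DT = data.table(a = sample(10L, 1e6L, TRUE), b = sample(100L, 1e6L, TRUE))
benchmark(1.1, quote(DT[, uniqueN(a), by = b]),
          "grouped uniqueN; stresses thread throttle (#4484)")
```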