[BEAM-11791] Adding a microbenchmark for TestStream #14046

pabloem · 2021-02-23T21:33:52Z

Adding a microbenchmark for running a simple pipeline with TestStream and a few groupings in parallel.

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

Choose reviewer(s) and mention them in a comment (R: @username).
Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
Update CHANGES.md with noteworthy changes.
If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang	Dataflow	Samza	Twister2
Go	---	---	---
Java
Python		---	---
XLang		---	---

Pre-Commit Tests Status (on master branch)

---	Java	Python	Go	Website	Whitespace	Typescript
Non-portable
Portable	---		---	---	---	---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

See CI.md for more information about GitHub Actions CI.

pabloem · 2021-02-23T21:51:10Z

Run Python Load Tests FnApiRunner Microbenchmark

codecov · 2021-02-23T21:57:34Z

Codecov Report

Merging #14046 (5f8a15f) into master (aaad864) will decrease coverage by 0.00%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #14046      +/-   ##
==========================================
- Coverage   82.98%   82.97%   -0.01%     
==========================================
  Files         469      469              
  Lines       58298    58300       +2     
==========================================
- Hits        48379    48375       -4     
- Misses       9919     9925       +6

Impacted Files	Coverage Δ
...amples/snippets/transforms/elementwise/__init__.py
sdks/python/apache_beam/examples/snippets/util.py
sdks/python/apache_beam/io/filebasedsource.py
...am/portability/api/standard_window_fns_pb2_urns.py
...beam/testing/load_tests/load_test_metrics_utils.py
...he_beam/portability/api/external_transforms_pb2.py
sdks/python/apache_beam/coders/typecoders.py
sdks/python/apache_beam/examples/sql_taxi.py
...ks/python/apache_beam/runners/worker/statecache.py
..._beam/testing/benchmarks/nexmark/queries/query0.py
... and 928 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aaad864...5f8a15f. Read the comment docs.

pabloem · 2021-02-23T22:11:06Z

Successful run: https://ci-beam.apache.org/job/beam_Python_LoadTests_FnApiRunner_Microbenchmark_PR/5/console

pabloem · 2021-02-23T22:22:39Z

Run Python PreCommit

rohdesamuel · 2021-02-24T18:39:47Z

sdks/python/apache_beam/tools/teststream_microbenchmark.py

+def run_benchmark(
+    starting_point=1, num_runs=10, num_elements_step=300, verbose=True):
+  suite = [
+      utils.LinearRegressionBenchmarkConfig(


I'm confused, why is this running a linear regression? Shouldn't it just be a windowed average over the last N runs?

This runs a pipeline with 1 element, then 300, then 600, and so on until 3000. We use the linear regression to measure per-element overhead, and base overhead of the runner, and report it.

Still feels very weird to me to be using a regression. Usually those are meant to build models/feature extraction, but this seems the benchmark only needs to report some windowed average. Anyway, that's a conversation for another time.

rohdesamuel · 2021-02-24T18:41:02Z

sdks/python/apache_beam/tools/teststream_microbenchmark.py

+3) When the timer fires, change the key and output all the elements downstream
+
+This executes the same codepaths that are run on the Fn API (and Dataflow)
+workers, but is generally easier to run (locally) and more stable..


Remove extraneous period

rohdesamuel · 2021-02-24T18:49:38Z

sdks/python/apache_beam/tools/teststream_microbenchmark.py

+1) Put all the PCollection elements in state
+2) Set a timer for the future
+3) When the timer fires, change the key and output all the elements downstream


This comment is somewhat misleading. Technically, windows are state & timers but I don't know if many people will make that connection reading this code. Can you please rewrite what this microbenchmark does in terms of windowing?

Thanks. Great catch. I've rephrased this. LMK if htat looks better.

rohdesamuel

lgtm

pabloem · 2021-02-24T20:07:29Z

thanks Sam!

…gger * github/master: (123 commits) [BEAM-11899] Bump commons-pool to 2.8.1 and bump commons-dbcp to 2.8.0, Because there is a library dependency Update pillars.yaml (apache#14142) [BEAM-10632] Checkerframework nullness cleanups (apache#14107) [BEAM-11213] Instantiate SparkListenerApplicationStart in a Spark 3 compatible way Fix typos for excluding testMergingCustomWindowsWithoutCustomWindowTypes Specify the time resolution for TestStreamPayload. [BEAM-10961] Enable strict depdency checking for sdks/java/extensions/euphoria (second attempt) [BEAM-11848] Store Docker images in a variable for consistency. Splitting old Go Precommit and new ULR integration test precommit. Moving runner imports out of ptest. Add the TO_STRING capability to Java and Python [BEAM-11848] Fix Docker images list. jdbc python supported Dataflow runner (apache#13960) Adding a warning to use multi-workers on FnApiRunner Fix legend for Python Directrunner microbenchmarks [BEAM-11740] Estimate PCollection byte size [BEAM-10961] enable strict dependency checking for sdks/java/extensions/zetasketch (apache#14093) Map Dataflow JOB_STATE_CANCELLING to Beam RUNNING state [BEAM-11833] Fix reported watermark after restriction split in UnboundedSourceAsSDFRestrictionTracker [BEAM-10761] add reference to BEAM-11761 [BEAM-10961] enable strict dependency checking for flink/job-server Exclude MapState example integration tests from Dataflow runner v2 suite Remove InvalidWindows from Java SDK, instead track "already merged" bit Fix checkstyle in watermark latency benchmark Fix compile breakage in WindmillStateInternals Improve test, error on ALREADY_MERGED. [BEAM-10961] Strict dependency checking for sdks/java/io/gcp (apache#13791) Initial watermark latency benchmark Attempting improvements on DirectRunner Python dash [BEAM-10961] enable strict dependency checking for sdks/java/extensions/google-cloud-platform-core (apache#14084) Merge pull request apache#13802: [BEAM-1474]. Adding MapState and SetState support for the Dataflow runner Remove some false positives Remove nullness warning suppression [BEAM-11861] Add methods to explicitly provide coder for ParquetIO's Parse and ParseFiles (apache#14078) [BEAM-11531] Use pandas 1.2 for python>=3.7 (apache#14099) [BEAM-10961] add reference to BEAM-11761 [BEAM-10961] add explicit compile for auto_value_annotations in sdks/extensions/ml/build.gradle Attempting improvements on DirectRunner Python dash Recognize JOB_STATE_PENDING from Dataflow and map to RUNNING never run checkerframework on tests Puts more expensive BQ empty table check to the right of the 'and' condition (apache#14094) Use the windowing strategy of the input, not output, PCollection of GBK. Do not stage dataflow worker jar when use runner_v2. [BEAM-11870] Re-raise underlying exception for InvocationTargetException (apache#14098) [BEAM-11778] Create a wrapper for ZetaSQL catalog and refactor accordingly. (apache#13934) [BEAM-9378] Add ignored tests which fail in various ways when querying nested structures (apache#14077) Merge Fn API and runner v2 configurations for DataflowRunner Fix up! formatting Add validate runner test for testing custom merging windows fn without custom window types Revert "Revert "[BEAM-2914] Add portable merging window support to Python. (apache#12995)"" [BEAM-10961] fix stray reordering of lines [BEAM-10961] enable strict dependency checking for sdks/java/extensions/sorter [BEAM-10961] enable strict dependency checking for sdks/java/extensions/sketching [BEAM-10961] enable strict dependency checking for sdks/java/extensions/schemaio-expansion-service [BEAM-10961] enable strict dependency checking for sdks/java/extensions/protobuf [BEAM-10961] enable strict dependency checking for sdks/java/extensions/ml [BEAM-10961] enable strict dependency checking for sdks/java/extensions/kyro [BEAM-10961] enable strict dependency checking for sdks/java/extensions/join-library [BEAM-10961] undo line moves (originally intended for alphabeticization) [BEAM-10961] enable strict dependency checking for sdks/java/extensions/jackson [BEAM-10961] Enable strict dependency checking on sdks/java/extensions/sql (apache#13830) [BEAM-10961] enable strict dependency checking for sdks/java/extensions/euphoria [BEAM-10961] enable strict dependency checking for sdks/java/io/parquet (apache#14062) [BEAM-10961] enable strict dependency checking for sdks/java/io/thrift (apache#14066) Refactor ZetaSqlDialectSpecTest and add some passing tests. (apache#14080) [BEAM-11864] Use objects.equals instead of raw comparison [BEAM-11707] Change WindmillStateCache cache invalidation to be based upon reference invalidation instead of expensive set management. Reduce operations of shared cache by caching per-key object sets locally and flushing as groups to shared cache. Remove byte tracking which could be racy based upon background evictions in favor of just iterating for rendering the status page. This also lets us capture more stats. [BEAM-11730] Reduce context switching overhead for appliance reads by issuing reads directly from calling threads in the case that there is no reads being queued. Fix preview Show string from Dataflow service when job terminates in unrecognized state Log a warning when Dataflow returns an unrecognized state Merge pull request apache#14033 from [BEAM-11408] Integrate Python BigQuery sink with GroupIntoBatches Remove SYNCHRONIZED_PROCESSING_TIME from model proto Remove use of model SYNCHRONIZED_PROCESSING_TIME Merge redundant model feature columns in capability matrix Remove MapReduce runner from capability matrix, because it is on a branch and unreleased Remove JStorm runner from capability matrix, because it is on a branch and unreleased Remove retractions from capability matrix, because they do not exist yet Remove metadata-driven triggers from capability matrix, because they do not exist [BEAM-10937] Add Tour of Beam page (apache#13747) [BEAM-11344] Apply "Become a Committer" changes from Website Revamp (apache#14036) Merge pull request apache#14046 from [BEAM-11791] Adding a microbenchmark for TestStream Returning successful writes in FhirIO.Write.Result (apache#14034) Fixup [BEAM-10961] enable strict dependency checking for sdks/java/io/file-based-io-tests (apache#14052) [BEAM-10961] enable strict dependency checking for sdks/java/io/contextualtextio (apache#14049) [BEAM-10961] enable strict dependency checking for sdks/java/io/kinesis (apache#14058) [BEAM-10961] enable strict dependency checking for sdks/java/io/bigquery-io-perf-tests (apache#14048) [BEAM-10961] enable strict dependency checking for sdks/java/io/elasticsearch (apache#14050) [BEAM-10961] enable strict dependency checking for sdks/java/io/expansion-service (apache#14051) [BEAM-10961] enable strict dependency checking for sdks/java/io/jdbc (apache#14055) [BEAM-10961] enable strict dependency checking for sdks/java/io/jms (apache#14056) [BEAM-10961] enable strict dependency checking for sdks/java/io/kafka (apache#14057) [BEAM-10961] enable strict dependency checking for sdks/java/io/hcatalog (apache#14053) [BEAM-11859] Fixed bug in python S3 IO [BEAM-10114] Fix PerSubscriptionPartitionSdf to not rely on the presence of BundleFinalizer [BEAM-10114] Fix PerSubscriptionPartitionSdf to not rely on the presence of BundleFinalizer [BEAM-10961] fix spacing [BEAM-10961] enable strict dependency checking for sdks/java/io/xml [BEAM-10961] enable strict dependency checking for sdks/java/io/tika ...

[BEAM-11791] Adding a microbenchmark for TestStream

7a6b6bb

pabloem force-pushed the teststream-ubench branch from d0c0cf4 to 7a6b6bb Compare February 23, 2021 21:40

fixup

353e5db

rohdesamuel suggested changes Feb 24, 2021

View reviewed changes

addressing comments

5f8a15f

rohdesamuel approved these changes Feb 24, 2021

View reviewed changes

pabloem merged commit f355631 into apache:master Feb 24, 2021

pabloem deleted the teststream-ubench branch February 24, 2021 20:07

[BEAM-11791] Adding a microbenchmark for TestStream #14046

[BEAM-11791] Adding a microbenchmark for TestStream #14046

Uh oh!

Conversation

pabloem commented Feb 23, 2021

Post-Commit Tests Status (on master branch)

Pre-Commit Tests Status (on master branch)

GitHub Actions Tests Status (on master branch)

Uh oh!

pabloem commented Feb 23, 2021

Uh oh!

codecov bot commented Feb 23, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

pabloem commented Feb 23, 2021

Uh oh!

pabloem commented Feb 23, 2021

Uh oh!

rohdesamuel Feb 24, 2021

Choose a reason for hiding this comment

Uh oh!

pabloem Feb 24, 2021

Choose a reason for hiding this comment

Uh oh!

rohdesamuel Feb 24, 2021

Choose a reason for hiding this comment

Uh oh!

rohdesamuel Feb 24, 2021

Choose a reason for hiding this comment

Uh oh!

pabloem Feb 24, 2021

Choose a reason for hiding this comment

Uh oh!

rohdesamuel Feb 24, 2021

Choose a reason for hiding this comment

Uh oh!

pabloem Feb 24, 2021

Choose a reason for hiding this comment

Uh oh!

rohdesamuel left a comment

Choose a reason for hiding this comment

Uh oh!

pabloem commented Feb 24, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov bot commented Feb 23, 2021 •

edited

Loading