Add coders microbenchmark. #5565
Conversation
Force-pushed 2087bef to 9b237fd
cc: @aaltay, @robertwb, @RobbeSneyders
    # TODO(BEAM-4441): Pick coders using type hints, for example:
    # tuple_coder = typecoders.registry.get_coder(typehints.Tuple[int])

    benchmarks.append({
There seems to be a huge amount of redundancy here, perhaps create a coder_benchmark function that takes a (name, generate_fn, coder) and produces this dict.
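For illustration, such a factory might look like the following sketch; the dict keys and the explicit input_size parameter are assumptions based on the snippets quoted in this review, and encode_and_decode / generate_list_of_ints are the helpers from this PR:

    def coder_benchmark(name, generate_fn, coder, input_size):
      # Build the benchmark dict that was previously written out by hand
      # for each coder/type combination.
      return {
          "name": name,
          "input_size": input_size,
          "input_generator": lambda: generate_fn(input_size),
          "op_fn": lambda input_: encode_and_decode(coder, input_),
      }

    benchmarks.append(coder_benchmark(
        "List[int], FastPrimitiveCoder", generate_list_of_ints,
        coders.FastPrimitivesCoder(), input_size))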
| "name": "List[int], FastPrimitiveCoder", | ||
| "input_generator": lambda: generate_list_of_ints(input_size), | ||
| "op_fn": lambda input_: encode_and_decode( | ||
| coders.FastPrimitivesCoder(), input_), |
There is possibly a huge amount of overhead here that's not about actually doing the encoding. (E.g. creating the Coder, creating the impls, calling several Python functions, etc.)
    def encode_and_decode(coder, data):
      _ = coder.decode(coder.encode(data))
Typically encode() and decode() are not called individually. It would probably be better to encode a list of 100 or 1000 elements with the ListCoder to cut out the overheads.
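A minimal sketch of that batching idea, using apache_beam.coders.IterableCoder (which later revisions in this thread adopt); the batch size of 1000 is illustrative:

    from apache_beam import coders

    def encode_and_decode_batch(element_coder, elements):
      # One encode()/decode() call covers the whole batch, so per-call Python
      # overhead is amortized over many elements.
      batch_coder = coders.IterableCoder(element_coder)
      _ = batch_coder.decode(batch_coder.encode(elements))

    encode_and_decode_batch(coders.FastPrimitivesCoder(), list(range(1000)))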
    on a generated input, collect and print its execution times.
    Args:
      benchmarks: A list of dictionaries that describe benchmarks.
Consider using a named tuple?
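For example, along the lines of the BenchmarkConfig tuple that shows up later in this PR (the field names here mirror the later snippets):

    import collections

    # Keeps per-benchmark settings together without stringly-typed dict lookups.
    BenchmarkConfig = collections.namedtuple(
        'BenchmarkConfig', ['benchmark', 'size', 'num_runs'])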
Perhaps it's being too functional, but you could also let a benchmark be any callable that takes an input size and returns a callable. Its name could be getattr(benchmark, 'name', str(benchmark)), and the input size would be provided externally.
One could implement benchmarks, if desired, in OO style as

    class MyBenchmark(object):
      def __init__(self, size):
        ...  # whatever initialization

      def __call__(self):
        ...  # run the code in question

This would make it easy to write

    def run_benchmarks(benchmarks, sizes):
      # benchmarks: a list of benchmark classes/callables; sizes: a list of sizes.
      ...
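For illustration, a fuller version of that sketch; the timeit-based timing, the num_runs parameter, and the "name, size" result key are assumptions, not part of the proposal or of this PR:

    import timeit

    def run_benchmarks(benchmark_factories, sizes, num_runs=10):
      # Returns {benchmark name + size: [execution times]}.
      results = {}
      for factory in benchmark_factories:
        name = getattr(factory, 'name', str(factory))
        for size in sizes:
          times = []
          for _ in range(num_runs):
            benchmark = factory(size)  # per-run setup for this (factory, size) pair
            times.append(timeit.timeit(benchmark, number=1))
          results["%s, %d" % (name, size)] = times
      return results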
Thanks for the feedback, @robertwb !
In this case it sounds like we'd have to create a new instance of MyBenchmark for every iteration. Perhaps it would be better to just go OO style as in
    class Benchmark(object):
      def __init__(self):
        ...  # whatever one-time initialization, e.g. instantiate coders and create an impl

      def run(self):
        # Run the code in question.
        raise NotImplementedError()

      def generate_next_input(self, size):
        raise NotImplementedError()

      def __str__(self):
        # Returns the name of this Benchmark; will be used as a key.
        raise NotImplementedError()

    def run_benchmarks(benchmark_configs):
      # benchmark_configs: a list of (benchmark object, size, number_of_iterations) tuples.
      # Returns a dict {name: [execution times]}.
      ...
This way it seems easier to separate one-time initialization from input generation. Reusing object state may help reduce the overheads related to benchmark execution.
What do you think?
    # runs.
    gc.collect()
    time_cost = run(b)
    cost_series[b["name"]].append(time_cost / b["input_size"])
If we retain these separately, we can do linear regression to sort out the overhead vs. the difference due to size.
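As a sketch of that idea (assuming the raw per-run timings are kept together with the input sizes they were measured at):

    import numpy

    def fit_cost_model(sizes, timings):
      # Fit time ~= per_element_cost * size + fixed_overhead, separating the
      # constant setup overhead from the cost that scales with input size.
      per_element_cost, fixed_overhead = numpy.polyfit(sizes, timings, deg=1)
      return per_element_cost, fixed_overhead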
robertwb replied by email:

> In this case it sounds like we'd have to create a new instance of MyBenchmark for every iteration.

Correct. The class represents a set of possible benchmarks, the instance a concrete one that can be executed.

> Perhaps it would be better to just go OO style as in [the Benchmark/run_benchmarks sketch above] ... What do you think?

Hmm... I would say in this case the complexity of splitting up initialization vs. input generation (vs. just calling all of it "initialization") isn't going to be a big savings. More importantly, however, Benchmarks now become mutable objects, and a particular configuration is an instance after a certain method has been called with a particular parameter, rather than just an instance.
A meta-comment: this seems like something that could be bike-shedded a lot. I wonder if it'd be better to focus on making the ideal framework for benchmarking coders and generalize later if that becomes useful.
Force-pushed 9b237fd to f0e2e30
Thanks @robertwb, PTAL.
robertwb left a comment:
Thanks. Some more comments on the coders-specific part.
    time_cost = run(benchmark_config.benchmark, size)
    cost_series[name].append(time_cost)
    if verbose:
      avg_cost = time_cost/size
Nit: My eyes jump from "avg_cost" to "median_cost" below, which are quite different. Maybe call it "scaled_cost"?
Thanks, renamed the variables; they should be more mnemonic now.
| """ | ||
|
|
||
| def get_name(benchmark_config): | ||
| return getattr(benchmark_config.benchmark, '__name__', |
Maybe just take benchmark in, rather than benchmark_config?
    cost_series = collections.defaultdict(list)
    for benchmark_config in benchmark_suite:
      name = get_name(benchmark_config)
Should name incorporate the size?
Good idea, thanks.
        generate_windowed_value_list),
    ]

    BenchmarkConfig = collections.namedtuple(
Move this to tools/utils.py (or remove).
Good idea, done.
    size: int, a size of the input. Aggregated per-element metrics
      are counted based on the size of the input.
    num_runs: int, number of times to run each benchmark.
The code indicates this is part of the tuple.
    def generate_list_of_ints(input_size, lower_bound=0, upper_bound=sys.maxsize):
      values = []
List comprehension?
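For example (assuming the original loop fills the list with random integers, which isn't visible in this excerpt):

    import random
    import sys

    def generate_list_of_ints(input_size, lower_bound=0, upper_bound=sys.maxsize):
      # Build the sample input in a single expression instead of an append loop.
      return [random.randint(lower_bound, upper_bound) for _ in range(input_size)]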
Done, thanks
    def generate_dict_int_int(input_size):
      sample_list = generate_list_of_ints(input_size)
return {val: val for val in generate_list_of_ints(input_size)}
    def generate_dict_str_int(input_size):
      sample_list = generate_list_of_ints(input_size)
      return {generate_string(100): val for val in sample_list}
same
| "List[int], FastPrimitiveCoder", coders.FastPrimitivesCoder(), | ||
| generate_list_of_ints), | ||
| coder_benchmark_factory( | ||
| "Tuple[int, int], FastPrimitiveCoder", coders.FastPrimitivesCoder(), |
This is Tuple[int, ...], not 2-tuples of ints.
Fixed
        self.data_ = generate_fn(size)

      def call(self):
        _ = self.coder_.decode(self.coder_.encode(self.data_))
We care more about the speed of encoding a list of 10000 small dicts than a single dict of 10000 entries. The former better approximates how the code will get invoked on PCollections.
E.g. you could have
    def coder_benchmark(name, coder, *sample_elements):
      def __init__(self, size):
        self._coder = IterableCoder(coder)
        self._elements = list(itertools.islice(itertools.cycle(sample_elements), size))
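One way to read that sketch as a complete class factory (a hedged interpretation; the final code in this PR may differ):

    import itertools
    from apache_beam import coders

    def coder_benchmark(name, coder, *sample_elements):
      # Produces a benchmark class whose instances encode/decode a list of
      # `size` small elements with IterableCoder, approximating how coders
      # are invoked on PCollections.
      class _CoderBenchmark(object):
        def __init__(self, size):
          self._coder = coders.IterableCoder(coder)
          self._elements = list(
              itertools.islice(itertools.cycle(sample_elements), size))

        def __call__(self):
          _ = self._coder.decode(self._coder.encode(self._elements))

      _CoderBenchmark.name = name
      return _CoderBenchmark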
Changed. Let me know if that captures your idea. Currently benchmarks use a list of 1000 collections of size 50. The suite takes around 2 min to complete on my machine. Dict[str, int] and List of WindowedValues benchmarks take 10x longer to finish than others. With 10000 collections, the execution time might be a little longer than desirable for a quick feedback loop, but we can increase it if necessary.
I ran the benchmark based on this framework in master...tvalentyn:utils_futurization_benchmark a few times and noticed a difference of up to 12% between runs of the same code. Is there something we should do to run the benchmarks in isolation?
I did not run the benchmark in isolation, just used my machine + virtual environment. I sometimes also observed differences in performance between iterations. Increasing the number of iterations usually helps make the median more stable; could you try that?
robertwb left a comment:
Nice, thanks. Just a couple more thoughts.
    cost_series[name].append(time_cost)
    if verbose:
      avg_cost = time_cost/size
      per_element_cost = time_cost/size
Nit: spaces around division operator.
    def generate_dict_str_int(input_size):
      sample_list = generate_list_of_ints(input_size)
      return {generate_string(100): val for val in sample_list}

    def dict_str_int(size):
It seems that here (and elsewhere) size is always equal to NUM_ENTRIES_PER_COLLECTION. It seems important to vary this parameter; e.g., if I were to list the things I wanted to benchmark, it would be something like:
small ints
large ints
small strings
large strings
small dicts
large dicts
small lists/tuples (2-tuples are particularly common)
large lists/tuples
windowed values with one window
windowed values with more than one window
...
On the other hand, dict_int_int vs dict_str_str probably isn't going to tell us much about the performance of dict.
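For illustration, sample elements along those lines (the concrete values are invented here; the benchmark names in the sample output further down suggest the PR settled on a similar small/large split, with windowed-value samples built separately):

    # Hypothetical sample elements covering small/large variants of each type.
    small_int = 10
    large_int = 10**100
    small_string = 'a' * 4
    large_string = 'a' * 1000
    small_list = list(range(10))
    large_list = list(range(1000))
    small_tuple = (1, 2)  # 2-tuples are particularly common
    large_tuple = tuple(range(1000))
    small_dict = {i: i for i in range(5)}
    large_dict = {i: i for i in range(1000)}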
Force-pushed 0ac77e9 to def093f
Thanks @robertwb, PTAL at def093f; I squashed the previous commits into one.
Force-pushed def093f to dd078a7
robertwb left a comment:
Thanks. Just two small comments, but LGTM from me (and you can find someone else to merge if I'm not around).
A follow-up would be the ability to easily save the results of a run and then compare it with a subsequent run--perhaps worth a TODO (but not more) at this point.
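A minimal sketch of that follow-up (file format and helper names are hypothetical, not part of this PR):

    import json
    import numpy

    def save_results(path, cost_series):
      # cost_series: dict of benchmark name -> list of execution times.
      with open(path, 'w') as f:
        json.dump(cost_series, f)

    def compare_results(baseline_path, current_path):
      # Print the percentage change in median cost for each benchmark present
      # in both the baseline run and the current run.
      with open(baseline_path) as f:
        baseline = json.load(f)
      with open(current_path) as f:
        current = json.load(f)
      for name in sorted(set(baseline) & set(current)):
        base, curr = numpy.median(baseline[name]), numpy.median(current[name])
        print("%s: %+.1f%%" % (name, 100.0 * (curr - base) / base))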
    class CoderBenchmark(object):
      def __init__(self, num_elements_per_benchmark):
        self._coder = coders.IterableCoder(coder)
Add a comment about why we did this.
Done.
        numpy.median(cost_series[name]) / benchmark_config.size)
    std = numpy.std(cost_series[name]) / benchmark_config.size

    print("%s: per element median time cost: %g sec, std: %g sec" % (
Scale the std by the median (or give it as a percentage) so it's easier to see if the variance is significant.
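For example (variable names follow the snippet above; the exact format string is an assumption, though the sample output below uses a similar one):

    import numpy

    def format_result(name, timings, size):
      # Report std as a percentage of the median so it's easy to see whether
      # the run-to-run variance is significant.
      median_cost = numpy.median(timings) / size
      relative_std = 100.0 * numpy.std(timings) / numpy.median(timings)
      return "%s: per element median time cost: %g sec, relative std: %.2f%%" % (
          name, median_cost, relative_std)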
Done.
Sample output:
small_int, FastPrimitivesCoder, 1000 element(s) : per element median time cost: 7.60555e-08 sec, relative std: 27.47%
large_int, FastPrimitivesCoder, 1000 element(s) : per element median time cost: 1.50919e-07 sec, relative std: 3.13%
small_string, FastPrimitivesCoder, 1000 element(s) : per element median time cost: 4.40478e-07 sec, relative std: 2.62%
large_string, FastPrimitivesCoder, 1000 element(s) : per element median time cost: 8.64029e-07 sec, relative std: 3.34%
small_list, FastPrimitivesCoder, 1000 element(s) : per element median time cost: 7.69496e-07 sec, relative std: 3.20%
small_list, IterableCoder[FastPrimitivesCoder], 1000 element(s) : per element median time cost: 7.24435e-07 sec, relative std: 3.05%
large_list, FastPrimitivesCoder, 1000 element(s) : per element median time cost: 5.04484e-05 sec, relative std: 1.27%
large_list, IterableCoder[FastPrimitivesCoder], 1000 element(s) : per element median time cost: 5.01723e-05 sec, relative std: 0.92%
small_tuple, FastPrimitivesCoder, 1000 element(s) : per element median time cost: 2.95877e-07 sec, relative std: 3.43%
large_tuple, FastPrimitivesCoder, 1000 element(s) : per element median time cost: 5.34335e-05 sec, relative std: 1.22%
small_dict, FastPrimitivesCoder, 1000 element(s) : per element median time cost: 1.57547e-06 sec, relative std: 5.36%
large_dict, FastPrimitivesCoder, 1000 element(s) : per element median time cost: 1.9261e-05 sec, relative std: 8.33%
wv_with_one_window, WindowedValueCoder[FastPrimitivesCoder], 1000 element(s) : per element median time cost: 2.5202e-05 sec, relative std: 0.77%
wv_with_multiple_windows, WindowedValueCoder[FastPrimitivesCoder], 1000 element(s): per element median time cost: 0.000845308 sec, relative std: 11.42%
Hmm, it looks like automated tooling added three reviewers with my last commit. As you can see above, this PR, except for the last minor commit, has already been reviewed by @robertwb (thanks a lot!).
    REQUIRED_TEST_PACKAGES = [
        'nose>=1.3.7',
        'numpy',
numpy is already in REQUIRED_PERF_TEST_PACKAGES (line 132 in a563412):

    'numpy>=1.14.3',
REQUIRED_PERF_TEST_PACKAGES are not included in REQUIRED_TEST_PACKAGES; however, this PR comes with a unit test that exercises the coders microbenchmark, and that code will require numpy. So I think we still need numpy in REQUIRED_TEST_PACKAGES.
Force-pushed 6d1197d to bc6e63f
@aaltay Removed
I'll merge this now to have the requirement available for tests.
Thank you, @pabloem.
@tvalentyn Could you please add a version bound to numpy and cherry-pick the change to the release branch?
Add a microbenchmark to test the performance of coders, and a mini-framework for creating similar microbenchmarks.
Sample output:
Run 1 of 10:
Dict[int, int], FastPrimitiveCoder : per element time cost: 2.94209e-07 sec
Dict[str, int], FastPrimitiveCoder : per element time cost: 1.29011e-06 sec
List[int WV], IterableCoder+WVCoder+FPCoder: per element time cost: 7.58116e-05 sec
List[int], FastPrimitiveCoder : per element time cost: 1.17612e-07 sec
List[int], IterableCoder+FastPrimitiveCoder: per element time cost: 1.13797e-07 sec
Tuple[int], FastPrimitiveCoder : per element time cost: 1.22905e-07 sec
Run 2 of 10:
...
Median time cost:
Dict[int, int], FastPrimitiveCoder : per element median time cost: 3.12746e-07 sec
Dict[str, int], FastPrimitiveCoder : per element median time cost: 1.34674e-06 sec
List[int WV], IterableCoder+WVCoder+FPCoder: per element median time cost: 7.97846e-05 sec
List[int], FastPrimitiveCoder : per element median time cost: 1.22845e-07 sec
List[int], IterableCoder+FastPrimitiveCoder: per element median time cost: 1.22857e-07 sec
Tuple[int], FastPrimitiveCoder : per element median time cost: 1.26493e-07 sec