
Conversation

@tvalentyn (Contributor) commented Jun 5, 2018

Add a microbenchmark to test the performance of coders, and a mini-framework for creating similar microbenchmarks.

Sample output:

Run 1 of 10:
Dict[int, int], FastPrimitiveCoder : per element time cost: 2.94209e-07 sec
Dict[str, int], FastPrimitiveCoder : per element time cost: 1.29011e-06 sec
List[int WV], IterableCoder+WVCoder+FPCoder: per element time cost: 7.58116e-05 sec
List[int], FastPrimitiveCoder : per element time cost: 1.17612e-07 sec
List[int], IterableCoder+FastPrimitiveCoder: per element time cost: 1.13797e-07 sec
Tuple[int], FastPrimitiveCoder : per element time cost: 1.22905e-07 sec

Run 2 of 10:
...

Median time cost:
Dict[int, int], FastPrimitiveCoder : per element median time cost: 3.12746e-07 sec
Dict[str, int], FastPrimitiveCoder : per element median time cost: 1.34674e-06 sec
List[int WV], IterableCoder+WVCoder+FPCoder: per element median time cost: 7.97846e-05 sec
List[int], FastPrimitiveCoder : per element median time cost: 1.22845e-07 sec
List[int], IterableCoder+FastPrimitiveCoder: per element median time cost: 1.22857e-07 sec
Tuple[int], FastPrimitiveCoder : per element median time cost: 1.26493e-07 sec

@tvalentyn force-pushed the coders_microbenchmark branch 2 times, most recently from 2087bef to 9b237fd on June 5, 2018 23:30
@tvalentyn (Author)

cc: @aaltay, @robertwb, @RobbeSneyders

# TODO(BEAM-4441): Pick coders using type hints, for example:
# tuple_coder = typecoders.registry.get_coder(typehints.Tuple[int])

benchmarks.append({
Review comment (Contributor):
There seems to be a huge amount of redundancy here, perhaps create a coder_benchmark function that takes a (name, generate_fn, coder) and produces this dict.
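For illustration only, a minimal sketch of such a helper; the signature is hypothetical, and encode_and_decode is the helper quoted later in this PR:

def coder_benchmark(name, coder, generate_fn, input_size):
  # Builds one benchmark dict from (name, coder, generate_fn),
  # so each suite entry becomes a single call.
  return {
      "name": name,
      "input_generator": lambda: generate_fn(input_size),
      "op_fn": lambda input_: encode_and_decode(coder, input_),
  }

# benchmarks.append(coder_benchmark(
#     "List[int], FastPrimitiveCoder", coders.FastPrimitivesCoder(),
#     generate_list_of_ints, input_size))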

"name": "List[int], FastPrimitiveCoder",
"input_generator": lambda: generate_list_of_ints(input_size),
"op_fn": lambda input_: encode_and_decode(
coders.FastPrimitivesCoder(), input_),
Review comment (Contributor):
There is possibly a huge amount of overhead here that's not about actually doing the encoding. (E.g. creating the Coder, creating the impls, calling several Python functions, etc.)



def encode_and_decode(coder, data):
_ = coder.decode(coder.encode(data))
Review comment (Contributor):
Typically encode() and decode() are not called individually. It would probably be better to encode a list of 100 or 1000 elements with the ListCoder to cut out the overheads.
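A hedged sketch of that suggestion, wrapping the element coder in Beam's IterableCoder so a single encode/decode call covers a whole batch (the element count is illustrative):

from apache_beam import coders

def encode_and_decode_batch(element_coder, elements):
  # One encode/decode call over the whole list amortizes per-call
  # Python overhead across all elements.
  batch_coder = coders.IterableCoder(element_coder)
  return batch_coder.decode(batch_coder.encode(elements))

# e.g. time this over 1000 small ints instead of one element at a time:
# encode_and_decode_batch(coders.FastPrimitivesCoder(), list(range(1000)))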

on a generated input, collect and print its execution times.
Args:
benchmarks: A list of dictionaries that describe benchmarks.
Review comment (Contributor):
Consider using a named tuple?
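For example (a sketch; the field names mirror the BenchmarkConfig that a later revision of this PR introduces):

import collections

BenchmarkConfig = collections.namedtuple(
    "BenchmarkConfig", ["benchmark", "size", "num_runs"])

# config = BenchmarkConfig(benchmark=my_benchmark, size=1000, num_runs=10)
# where my_benchmark is a hypothetical benchmark object/callable.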

Review comment (Contributor):
Perhaps it's being too functional, but you could also let a benchmark be any callable that takes an input size as input and returns a callable. Its name could be getattr(benchmark, 'name', str(benchmark)) and the input size would be provided externally.

One could implement benchmarks, if desired, in OO style as

class MyBenchmark(object):
  def __init__(self, size):
    [whatever initialization]
  def __call__(self):
    [run the code in question]

This would make it easy to write

def run_benchmarks([list of benchmark classes/callables], [list of sizes]):
...

Author reply:
Thanks for the feedback, @robertwb !
In this case it sounds like we'd have to create a new instance of MyBenchmark for every iteration. Perhaps it would be better to just go OO style as in

class Benchmark(object):
  def __init__(self):
    # One-time initialization, for example instantiate coders and create an impl.
    ...

  def run(self):
    # Run the code in question.
    raise NotImplementedError()

  def generate_next_input(self, size):
    raise NotImplementedError()

  def __str__(self):
    # Returns the name of this Benchmark. Will be used as a key.
    raise NotImplementedError()


def run_benchmarks(benchmark_configs):
  # benchmark_configs: list of (benchmark object, size, number_of_iterations).
  # Returns dict [name -> (list of execution times)].
  ...

This way it seems easier to separate one-time initialization from input generation. Reusing object state may help reduce the overheads related to benchmark execution.
What do you think?
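A minimal sketch of the run_benchmarks side of that interface, assuming the Benchmark shape sketched above (the timing details are illustrative, not the PR's implementation):

import collections
import time

def run_benchmarks(benchmark_configs):
  # benchmark_configs: list of (benchmark, size, num_runs) tuples.
  # Returns a dict mapping benchmark name -> list of execution times in seconds.
  cost_series = collections.defaultdict(list)
  for benchmark, size, num_runs in benchmark_configs:
    for _ in range(num_runs):
      benchmark.generate_next_input(size)   # input generation is not timed
      start = time.time()
      benchmark.run()                        # only the run itself is timed
      cost_series[str(benchmark)].append(time.time() - start)
  return cost_series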

# runs.
gc.collect()
time_cost = run(b)
cost_series[b["name"]].append(time_cost / b["input_size"])
Review comment (Contributor):
If we retain these separately, we can do linear regression to sort out the overhead vs. the difference due to size.
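A sketch of what such a regression could look like (illustrative): fit time ≈ overhead + per_element * size over the retained (size, time) measurements.

import numpy

def fit_cost_model(sizes, times):
  # Least-squares fit of time = per_element * size + overhead.
  # polyfit returns coefficients with the highest degree first.
  per_element, overhead = numpy.polyfit(sizes, times, 1)
  return per_element, overhead

# per_element, overhead = fit_cost_model([100, 1000, 10000], measured_times)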

@robertwb (Contributor) commented Jun 7, 2018 via email

@robertwb (Contributor) commented Jun 7, 2018

A meta-comment: this seems like something that could be bike-shedded a lot. I wonder if it'd be better to focus on making the ideal framework for benchmarking coders, and later generalizing if that becomes useful.

@tvalentyn force-pushed the coders_microbenchmark branch from 9b237fd to f0e2e30 on June 9, 2018 02:59
@tvalentyn (Author)

Thanks @robertwb, PTAL.

@robertwb (Contributor) left a review comment:

Thanks. Some more comments on the coders-specific part.

time_cost = run(benchmark_config.benchmark, size)
cost_series[name].append(time_cost)
if verbose:
avg_cost = time_cost/size
Review comment (Contributor):
Nit: My eyes jump from "avg_cost" to "median_cost" below, which are quite different. Maybe call it "scaled_cost"?

Author reply:
Thanks, renamed the variables, should be more mnemonic now.

"""

def get_name(benchmark_config):
return getattr(benchmark_config.benchmark, '__name__',
Review comment (Contributor):
Maybe just take benchmark in, rather than benchmark_config?


cost_series = collections.defaultdict(list)
for benchmark_config in benchmark_suite:
name = get_name(benchmark_config)
Review comment (Contributor):
Should name incorporate the size?

Author reply:
Good idea, thanks.

generate_windowed_value_list),
]

BenchmarkConfig = collections.namedtuple(
Review comment (Contributor):
Move this to tools/utils.py (or remove).

Author reply:
Good idea, done.

size: int, a size of the input. Aggregated per-element metrics
are counted based on the size of the input.
num_runs: int, number of times to run each benchmark.
Review comment (Contributor):
The code indicates this is part of the tuple.



def generate_list_of_ints(input_size, lower_bound=0, upper_bound=sys.maxsize):
values = []
Review comment (Contributor):
List comprehension?
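i.e., something along these lines (a sketch; using random.randint to draw the values is an assumption, not confirmed by the PR):

import random
import sys

def generate_list_of_ints(input_size, lower_bound=0, upper_bound=sys.maxsize):
  return [random.randint(lower_bound, upper_bound)
          for _ in range(input_size)]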

Author reply:
Done, thanks



def generate_dict_int_int(input_size):
sample_list = generate_list_of_ints(input_size)
Review comment (Contributor):
return {val: val for val in generate_list_of_ints(input_size)}


def generate_dict_str_int(input_size):
sample_list = generate_list_of_ints(input_size)
return {generate_string(100): val for val in sample_list}
Review comment (Contributor):
same

"List[int], FastPrimitiveCoder", coders.FastPrimitivesCoder(),
generate_list_of_ints),
coder_benchmark_factory(
"Tuple[int, int], FastPrimitiveCoder", coders.FastPrimitivesCoder(),
Review comment (Contributor):
This is Tuple[int, ...], not 2-tuples of ints.

Author reply:
Fixed

self.data_ = generate_fn(size)

def call(self):
_ = self.coder_.decode(self.coder_.encode(self.data_))
Review comment (Contributor):
We care more about the speed of encoding a list of 10000 small dicts than of encoding a single dict of 10000 entries. The former better approximates how the code will get invoked on PCollections.

E.g. you could have

def coder_benchmark(name, coder, *sample_elements):
  def __init__(self, size):
    self._coder = IterableCoder(coder)
    self._elements = list(itertools.islice(itertools.cycle(sample_elements), size))

Author reply:
Changed. Let me know if that captures your idea. Currently benchmarks use a list of 1000 collections of size 50. The suite takes around 2 min to complete on my machine. Dict[str, int] and List of WindowedValues benchmarks take 10x longer to finish than others. With 10000 collections, the execution time might be a little longer than desirable for a quick feedback loop, but we can increase it if necessary.
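For reference, a hedged sketch of the general shape this change takes, based on the snippets quoted later in this review; names and defaults are approximate, and the actual PR times a list of 1000 collections of 50 elements each rather than the single batch shown here:

import itertools
from apache_beam import coders

class CoderBenchmark(object):
  def __init__(self, name, coder, sample_elements, num_elements=1000):
    self._name = name
    # Wrap the element coder in IterableCoder so each timed call encodes
    # a whole collection, approximating how coders run over PCollections.
    self._coder = coders.IterableCoder(coder)
    self._data = list(
        itertools.islice(itertools.cycle(sample_elements), num_elements))

  def run(self):
    self._coder.decode(self._coder.encode(self._data))

  def __str__(self):
    return self._name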

@RobbeSneyders (Contributor)

I ran the benchmark based on this framework in master...tvalentyn:utils_futurization_benchmark a few times and noticed a difference of up to 12% between different runs with the same code.

Is there something we should do to run the benchmarks isolated?

@tvalentyn (Author)

I did not run the benchmark in isolation, just used my machine + virtual environment. I also sometimes observed differences in performance between iterations. Increasing the number of iterations usually helps make the median more stable; could you try that?

@robertwb (Contributor) left a review comment:

Nice, thanks. Just a couple more thoughts.

cost_series[name].append(time_cost)
if verbose:
avg_cost = time_cost/size
per_element_cost = time_cost/size
Review comment (Contributor):
Nit: spaces around division operator.

def generate_dict_str_int(input_size):
sample_list = generate_list_of_ints(input_size)
return {generate_string(100): val for val in sample_list}
def dict_str_int(size):
Review comment (Contributor):
It seems that here (and elsewhere) size is always equal to NUM_ENTRIES_PER_COLLECTION.

It seems important to vary this parameter, e.g. if I were to list the things I wanted to benchmark it would be something like

small ints
large ints
small strings
large strings
small dicts
large dicts
small lists/tuples - 2 tuples are particularly common
large lists/tuples
windowed values with one window
windowed values with more than one window
...

On the other hand, dict_int_int vs dict_str_str probably isn't going to tell us much about the performance of dict.

@tvalentyn force-pushed the coders_microbenchmark branch 2 times, most recently from 0ac77e9 to def093f on July 12, 2018 00:17
@tvalentyn (Author)

Thanks, @robertwb, PTAL at def093f; I squashed the previous commits into one.
Sample output (final part):

small_int, FastPrimitivesCoder, 1000 element(s)                                   : per element median time cost: 7.51019e-08 sec, std: 1.93803e-08 sec
large_int, FastPrimitivesCoder, 1000 element(s)                                   : per element median time cost: 1.44005e-07 sec, std: 1.9427e-08 sec
small_string, FastPrimitivesCoder, 1000 element(s)                                : per element median time cost: 4.32968e-07 sec, std: 1.71283e-08 sec
large_string, FastPrimitivesCoder, 1000 element(s)                                : per element median time cost: 8.88467e-07 sec, std: 2.8785e-08 sec
small_list, FastPrimitivesCoder, 1000 element(s)                                  : per element median time cost: 7.83443e-07 sec, std: 8.90221e-08 sec
small_list, IterableCoder[FastPrimitivesCoder], 1000 element(s)                   : per element median time cost: 7.51138e-07 sec, std: 9.73916e-08 sec
large_list, FastPrimitivesCoder, 1000 element(s)                                  : per element median time cost: 5.09765e-05 sec, std: 1.58987e-06 sec
large_list, IterableCoder[FastPrimitivesCoder], 1000 element(s)                   : per element median time cost: 5.0562e-05 sec, std: 1.45765e-06 sec
small_tuple, FastPrimitivesCoder, 1000 element(s)                                 : per element median time cost: 2.93493e-07 sec, std: 9.39728e-09 sec
large_tuple, FastPrimitivesCoder, 1000 element(s)                                 : per element median time cost: 5.41936e-05 sec, std: 1.00567e-06 sec
small_dict, FastPrimitivesCoder, 1000 element(s)                                  : per element median time cost: 1.54459e-06 sec, std: 2.14256e-08 sec
large_dict, FastPrimitivesCoder, 1000 element(s)                                  : per element median time cost: 1.86415e-05 sec, std: 2.64062e-07 sec
wv_with_one_window, WindowedValueCoder[FastPrimitivesCoder], 1000 element(s)      : per element median time cost: 2.40631e-05 sec, std: 2.03605e-07 sec
wv_with_multiple_windows, WindowedValueCoder[FastPrimitivesCoder], 1000 element(s): per element median time cost: 0.000737476 sec, std: 1.76045e-05 sec

@robertwb (Contributor) left a review comment:

Thanks. Just two small comments, but LGTM from me (and you can find someone else to merge if I'm not around).

A follow-up would be the ability to easily save the results of a run and then compare them with a subsequent run; perhaps worth a TODO (but not more) at this point.


class CoderBenchmark(object):
def __init__(self, num_elements_per_benchmark):
self._coder = coders.IterableCoder(coder)
Review comment (Contributor):
Add a comment about why we did this.

Author reply:
Done.

numpy.median(cost_series[name]) / benchmark_config.size)
std = numpy.std(cost_series[name]) / benchmark_config.size

print("%s: per element median time cost: %g sec, std: %g sec" % (
Review comment (Contributor):
Scale the std by the median (or give it as a percentage) so it's easier to see if the variance is significant.
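i.e. something like the following sketch against the quoted lines (cost_series, name and benchmark_config come from the surrounding code; numpy is assumed to be imported in the module):

per_element_median = numpy.median(cost_series[name]) / benchmark_config.size
# The size factor cancels out, so the ratio is the same whether computed
# on total or per-element costs.
relative_std = 100.0 * numpy.std(cost_series[name]) / numpy.median(
    cost_series[name])
print("%s: per element median time cost: %g sec, relative std: %.2f%%" % (
    name, per_element_median, relative_std))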

Author reply:
Done.
Sample output:

small_int, FastPrimitivesCoder, 1000 element(s)                                   : per element median time cost: 7.60555e-08 sec, relative std: 27.47%
large_int, FastPrimitivesCoder, 1000 element(s)                                   : per element median time cost: 1.50919e-07 sec, relative std: 3.13%
small_string, FastPrimitivesCoder, 1000 element(s)                                : per element median time cost: 4.40478e-07 sec, relative std: 2.62%
large_string, FastPrimitivesCoder, 1000 element(s)                                : per element median time cost: 8.64029e-07 sec, relative std: 3.34%
small_list, FastPrimitivesCoder, 1000 element(s)                                  : per element median time cost: 7.69496e-07 sec, relative std: 3.20%
small_list, IterableCoder[FastPrimitivesCoder], 1000 element(s)                   : per element median time cost: 7.24435e-07 sec, relative std: 3.05%
large_list, FastPrimitivesCoder, 1000 element(s)                                  : per element median time cost: 5.04484e-05 sec, relative std: 1.27%
large_list, IterableCoder[FastPrimitivesCoder], 1000 element(s)                   : per element median time cost: 5.01723e-05 sec, relative std: 0.92%
small_tuple, FastPrimitivesCoder, 1000 element(s)                                 : per element median time cost: 2.95877e-07 sec, relative std: 3.43%
large_tuple, FastPrimitivesCoder, 1000 element(s)                                 : per element median time cost: 5.34335e-05 sec, relative std: 1.22%
small_dict, FastPrimitivesCoder, 1000 element(s)                                  : per element median time cost: 1.57547e-06 sec, relative std: 5.36%
large_dict, FastPrimitivesCoder, 1000 element(s)                                  : per element median time cost: 1.9261e-05 sec, relative std: 8.33%
wv_with_one_window, WindowedValueCoder[FastPrimitivesCoder], 1000 element(s)      : per element median time cost: 2.5202e-05 sec, relative std: 0.77%
wv_with_multiple_windows, WindowedValueCoder[FastPrimitivesCoder], 1000 element(s): per element median time cost: 0.000845308 sec, relative std: 11.42%

@tvalentyn (Author)

Hmm, it looks like automated tooling added three reviewers with my last commit. As you can see above, this PR, except for the last minor commit, has already been reviewed by @robertwb (thanks a lot!).


REQUIRED_TEST_PACKAGES = [
'nose>=1.3.7',
'numpy',
Review comment (Member):
numpy is already in REQUIRED_PERF_TEST_PACKAGES:

'numpy>=1.14.3',

Author reply:
REQUIRED_PERF_TEST_PACKAGES is not included in REQUIRED_TEST_PACKAGES; however, this PR comes with a unit test that exercises the coders microbenchmark, and that code will require numpy. So I think we still need numpy in REQUIRED_TEST_PACKAGES.

@tvalentyn force-pushed the coders_microbenchmark branch from 6d1197d to bc6e63f on July 17, 2018 22:18
@tvalentyn (Author)

@aaltay Removed REQUIRED_PERF_TEST_PACKAGES per offline conversation, keeping numpy in REQUIRED_TEST_PACKAGES. I think this is ready to merge.

@pabloem (Member) commented Jul 18, 2018

I'll merge this now to have the requirement available for tests.

@pabloem merged commit e466862 into apache:master on Jul 18, 2018
@tvalentyn (Author)

Thank you, @pabloem.

@aaltay (Member) commented Jul 19, 2018

@tvalentyn Could you please add a version bound to numpy and cherry-pick the change to the release branch?

'numpy>=1.14.3,<2' would be fine.

@tvalentyn (Author)

#5992
