Add coders microbenchmark. #5565
Conversation
Force-pushed 2087bef to 9b237fd
cc: @aaltay, @robertwb, @RobbeSneyders
    # TODO(BEAM-4441): Pick coders using type hints, for example:
    # tuple_coder = typecoders.registry.get_coder(typehints.Tuple[int])

    benchmarks.append({
There seems to be a huge amount of redundancy here, perhaps create a coder_benchmark function that takes a (name, generate_fn, coder) and produces this dict.
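For illustration, such a factory might look like the following sketch; the dict keys and the explicit input_size parameter are assumptions based on the snippets quoted in this review, and encode_and_decode / generate_list_of_ints are the helpers from this PR:

    def coder_benchmark(name, generate_fn, coder, input_size):
      # Build the benchmark dict that was previously written out by hand
      # for each coder/type combination.
      return {
          "name": name,
          "input_size": input_size,
          "input_generator": lambda: generate_fn(input_size),
          "op_fn": lambda input_: encode_and_decode(coder, input_),
      }

    benchmarks.append(coder_benchmark(
        "List[int], FastPrimitiveCoder", generate_list_of_ints,
        coders.FastPrimitivesCoder(), input_size))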
| "name": "List[int], FastPrimitiveCoder", | ||
| "input_generator": lambda: generate_list_of_ints(input_size), | ||
| "op_fn": lambda input_: encode_and_decode( | ||
| coders.FastPrimitivesCoder(), input_), |
There is possibly a huge amount of overhead here that's not about actually doing the encoding. (E.g. creating the Coder, creating the impls, calling several Python functions, etc.)
    def encode_and_decode(coder, data):
      _ = coder.decode(coder.encode(data))
Typically encode() and decode() are not called individually. It would probably be better to encode a list of 100 or 1000 elements with the ListCoder to cut out the overheads.
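A minimal sketch of that batching idea, using apache_beam.coders.IterableCoder (which later revisions in this thread adopt); the batch size of 1000 is illustrative:

    from apache_beam import coders

    def encode_and_decode_batch(element_coder, elements):
      # One encode()/decode() call covers the whole batch, so per-call Python
      # overhead is amortized over many elements.
      batch_coder = coders.IterableCoder(element_coder)
      _ = batch_coder.decode(batch_coder.encode(elements))

    encode_and_decode_batch(coders.FastPrimitivesCoder(), list(range(1000)))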
    on a generated input, collect and print its execution times.
    Args:
      benchmarks: A list of dictionaries that describe benchmarks.
Consider using a named tuple?
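For example, along the lines of the BenchmarkConfig tuple that shows up later in this PR (the field names here mirror the later snippets):

    import collections

    # Keeps per-benchmark settings together without stringly-typed dict lookups.
    BenchmarkConfig = collections.namedtuple(
        'BenchmarkConfig', ['benchmark', 'size', 'num_runs'])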
Perhaps it's being too functional, but you could also let a benchmark be any callable that takes an input size and returns a callable. Its name could be getattr(benchmark, 'name', str(benchmark)), and the input size would be provided externally.
One could implement benchmarks, if desired, in OO style as

    class MyBenchmark(object):
      def __init__(self, size):
        ...  # whatever initialization

      def __call__(self):
        ...  # run the code in question

This would make it easy to write

    def run_benchmarks(benchmarks, sizes):
      # benchmarks: a list of benchmark classes/callables; sizes: a list of sizes.
      ...
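For illustration, a fuller version of that sketch; the timeit-based timing, the num_runs parameter, and the "name, size" result key are assumptions, not part of the proposal or of this PR:

    import timeit

    def run_benchmarks(benchmark_factories, sizes, num_runs=10):
      # Returns {benchmark name + size: [execution times]}.
      results = {}
      for factory in benchmark_factories:
        name = getattr(factory, 'name', str(factory))
        for size in sizes:
          times = []
          for _ in range(num_runs):
            benchmark = factory(size)  # per-run setup for this (factory, size) pair
            times.append(timeit.timeit(benchmark, number=1))
          results["%s, %d" % (name, size)] = times
      return results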
Thanks for the feedback, @robertwb !
In this case it sounds like we'd have to create a new instance of MyBenchmark for every iteration. Perhaps it would be better to just go OO style as in
    class Benchmark(object):
      def __init__(self):
        ...  # whatever one-time initialization, e.g. instantiate coders and create an impl

      def run(self):
        # Run the code in question.
        raise NotImplementedError()

      def generate_next_input(self, size):
        raise NotImplementedError()

      def __str__(self):
        # Returns the name of this Benchmark; will be used as a key.
        raise NotImplementedError()

    def run_benchmarks(benchmark_configs):
      # benchmark_configs: a list of (benchmark object, size, number_of_iterations) tuples.
      # Returns a dict {name: [execution times]}.
      ...
This way it seems easier to separate one-time initialization from input generation. Reusing object state may help reduce the overheads related to benchmark execution.
What do you think?
    # runs.
    gc.collect()
    time_cost = run(b)
    cost_series[b["name"]].append(time_cost / b["input_size"])
If we retain these separately, we can do linear regression to sort out the overhead vs. the difference due to size.
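As a sketch of that idea (assuming the raw per-run timings are kept together with the input sizes they were measured at):

    import numpy

    def fit_cost_model(sizes, timings):
      # Fit time ~= per_element_cost * size + fixed_overhead, separating the
      # constant setup overhead from the cost that scales with input size.
      per_element_cost, fixed_overhead = numpy.polyfit(sizes, timings, deg=1)
      return per_element_cost, fixed_overhead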
robertwb replied by email:

> In this case it sounds like we'd have to create a new instance of MyBenchmark for every iteration.

Correct. The class represents a set of possible benchmarks, the instance a concrete one that can be executed.

> Perhaps it would be better to just go OO style as in [the Benchmark/run_benchmarks sketch above] ... What do you think?

Hmm... I would say in this case the complexity of splitting up initialization vs. input generation (vs. just calling all of it "initialization") isn't going to be a big savings. More importantly, however, Benchmarks now become mutable objects, and a particular configuration is an instance after a certain method has been called with a particular parameter, rather than just an instance.
A meta-comment: this seems like something that could be bike-shedded a lot. I wonder if it'd be better to focus on making the ideal framework for benchmarking coders and generalize later if that becomes useful.
Force-pushed 9b237fd to f0e2e30
Thanks @robertwb, PTAL.
robertwb left a comment:
Thanks. Some more comments on the coders-specific part.
    time_cost = run(benchmark_config.benchmark, size)
    cost_series[name].append(time_cost)
    if verbose:
      avg_cost = time_cost/size
Nit: My eyes jump from "avg_cost" to "median_cost" below, which are quite different. Maybe call it "scaled_cost"?
Thanks, renamed the variables; they should be more mnemonic now.
| """ | ||
|
|
||
| def get_name(benchmark_config): | ||
| return getattr(benchmark_config.benchmark, '__name__', |
Maybe just take benchmark in, rather than benchmark_config?
    cost_series = collections.defaultdict(list)
    for benchmark_config in benchmark_suite:
      name = get_name(benchmark_config)
Should name incorporate the size?
Good idea, thanks.
        generate_windowed_value_list),
    ]

    BenchmarkConfig = collections.namedtuple(
Move this to tools/utils.py (or remove).
Good idea, done.
    size: int, a size of the input. Aggregated per-element metrics
      are counted based on the size of the input.
    num_runs: int, number of times to run each benchmark.
The code indicates this is part of the tuple.
    def generate_list_of_ints(input_size, lower_bound=0, upper_bound=sys.maxsize):
      values = []
List comprehension?
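For example (assuming the original loop fills the list with random integers, which isn't visible in this excerpt):

    import random
    import sys

    def generate_list_of_ints(input_size, lower_bound=0, upper_bound=sys.maxsize):
      # Build the sample input in a single expression instead of an append loop.
      return [random.randint(lower_bound, upper_bound) for _ in range(input_size)]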
Done, thanks
    def generate_dict_int_int(input_size):
      sample_list = generate_list_of_ints(input_size)
return {val: val for val in generate_list_of_ints(input_size)}
    def generate_dict_str_int(input_size):
      sample_list = generate_list_of_ints(input_size)
      return {generate_string(100): val for val in sample_list}
same
| "List[int], FastPrimitiveCoder", coders.FastPrimitivesCoder(), | ||
| generate_list_of_ints), | ||
| coder_benchmark_factory( | ||
| "Tuple[int, int], FastPrimitiveCoder", coders.FastPrimitivesCoder(), |
This is Tuple[int, ...], not 2-tuples of ints.
Fixed
        self.data_ = generate_fn(size)

      def call(self):
        _ = self.coder_.decode(self.coder_.encode(self.data_))
We care more about the speed of encoding a list of 10000 small dicts than a single dict of 10000 entries. The former better approximates how the code will get invoked on PCollections.
E.g. you could have
    def coder_benchmark(name, coder, *sample_elements):
      def __init__(self, size):
        self._coder = IterableCoder(coder)
        self._elements = list(itertools.islice(itertools.cycle(sample_elements), size))
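One way to read that sketch as a complete class factory (a hedged interpretation; the final code in this PR may differ):

    import itertools
    from apache_beam import coders

    def coder_benchmark(name, coder, *sample_elements):
      # Produces a benchmark class whose instances encode/decode a list of
      # `size` small elements with IterableCoder, approximating how coders
      # are invoked on PCollections.
      class _CoderBenchmark(object):
        def __init__(self, size):
          self._coder = coders.IterableCoder(coder)
          self._elements = list(
              itertools.islice(itertools.cycle(sample_elements), size))

        def __call__(self):
          _ = self._coder.decode(self._coder.encode(self._elements))

      _CoderBenchmark.name = name
      return _CoderBenchmark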
Changed. Let me know if that captures your idea. Currently benchmarks use a list of 1000 collections of size 50. The suite takes around 2 min to complete on my machine. Dict[str, int] and List of WindowedValues benchmarks take 10x longer to finish than others. With 10000 collections, the execution time might be a little longer than desirable for a quick feedback loop, but we can increase it if necessary.
I ran the benchmark based on this framework in master...tvalentyn:utils_futurization_benchmark a few times and noticed a difference of up to 12% between runs of the same code. Is there something we should do to run the benchmarks in isolation?
I did not run the benchmark in isolation, just used my machine + virtual environment. I sometimes also observed differences in performance between iterations. Increasing the number of iterations usually helps make the median more stable; could you try that?
robertwb left a comment:
Nice, thanks. Just a couple more thoughts.
    cost_series[name].append(time_cost)
    if verbose:
      avg_cost = time_cost/size
      per_element_cost = time_cost/size
Nit: spaces around division operator.
    def generate_dict_str_int(input_size):
      sample_list = generate_list_of_ints(input_size)
      return {generate_string(100): val for val in sample_list}

    def dict_str_int(size):
It seems that here (and elsewhere) size is always equal to NUM_ENTRIES_PER_COLLECTION. It seems important to vary this parameter; e.g., if I were to list the things I wanted to benchmark, it would be something like:
small ints
large ints
small strings
large strings
small dicts
large dicts
small lists/tuples (2-tuples are particularly common)
large lists/tuples
windowed values with one window
windowed values with more than one window
...
On the other hand, dict_int_int vs dict_str_str probably isn't going to tell us much about the performance of dict.
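For illustration, sample elements along those lines (the concrete values are invented here; the benchmark names in the sample output further down suggest the PR settled on a similar small/large split, with windowed-value samples built separately):

    # Hypothetical sample elements covering small/large variants of each type.
    small_int = 10
    large_int = 10**100
    small_string = 'a' * 4
    large_string = 'a' * 1000
    small_list = list(range(10))
    large_list = list(range(1000))
    small_tuple = (1, 2)  # 2-tuples are particularly common
    large_tuple = tuple(range(1000))
    small_dict = {i: i for i in range(5)}
    large_dict = {i: i for i in range(1000)}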
Force-pushed 0ac77e9 to def093f
Thanks @robertwb, PTAL at def093f; I squashed the previous commits into one.
Force-pushed def093f to dd078a7
robertwb left a comment:
Thanks. Just two small comments, but LGTM from me (and you can find someone else to merge if I'm not around).
A follow-up would be the ability to easily save the results of a run and then compare it with a subsequent run--perhaps worth a TODO (but not more) at this point.
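A minimal sketch of that follow-up (file format and helper names are hypothetical, not part of this PR):

    import json
    import numpy

    def save_results(path, cost_series):
      # cost_series: dict of benchmark name -> list of execution times.
      with open(path, 'w') as f:
        json.dump(cost_series, f)

    def compare_results(baseline_path, current_path):
      # Print the percentage change in median cost for each benchmark present
      # in both the baseline run and the current run.
      with open(baseline_path) as f:
        baseline = json.load(f)
      with open(current_path) as f:
        current = json.load(f)
      for name in sorted(set(baseline) & set(current)):
        base, curr = numpy.median(baseline[name]), numpy.median(current[name])
        print("%s: %+.1f%%" % (name, 100.0 * (curr - base) / base))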
    class CoderBenchmark(object):
      def __init__(self, num_elements_per_benchmark):
        self._coder = coders.IterableCoder(coder)
Add a comment about why we did this.
Done.
        numpy.median(cost_series[name]) / benchmark_config.size)
    std = numpy.std(cost_series[name]) / benchmark_config.size

    print("%s: per element median time cost: %g sec, std: %g sec" % (
Scale the std by the median (or give it as a percentage) so it's easier to see if the variance is significant.
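For example (variable names follow the snippet above; the exact format string is an assumption, though the sample output below uses a similar one):

    import numpy

    def format_result(name, timings, size):
      # Report std as a percentage of the median so it's easy to see whether
      # the run-to-run variance is significant.
      median_cost = numpy.median(timings) / size
      relative_std = 100.0 * numpy.std(timings) / numpy.median(timings)
      return "%s: per element median time cost: %g sec, relative std: %.2f%%" % (
          name, median_cost, relative_std)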
Done.
Sample output:
small_int, FastPrimitivesCoder, 1000 element(s) : per element median time cost: 7.60555e-08 sec, relative std: 27.47%
large_int, FastPrimitivesCoder, 1000 element(s) : per element median time cost: 1.50919e-07 sec, relative std: 3.13%
small_string, FastPrimitivesCoder, 1000 element(s) : per element median time cost: 4.40478e-07 sec, relative std: 2.62%
large_string, FastPrimitivesCoder, 1000 element(s) : per element median time cost: 8.64029e-07 sec, relative std: 3.34%
small_list, FastPrimitivesCoder, 1000 element(s) : per element median time cost: 7.69496e-07 sec, relative std: 3.20%
small_list, IterableCoder[FastPrimitivesCoder], 1000 element(s) : per element median time cost: 7.24435e-07 sec, relative std: 3.05%
large_list, FastPrimitivesCoder, 1000 element(s) : per element median time cost: 5.04484e-05 sec, relative std: 1.27%
large_list, IterableCoder[FastPrimitivesCoder], 1000 element(s) : per element median time cost: 5.01723e-05 sec, relative std: 0.92%
small_tuple, FastPrimitivesCoder, 1000 element(s) : per element median time cost: 2.95877e-07 sec, relative std: 3.43%
large_tuple, FastPrimitivesCoder, 1000 element(s) : per element median time cost: 5.34335e-05 sec, relative std: 1.22%
small_dict, FastPrimitivesCoder, 1000 element(s) : per element median time cost: 1.57547e-06 sec, relative std: 5.36%
large_dict, FastPrimitivesCoder, 1000 element(s) : per element median time cost: 1.9261e-05 sec, relative std: 8.33%
wv_with_one_window, WindowedValueCoder[FastPrimitivesCoder], 1000 element(s) : per element median time cost: 2.5202e-05 sec, relative std: 0.77%
wv_with_multiple_windows, WindowedValueCoder[FastPrimitivesCoder], 1000 element(s): per element median time cost: 0.000845308 sec, relative std: 11.42%
Hmm, it looks like automated tooling added three reviewers with my last commit. As you can see above, this PR, except for the last minor commit, has already been reviewed by @robertwb (thanks a lot!).
    REQUIRED_TEST_PACKAGES = [
        'nose>=1.3.7',
        'numpy',
numpy is already in REQUIRED_PERF_TEST_PACKAGES (line 132 in a563412):

    'numpy>=1.14.3',
REQUIRED_PERF_TEST_PACKAGES are not included in REQUIRED_TEST_PACKAGES; however, this PR comes with a unit test that exercises the coders microbenchmark, and that code will require numpy. So I think we still need numpy in REQUIRED_TEST_PACKAGES.
Force-pushed 6d1197d to bc6e63f
@aaltay Removed
I'll merge this now to have the requirement available for tests.
Thank you, @pabloem.
@tvalentyn Could you please add a version bound to numpy and cherry-pick the change to the release branch?
Add a microbenchmark to test the performance of coders, and a mini-framework for creating similar microbenchmarks.
Sample output:
Run 1 of 10:
Dict[int, int], FastPrimitiveCoder : per element time cost: 2.94209e-07 sec
Dict[str, int], FastPrimitiveCoder : per element time cost: 1.29011e-06 sec
List[int WV], IterableCoder+WVCoder+FPCoder: per element time cost: 7.58116e-05 sec
List[int], FastPrimitiveCoder : per element time cost: 1.17612e-07 sec
List[int], IterableCoder+FastPrimitiveCoder: per element time cost: 1.13797e-07 sec
Tuple[int], FastPrimitiveCoder : per element time cost: 1.22905e-07 sec
Run 2 of 10:
...
Median time cost:
Dict[int, int], FastPrimitiveCoder : per element median time cost: 3.12746e-07 sec
Dict[str, int], FastPrimitiveCoder : per element median time cost: 1.34674e-06 sec
List[int WV], IterableCoder+WVCoder+FPCoder: per element median time cost: 7.97846e-05 sec
List[int], FastPrimitiveCoder : per element median time cost: 1.22845e-07 sec
List[int], IterableCoder+FastPrimitiveCoder: per element median time cost: 1.22857e-07 sec
Tuple[int], FastPrimitiveCoder : per element median time cost: 1.26493e-07 sec