[BEAM-4006] Futurize transforms subpackage #5729
Conversation
superbobry left a comment
General question: does pre-commit on Jenkins run whitelisted unittests on Python 3?
  import itertools
  import random
  import unittest
+ from builtins import range
from future.builtins import range?
  avg_size_per_value = self._total_size // len(self._serialized_values)
  num_values_per_split = max(
-     int(desired_bundle_size / avg_size_per_value), 1)
+     int(desired_bundle_size // avg_size_per_value), 1)
No need for an int call?
We still need to coerce it into an int; e.g. 4 // 7.0 has value 0.0 but type float.
Good point. I didn't realise that avg_size_per_value could be a float.
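To make the failure mode concrete (plain Python, not Beam code): floor division preserves the float-ness of its operands, so the value can be integral while the type is still float.

# With a float operand, // truncates the value but returns a float:
print(4 // 7.0)        # 0.0
print(type(4 // 7.0))  # <class 'float'>

# Hence the explicit cast before using the result as an element count:
num_values_per_split = max(int(4 // 7.0), 1)
print(num_values_per_split)  # 1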
  def known_windows(self):
-   return self.window_ids.keys()
+   return list(self.window_ids.keys())
list(self.window_ids)?
  return NotImplemented

+ def __hash__(self):
+   return hash(self)
This will recurse infinitely, I think.
Exactly thx! Fixed now with: return hash(tuple(self))
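A minimal sketch of the bug and the fix (illustrative class, not the actual Beam one): hash(self) re-enters __hash__ and never terminates, while hashing a tuple of the instance's elements does not.

class WindowLike(object):  # hypothetical stand-in for the real class
    def __init__(self, *elements):
        self.elements = elements

    def __iter__(self):
        return iter(self.elements)

    def __hash__(self):
        # return hash(self)       # would call __hash__ again -> RecursionError
        return hash(tuple(self))  # hashes the elements instead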
  def _extract_input_pvalues(self, pvalueish):
    try:
      # If this works, it's a dict.
Out of scope for this PR, but I'm curious: why not just do an isinstance check?
  def __eq__(self, other):
-   raise NotImplementedError
+   return self.cmp(other) == 0
Consider using a total_ordering backport as in TimestampedValue.
By only implementing 2 methods (as in the example of TimestampedValue),
test_sessions_after_all (apache_beam.transforms.trigger_test.TriggerTest) didn't run
  def __hash__(self):
    return hash(self)

+ # def __lt__(self, other):
Leftover code?
  def __hash__(self):
-   return hash(self.end)
+   return hash(self)
This will recurse infinitely as well.
- def __cmp__(self, other):
+ def __eq__(self, other):
+   return (type(self) == type(other)) and (self.value == other.value) and \
Nitpick: parens are not required here.
- if type(self) is not type(other):
-   return cmp(type(self), type(other))
- return cmp((self.value, self.timestamp), (other.value, other.timestamp))
+ return type(self) < type(other)
I fear this might fail since types are not comparable in Python 3 unless you use a custom metaclass (which is not the case here).
Indeed! Using the hash of type should resolve this issue.
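A quick demonstration of both halves of this exchange (two throwaway classes; plain Python 3):

class A(object): pass
class B(object): pass

try:
    A < B  # fine on Python 2, TypeError on Python 3
except TypeError as e:
    print(e)  # '<' not supported between instances of 'type' and 'type'

# Hashes of types are plain ints, so they are always comparable
# (an arbitrary but usable ordering):
print(hash(A) < hash(B) or hash(B) < hash(A))  # True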
charlesccychen left a comment
Thanks!
  self.assertEquals(
      set(['from_dictionary', 'get_all_options', 'slices', 'style',
-          'view_as', 'display_data']),
+          'view_as', 'display_data', 'next']),
This is fine, but was there a particular reason it was added?
This is because of the import from builtins import object in apache_beam/transforms/display.py. This import adds an alias next = __next__ for Python 2 and Python 3 compatibility. PipelineOptions (the class under test here) inherits from the HasDisplayData class defined in the display.py module.
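A sketch of that mechanism (python-future's builtins module; the Counter class is illustrative):

from builtins import object  # python-future compatibility base class

class Counter(object):
    def __init__(self, n):
        self.i, self.n = 0, n

    def __iter__(self):
        return self

    def __next__(self):  # Python 3 iterator protocol
        if self.i >= self.n:
            raise StopIteration
        self.i += 1
        return self.i

print(list(Counter(3)))  # [1, 2, 3] on both Python 2 and Python 3
# On Python 2 the backported object class aliases next = __next__, which is
# why 'next' shows up as an extra attribute of classes deriving from it.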
  def match(actual):
    equal_to([1])([len(actual)])
-   equal_to(pairs)(actual[0].iteritems())
+   equal_to(pairs)(iteritems(actual[0]))
We can just use .items() here.
| """ | ||
| super(Create, self).__init__() | ||
| if isinstance(value, string_types): | ||
| if isinstance(value, (unicode, str)): |
Should we add bytes here?
Based on the requirements of value, I think we should actually check whether value is an iterable: hasattr(values, '__iter__'), which also fails for string_types.
The intention of this line is to prohibit string types from being returned where we expect an iterable of items. Strings technically are iterable (and return individual characters), so we want to prevent them from being returned accidentally (e.g., a user may intend to return a single string, but we don't want to interpret it as its individual characters). In both Python 2 and Python 3, string types are iterable, so I think we should add bytes to this list of "blacklisted" return types.
I have added bytes to the list! Strings in Python 2 didn't have an __iter__() method; the for-loop functionality was provided by the __getitem__() method. That's the reason I suggested the hasattr(values, '__iter__') attribute check. However, in Python 3 __iter__() is available for strings as well, so the check wouldn't work there.
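A sketch of the resulting guard (names illustrative, not the exact Beam code): reject all string-like types explicitly rather than probing for __iter__.

from past.builtins import unicode  # Py2's unicode; aliases str on Py3

def check_not_string(values):
    # str/unicode/bytes are iterable in both Python versions, but iterating
    # them yields characters (or ints, for Py3 bytes), which is almost never
    # what the user intended when creating a PCollection.
    if isinstance(values, (unicode, str, bytes)):
        raise TypeError('expected an iterable of elements, got a string: %r'
                        % (values,))
    return values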
  items = {k: (v if DisplayDataItem._get_value_type(v) is not None
               else str(v))
-          for k, v in pipeline_options.display_data().items()}
+          for k, v in iteritems(pipeline_options.display_data())}
Does this actually need to change?
I added it for efficiency in Python 2, but this will actually not be called that often, so I'll revert it.
  if (any([isinstance(v, pvalue.PCollection) for v in args]) or
-     any([isinstance(v, pvalue.PCollection) for v in kwargs.itervalues()])):
+     any([isinstance(v, pvalue.PCollection) for v in itervalues(kwargs)])):
Can we just do .values()?
This is a bit less optimal in Python 2, but since it's only in the __init__, I'll change it to .values().
  equal_to([expected_elem])([actual_elem])
  equal_to(expected_list)(actual_list)
- equal_to(expected_pairs)(actual_dict.iteritems())
+ equal_to(expected_pairs)(iteritems(actual_dict))
Can we just use .items()?
  equal_to([expected_elem])([actual_elem])
- equal_to(expected_kvs)(actual_dict1.iteritems())
- equal_to(expected_kvs)(actual_dict2.iteritems())
+ equal_to(expected_kvs)(iteritems(actual_dict1))
Can we just use .items()?
  'B-4': {6, 7, 8, 9},
  'B-3': {10, 15, 16},
- }.iteritems()))
+ })))
How about just .items()?
- def __cmp__(self, other):
+ def __eq__(self, other):
+   return (type(self) == type(other)) and (self.value == other.value) and \
Please avoid backslash continuation.
Oops, it looks like there is something wrong with the commit history: a bunch of Java changes are being pulled in. Can you rebase to master, cherry-pick and/or squash everything into one commit?
Force-pushed a317e92 to c386443.
Yes, sorry for that! Should now be cleaned up :-).
  def __eq__(self, other):
-   raise NotImplementedError
+   return self.cmp(other) == 0
It doesn't look like this takes care of the case where other is not of type BoundedWindow.
Now equivalent again to the original code.
  type_eq = type(self) == type(other)
  value_eq = self.value == other.value
  timestamp_eq = self.timestamp == other.timestamp
  return type_eq and value_eq and timestamp_eq
This will not work correctly, as the previous code relied on the short-circuiting behavior of "and". Accessing other.value will not work if the type is not as we expected.
Run Python PostCommit
| """ | ||
|
|
||
| from __future__ import absolute_import | ||
| from __future__ import division |
This change requires further changes in this file. In _BatchSizeEstimator._thin_data on line 285, we need to explicitly cast to int here; otherwise, we get the following error:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 642, in do_work
work_executor.execute()
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 156, in execute
op.start()
File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
def start(self):
File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
with self.scoped_start_state:
File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
with self.spec.source.reader() as reader:
File "dataflow_worker/native_operations.py", line 54, in dataflow_worker.native_operations.NativeReadOperation.start
self.output(windowed_value)
File "apache_beam/runners/worker/operations.py", line 175, in apache_beam.runners.worker.operations.Operation.output
cython.cast(Receiver, self.receivers[output_index]).receive(windowed_value)
File "apache_beam/runners/worker/operations.py", line 85, in apache_beam.runners.worker.operations.ConsumerSet.receive
cython.cast(Operation, consumer).process(windowed_value)
File "apache_beam/runners/worker/operations.py", line 403, in apache_beam.runners.worker.operations.DoOperation.process
with self.scoped_process_state:
File "apache_beam/runners/worker/operations.py", line 404, in apache_beam.runners.worker.operations.DoOperation.process
self.dofn_receiver.receive(o)
File "apache_beam/runners/common.py", line 569, in apache_beam.runners.common.DoFnRunner.receive
self.process(windowed_value)
File "apache_beam/runners/common.py", line 577, in apache_beam.runners.common.DoFnRunner.process
self._reraise_augmented(exn)
File "apache_beam/runners/common.py", line 602, in apache_beam.runners.common.DoFnRunner._reraise_augmented
raise
File "apache_beam/runners/common.py", line 575, in apache_beam.runners.common.DoFnRunner.process
self.do_fn_invoker.invoke_process(windowed_value)
File "apache_beam/runners/common.py", line 352, in apache_beam.runners.common.SimpleInvoker.invoke_process
output_processor.process_outputs(
File "apache_beam/runners/common.py", line 673, in apache_beam.runners.common._OutputProcessor.process_outputs
self.main_receivers.receive(windowed_value)
File "apache_beam/runners/worker/operations.py", line 85, in apache_beam.runners.worker.operations.ConsumerSet.receive
cython.cast(Operation, consumer).process(windowed_value)
File "apache_beam/runners/worker/operations.py", line 403, in apache_beam.runners.worker.operations.DoOperation.process
with self.scoped_process_state:
File "apache_beam/runners/worker/operations.py", line 404, in apache_beam.runners.worker.operations.DoOperation.process
self.dofn_receiver.receive(o)
File "apache_beam/runners/common.py", line 569, in apache_beam.runners.common.DoFnRunner.receive
self.process(windowed_value)
File "apache_beam/runners/common.py", line 577, in apache_beam.runners.common.DoFnRunner.process
self._reraise_augmented(exn)
File "apache_beam/runners/common.py", line 618, in apache_beam.runners.common.DoFnRunner._reraise_augmented
six.reraise(type(new_exn), new_exn, original_traceback)
File "apache_beam/runners/common.py", line 575, in apache_beam.runners.common.DoFnRunner.process
self.do_fn_invoker.invoke_process(windowed_value)
File "apache_beam/runners/common.py", line 352, in apache_beam.runners.common.SimpleInvoker.invoke_process
output_processor.process_outputs(
File "apache_beam/runners/common.py", line 651, in apache_beam.runners.common._OutputProcessor.process_outputs
for result in results:
File "/usr/local/lib/python2.7/dist-packages/apache_beam/transforms/util.py", line 345, in process
yield self._batch
File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/apache_beam/transforms/util.py", line 268, in record_time
self._thin_data()
File "/usr/local/lib/python2.7/dist-packages/apache_beam/transforms/util.py", line 287, in _thin_data
+ odd_one_out)
TypeError: slice indices must be integers or None or have an __index__ method [while running 'Analyze/RunPhase[0]/BatchAnalyzerInputs/BatchElements/ParDo(_GlobalWindowsBatchingDoFn)']
The threshold variable is now an int, and Python PostCommit succeeds.
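The failure is easy to reproduce in isolation: under from __future__ import division, / yields a float, and a float is not a valid slice index.

from __future__ import division

data = list(range(10))
threshold = 2 * len(data) / 3  # 6.666..., a float under true division
try:
    data[:threshold]
except TypeError as e:
    print(e)  # slice indices must be integers or None or have an __index__ method

print(data[:int(threshold)])  # works once coerced: [0, 1, 2, 3, 4, 5]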
Force-pushed 9ed0884 to f8aea37.
Run Python PostCommit
Unfortunately, this change is seen to produce a ~15% regression in internal Dataflow benchmarks. We have to investigate this regression before merging. CC: @tvalentyn
  class ReiterableNonEmptyAccumulators(object):
    def __iter__(self):
-     return itertools.ifilter(filter_fn, accumulators)
+     return filter(filter_fn, accumulators)
I think this might be a potential source of the performance loss; I'll update this to use ifilter on PY2.
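The guarded import keeps the lazy Python 2 iterator without breaking Python 3 (a sketch of the pattern later adopted in this PR):

try:
    from itertools import ifilter as filter  # Python 2: the lazy variant
except ImportError:
    pass  # Python 3: the built-in filter is already lazy

evens = filter(lambda x: x % 2 == 0, range(10))  # iterator on both versions
print(list(evens))  # [0, 2, 4, 6, 8]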
@charlesccychen and @tvalentyn: is there more detailed info on the benchmarks?
Thank you. Let me test the pipeline with this change. Unfortunately it's not easy to export the benchmark data.
Force-pushed 209f684 to f6dc113.
Unfortunately, the ifilter change here (#5729 (comment)) doesn't fix the regression.
superbobry left a comment
@charlesccychen could you share which subsystems are being benchmarked? The benchmark source code is not part of the beam repo, right?
  from apache_beam.utils import urns

+ try:
+   from itertools import ifilter as filter
How about from future.builtins import filter?

  timestamp_combiner = kwargs.pop('timestamp_combiner', None)
  if kwargs:
-   raise ValueError('Unexpected keyword arguments: %s' % kwargs.keys())
+   raise ValueError('Unexpected keyword arguments: %s' % list(kwargs.keys()))
Just list(kwargs) will work too.
| """ | ||
| super(Create, self).__init__() | ||
| if isinstance(value, string_types): | ||
| if isinstance(value, (unicode, str, bytes)): |
No need to check for both str and bytes since Python 2.7 defines bytes == str and on Python 3.X unicode == str.
See #5729 (comment). Bytes in Python 3 also shouldn't be allowed, since we don't want to support creation of a PCollection of individual bytes.
  from apache_beam.utils.timestamp import TIME_GRANULARITY

  # AfterCount is experimental. No backwards compatibility guarantees.
+ try:
I think this could be rewritten using future.moves.itertools.
  return self.end.predecessor()

- def __cmp__(self, other):
+ def cmp(self, other):
Consider using @total_ordering as well.
Adds some extra delay
  min(self.start, other.start), max(self.end, other.end))

+ @total_ordering
I wonder if the slowdown is due to the indirection introduced by @total_ordering?
I think it is; I'm working on a version without @total_ordering and without the cmp function call now!
We have confirmed so far by bisection that the slowdown is caused by some changes in util.py and/or window.py, and there are additional benchmark runs in flight to narrow this down further. It is very likely that the slowdown is caused by the time it takes to compare objects of some of the classes defined in window.py, due to changes in the implementation of cmp or hash functions. I also plan to confirm it with a microbenchmark similar to https://github.com/apache/beam/compare/master...tvalentyn:utils_futurization_benchmark?expand=1#diff-de123c6d83f9809a6f0d95be5a7d1826. That could help us get performance metrics for different implementations without running a slow benchmark suite.
I have pushed a new commit that should speed up the compare functions for BoundedWindow and TimestampedValue objects:
- removed the total_ordering decorator
- removed the custom cmp method in BoundedWindow
- removed the if-statement in TimestampedValue, using the short-circuit behavior of or
All of these changes had a positive impact on a small test I used.
A microbenchmark in the Apache Beam code would really be useful indeed :-)!
Force-pushed 2e86095 to c270644.
I did not benchmark changes in c270644 yet, but doing bisection on a previous version of the PR shows that the largest contributor to the regression is the line [...]
@tvalentyn is there anything I can help with? Are you planning to benchmark the changes in c270644? Or do you first want to add more microbenchmarks? I am working on a microbenchmark for TimestampedValue (BEAM-4855).
With the latest round of experiments, we finally got to the bottom of this performance regression, see: https://issues.apache.org/jira/browse/BEAM-4858. I will also put some details inline in util.py.
  key=div_keys)
  # Keep the top 1/3 most different pairs, average the top 2/3 most similar.
- threshold = 2 * len(pairs) / 3
+ threshold = 2 * len(pairs) // 3
Let's use past.utils.division.old_div in line 280 as an exception, and add a TODO(BEAM-4858) comment to clean this up.
I have confirmed that this change brings performance back to the same ballpark.
Perfect! I will update the PR, thx!
Thanks for your patience with this investigation.
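For reference, the agreed escape hatch looks roughly like this (a sketch; old_div reproduces Python 2's / semantics, i.e. floor division for two ints):

from past.utils import old_div

pairs = [(1, 2)] * 7
# TODO(BEAM-4858): remove this Py2-style division once the regression
# around native // is fully understood.
threshold = old_div(2 * len(pairs), 3)
print(threshold)  # 4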
@Fematich I'm taking a look at c270644. I don't believe [...]. Since we now know how to have a Py3-compatible version of this change that performs comparably well, the rest of performance testing won't take much time.
A few comments pertaining to 5b8842b.
  def __hash__(self):
-   return hash(self.end)
+   raise NotImplementedError
What's the reason to change the original behavior of __hash__? Seems like we should revert this change since it makes the objects of this class unhashable.
I removed the implementation to match the __eq__ behavior, which also raises NotImplementedError. This enforces that child classes implement __hash__, and makes it impossible for child classes like GlobalWindow and IntervalWindow objects to have the same hash. Does this make sense, or should I add the __hash__ method back?
I see, that makes sense, thanks. I think it's a good idea to keep raising NotImplementedError as you suggested, since we don't implement __eq__.
Note that if a class that overrides __eq__() needs to retain the implementation of __hash__() from a parent class, the interpreter must be told this explicitly by setting __hash__ = <ParentClass>.__hash__, see: https://docs.python.org/3/reference/datamodel.html#object.__hash__.
OK thanks, and interesting! So the NotImplementedError is only relevant for consistency with __eq__ but doesn't have an impact on child classes. Good to know :-)!
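The rule in miniature (plain Python 3 behavior, not Beam code): defining __eq__ implicitly sets __hash__ to None, and keeping the parent's hash must be explicit.

class Base(object):
    def __hash__(self):
        return 42

class Silenced(Base):
    def __eq__(self, other):
        return True
    # Defining __eq__ implicitly sets __hash__ = None on Python 3.

class Retained(Base):
    def __eq__(self, other):
        return True
    __hash__ = Base.__hash__  # explicit opt-in to keep the parent's hash

try:
    hash(Silenced())
except TypeError as e:
    print(e)             # unhashable type: 'Silenced'
print(hash(Retained()))  # 42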
  def __hash__(self):
    return hash((type(self), self.size, self.offset))

+ def __ne__(self, other):
Looks like classes IntervalWindow, GlobalWindow, SlidingWindows, and Sessions define __eq__, but don't define __ne__. Let's add:

def __ne__(self, other):
    return not self == other

since this would be the default implementation of __ne__ in Python 3. Curious why the conversion tool does not add something similar.
Adding

def __ne__(self, other):
    return not self == other

for IntervalWindow causes test_global_window to fail on the last assertion (comparing IntervalWindow (max-range) to GlobalWindow). To resolve this failed assertion I need to add the type in __eq__:

def __hash__(self):
    return hash((self.start, self.end, type(self)))

def __eq__(self, other):
    return (self.start == other.start
            and self.end == other.end
            and type(self) == type(other))

def __ne__(self, other):
    return not self == other

I think this makes sense and will be necessary for Python 3 compatibility; however, I'm not sure whether this will have a performance impact here?
Agree with you, looking at the test it seems that we are doing the right thing. I think there will not be performance impact here, but I'll do one final A/B test with and without the PR to be safe once we finalize it.
Thanks! I have just pushed the commit with these changes.
  return (cmp(self.end, other.end)
          or cmp(hash(self), hash(other))) != 0

+ def __lt__(self, other):
It seems to be a little faster if we don't pull in cmp. How about we implement the rich comparisons as follows:

# Order first by endpoint, then arbitrarily.  <-- let's mention this comment once
def __op__(self, other):
    if self.end != other.end:
        return self.end $op_symbol other.end
    return hash(self) $op_symbol hash(other)
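Instantiated for one operator pair (a sketch; the real class has more methods than this reduced stand-in), the template reads:

class IntervalWindow(object):  # reduced stand-in for the Beam class
    def __init__(self, start, end):
        self.start, self.end = start, end

    def __hash__(self):
        return hash((self.start, self.end))

    # Order first by endpoint, then arbitrarily.
    def __lt__(self, other):
        if self.end != other.end:
            return self.end < other.end
        return hash(self) < hash(other)

    def __gt__(self, other):
        if self.end != other.end:
            return self.end > other.end
        return hash(self) > hash(other)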
  def __eq__(self, other):
    raise NotImplementedError

+ def __ne__(self, other):
We can also remove cmp here:

def __ne__(self, other):
    return self.end != other.end or hash(self) != hash(other)
- if type(self) is not type(other):
-   return cmp(type(self), type(other))
- return cmp((self.value, self.timestamp), (other.value, other.timestamp))
+ def __eq__(self, other):
I suggest:
- Use @total_ordering.
- Implement __ne__ as return not self == other.
- Implement __lt__ without cmp and tuples, which performs slightly better:

def __lt__(self, other):
    if type(self) != type(other):
        return type(self) < type(other)
    if self.value != other.value:
        return self.value < other.value
    return self.timestamp < other.timestamp
Using total_ordering results in unexpected behavior. Concretely, the test test_reshuffle_windows_unchanged fails.
I have tried to locate the exact cause by implementing all OPs (with the total_ordering decorator in place) and subsequently leaving out the OPs one by one:
- adding the total_ordering decorator itself doesn't introduce issues;
- only using total_ordering to fill in __lt__ works; other combinations always fail;
- I am currently testing the OPs by manually applying the conversion rules defined by total_ordering to see if I can locate the exact problem: __ge__ can be removed and works with functools; __gt__ works as a manual copy of the functools rule from __lt__, but not with functools; __le__ doesn't work as a manual copy of the functools rule from __lt__, however replacing and self == other by and not (self != other) works.

@tvalentyn: Given the note on the performance impact of the total_ordering decorator, it might make sense to implement all OPs instead of using the decorator? That works already; in the meantime I will continue the testing (step 3) to see if I can give more info.
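For context, a minimal example of what @total_ordering synthesizes (standard functools, separate from the flaky test): given __eq__ and __lt__, the decorator derives the remaining comparisons as wrappers, e.g. __le__ as self < other or self == other, which is also where its extra call overhead comes from.

from functools import total_ordering

@total_ordering
class Stamp(object):
    def __init__(self, t):
        self.t = t

    def __eq__(self, other):
        return self.t == other.t

    def __lt__(self, other):
        return self.t < other.t

print(Stamp(1) <= Stamp(2), Stamp(3) > Stamp(2))  # True True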
With this PR, the test becomes flaky, or in other words passes sometimes. It may still flake if we implement all ops manually - did you try running the test multiple times when all ops are implemented?
I don't understand yet what change in behavior triggers this (we should find out), but I think we need to fix the test regardless of this: #6104.
Performance-wise, last week I used master...tvalentyn:transforms_microbenchmark to compare different options, and did not notice a significant difference when using @total_ordering (we could double check), so I favored the decorator to reduce the boilerplate.
Yes, I just retested with the full implementation which seems to work again. However, will be good to test the @total_ordering after your PR has been merged :-).
I checked performance of windowed_value, interval_window, timestamped_value, bounded_window in dictionaries and ordered lists, with and without this PR. For the most part, performance is not changed or improved. @total_ordering does not significantly affect it. Only concern is using hash(type(self)) when evaluating hashes of objects may be unnecessary in most cases, and slightly decreases the performance here: https://github.com/apache/beam/pull/5729/files#diff-d7dfd884622fb59806ba9276cf3bd8fbR242. So I left some more comments to simplify hash functions. The change above was also the trigger for test flakiness, although ultimately the test was at fault.
Without PR:
wv_with_one_window: dict, 10000 element(s) : per element median time cost: 4.71699e-06 sec, relative std: 5.93%
wv_with_multiple_windows: dict, 10000 element(s): per element median time cost: 4.02698e-05 sec, relative std: 0.60%
interval_window: dict, 10000 element(s) : per element median time cost: 1.5276e-06 sec, relative std: 1.78%
timestamped_value: dict, 10000 element(s) : per element median time cost: 1.39499e-07 sec, relative std: 7.44%
interval_window: sorting., 10000 element(s) : per element median time cost: 4.04392e-05 sec, relative std: 0.63%
timestamped_value: sorting., 10000 element(s) : per element median time cost: 1.80363e-05 sec, relative std: 1.35%
bounded_window: sorting., 10000 element(s) : per element median time cost: 4.06633e-05 sec, relative std: 1.26%
With PR (including the change suggested in last iteration).
wv_with_one_window: dict, 10000 element(s) : per element median time cost: 5.047e-06 sec, relative std: 2.16%
wv_with_multiple_windows: dict, 10000 element(s): per element median time cost: 4.0575e-05 sec, relative std: 2.20%
interval_window: dict, 10000 element(s) : per element median time cost: 1.53821e-06 sec, relative std: 2.43%
timestamped_value: dict, 10000 element(s) : per element median time cost: 1.27995e-06 sec, relative std: 6.11%
interval_window: sorting., 10000 element(s) : per element median time cost: 1.83087e-05 sec, relative std: 1.28%
timestamped_value: sorting., 10000 element(s) : per element median time cost: 8.4375e-06 sec, relative std: 2.62%
bounded_window: sorting., 10000 element(s) : per element median time cost: 1.80462e-05 sec, relative std: 3.56%
Run Python Dataflow ValidatesRunner
tvalentyn left a comment
Thanks, @Fematich. Did another pass over the PR, two minor comments.
  if self.sum >= INT64_MAX:
    self.sum -= 2**64
- return self.sum / self.count if self.count else _NAN
+ return self.sum // self.count if self.count else _NAN
Please also make the change in line 266.
  def __ne__(self, other):
    return not self == other

+ def __lt__(self, other):
Since types are not comparable in Python 3, how about we change the implementation to:

def __lt__(self, other):
    if type(self) != type(other):
        return type(self).__name__ < type(other).__name__
    if self.value != other.value:
        return self.value < other.value
    return self.timestamp < other.timestamp
  def __hash__(self):
-   return hash((self.start, self.end))
+   return hash((self.start, self.end, type(self)))
Let's remove type(self) from the tuple.
  return False

+ def __hash__(self):
+   return hash((type(self), self.param_id))
Let's simplify this to hash(self.param_id).
  return False

+ def __hash__(self):
+   return hash((type(self), self.windowfn, self.accumulation_mode,
Since this was not defined before, this will most likely be dead code, and the current implementation may break the contract with __eq__ since it's not taking self._default into account. Let's make it __hash__ = None.
Unfortunately, removing __hash__ here results in TypeError: unhashable type: 'Windowing' for test_top_prefixes.
However, looking at self._default, it seems OK to remove it from the hash, since it's actually a check on the other class variables. Therefore I think __eq__ and __hash__ are still in sync. I removed type(self) as well.
You are right, self._default indeed is defined based on the values of other object attributes included in the hash.
Objects of this class are used in this dictionary: self._obj_to_id[obj] = id
  return type(self) == type(other) and self.count == other.count

+ def __hash__(self):
+   return hash((type(self), self.count))
Let's simplify this to return hash(self.count).
  return type(self) == type(other) and self.underlying == other.underlying

+ def __hash__(self):
+   return hash((type(self), self.underlying))
Let's remove type(self) from the tuple.
  and self.timestamp == other.timestamp)

+ def __hash__(self):
+   return hash((type(self), self.value, self.timestamp))
Let's remove type(self) from the tuple.
  return self.size == other.size and self.offset == other.offset

+ def __hash__(self):
+   return hash((type(self), self.size, self.offset))
Let's remove type(self) from the tuple.
  return not self == other

+ def __hash__(self):
+   return hash((type(self), self.offset, self.period))
Let's remove type(self) from the tuple.
  return not self == other

+ def __hash__(self):
+   return hash((type(self), self.gap_size))
Let's remove type(self) from the tuple.
tvalentyn left a comment
LGTM, thank you.
Thank you @Fematich and all the reviewers!
This pull request prepares the transforms subpackage for Python 3 support. This PR is part of a series in which all subpackages will be updated using the same approach.
This approach has been documented here and the first pull request in the series (Futurize coders subpackage) demonstrating this approach can be found at #5053.
R: @aaltay @tvalentyn @RobbeSneyders @charlesccychen