From 9a04ba926e36e1ba15c456c66f4b41c0143b1c50 Mon Sep 17 00:00:00 2001 From: Saavan Date: Thu, 20 Aug 2020 19:56:33 -0500 Subject: [PATCH 1/6] Add myself to authors --- website/www/site/data/authors.yml | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/website/www/site/data/authors.yml b/website/www/site/data/authors.yml index a5c966d51158..0a9740b58909 100644 --- a/website/www/site/data/authors.yml +++ b/website/www/site/data/authors.yml @@ -160,4 +160,8 @@ pedro: rionmonster: name: Rion Williams email: rionmonster@gmail.com - twitter: rionmonster \ No newline at end of file + twitter: rionmonster +saavannanavati: + name: Saavan Nanavati + email: saavan.nanavati@utexas.edu + twitter: \ No newline at end of file From b7a52b2f5467348fb468b2e34f176410546bfd8b Mon Sep 17 00:00:00 2001 From: Saavan Date: Thu, 20 Aug 2020 22:06:08 -0500 Subject: [PATCH 2/6] Add blog post #1: improved annotation support --- .../en/blog/python-improved-annotations.md | 109 ++++++++++++++++++ 1 file changed, 109 insertions(+) create mode 100644 website/www/site/content/en/blog/python-improved-annotations.md diff --git a/website/www/site/content/en/blog/python-improved-annotations.md b/website/www/site/content/en/blog/python-improved-annotations.md new file mode 100644 index 000000000000..fdf25e7a082a --- /dev/null +++ b/website/www/site/content/en/blog/python-improved-annotations.md @@ -0,0 +1,109 @@ +--- +layout: post +title: "Improved Annotation Support for the Python SDK" +date: 2020-08-21 00:00:01 -0800 +categories: + - blog + - python + - typing +authors: + - saavan +--- + + +The importance of static type checking in a dynamically +typed language like Python is not up for debate. Type hints +allow developers to leverage a strong typing system to: + - write better code, + - self-document ambiguous programming logic, and + - inform intelligent code completion in IDEs like PyCharm. 
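As a generic (non-Beam) illustration of those benefits, here is a small annotated function; the function name and values are invented for this example:

```python
from typing import List

def word_lengths(words: List[str]) -> List[int]:
    # The annotations document intent, let a static checker such as mypy
    # flag a call like word_lengths([1, 2]) before the code ever runs,
    # and feed IDE autocompletion.
    return [len(w) for w in words]

# The hints also remain introspectable at runtime:
assert word_lengths(["beam", "python"]) == [4, 6]
assert word_lengths.__annotations__["return"] == List[int]
```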
+ +This is why we're excited to announce upcoming improvements to +the `typehints` module of Beam's Python SDK, including support +for typed PCollections and Python 3 style annotations on PTransforms. + +# Improved Annotations +Today, you have the option to declare type hints on PTransforms using either +class decorators or inline functions. + +For instance, a PTransform with decorated type hints might look like this: +``` +@beam.typehints.with_input_types(int) +@beam.typehints.with_output_types(str) +class IntToStr(beam.PTransform): + def expand(self, pcoll): + return pcoll | beam.Map(lambda num: str(num)) + +strings = numbers | beam.ParDo(IntToStr()) +``` + +Using inline functions instead, the same transform would look like this: +``` +class IntToStr(beam.PTransform): + def expand(self, pcoll): + return pcoll | beam.Map(lambda num: str(num)) + +strings = numbers | beam.ParDo(IntToStr()).with_input_types(int).with_output_types(str) +``` + +Both methods have problems. Class decorators are syntax-heavy, +requiring two additional lines of code, whereas inline functions provide type hints +that aren't reusable across other instances of the same transform. Additionally, both +methods are incompatible with static type checkers like MyPy. + +With Python 3 annotations however, we can subvert these problems to provide a +clean and reusable type hint experience. Our previous transform now looks like this: +``` +class IntToStr(beam.PTransform): + def expand(self, pcoll: PCollection[int]) -> PCollection[str]: + return pcoll | beam.Map(lambda num: str(num)) + +strings = numbers | beam.ParDo(IntToStr()) +``` + +These type hints will actively hook into the internal Beam typing system to +play a role in pipeline type checking, and runtime type checking. + +So how does this work? + +## Typed PCollections +You guessed it! 
The PCollection class inherits from `typing.Generic`, allowing it to be +parameterized with either zero types (denoted `PCollection`) or one type (denoted `PCollection[T]`). +- A PCollection with zero types is implicitly converted to `PCollection[any]`. +- A PCollection with one type can have any nested type (e.g. `Union[int, str]`). + +Internally, Beam's typing system makes these annotations compatible with other +type hints by removing the outer PCollection container. + +## PBegin, PDone, None +Finally, besides PCollection, a valid annotation on the `expand(...)` method of a PTransform is +`PBegin`, `PDone`, and `None`. These are generally used for I/O operations. + +For instance, when saving data, your transform's output type should be `None`. +``` +class SaveResults(beam.PTransform): + def expand(self, pcoll: PCollection[str]) -> None: + return pcoll | beam.io.WriteToBigQuery(...) +``` + +# Next Steps +What are you waiting for.. start using annotations on your transforms! + +For more background on type hints in Python, see: +[Ensuring Python Type Safety](https://beam.apache.org/documentation/sdks/python-type-safety/). + +Finally, please +[let us know](https://beam.apache.org/community/contact-us/) +if you encounter any issues. 
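The `typing.Generic` mechanics behind typed PCollections can be sketched with a toy container — an illustration only, not Beam's actual `PCollection` implementation:

```python
from typing import Generic, TypeVar, Union, get_args

T = TypeVar("T")

class FakePCollection(Generic[T]):
    """Toy stand-in for a class that, like PCollection, subclasses Generic."""

# Subscripting with one type records the element type, like PCollection[T];
# the bare class remains usable unparameterized, like a plain PCollection.
assert get_args(FakePCollection[int]) == (int,)

# The single type parameter may itself be a nested type such as a Union.
assert get_args(FakePCollection[Union[int, str]]) == (Union[int, str],)
```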
From 22e5d46d0ae7069f17c7c0cff41c90a96681b65c Mon Sep 17 00:00:00 2001
From: Saavan
Date: Thu, 20 Aug 2020 23:16:57 -0500
Subject: [PATCH 3/6] Add draft of blog post #2: performance runtime type checking

---
 ...ython-performance-runtime-type-checking.md | 122 ++++++++++++++++++
 1 file changed, 122 insertions(+)
 create mode 100644 website/www/site/content/en/blog/python-performance-runtime-type-checking.md

diff --git a/website/www/site/content/en/blog/python-performance-runtime-type-checking.md b/website/www/site/content/en/blog/python-performance-runtime-type-checking.md
new file mode 100644
index 000000000000..1e513fe6af65
--- /dev/null
+++ b/website/www/site/content/en/blog/python-performance-runtime-type-checking.md
@@ -0,0 +1,122 @@
+---
+layout: post
+title: "Performance-Driven Runtime Type Checking for the Python SDK"
+date: 2020-08-21 00:00:01 -0800
+categories:
+  - blog
+  - python
+  - typing
+authors:
+  - saavan
+---

In this blog post, we are announcing a new, opt-in performance-driven runtime type check
for an upcoming release of Beam's Python SDK.

But let's take a step back - why do we even care about runtime type-checking? Let's look at an example.

```
class MultiplyNumberByTwo(beam.DoFn):
  def process(self, element: int):
    return element * 2

p = Pipeline()
p | beam.Create(['1', '2']) | beam.ParDo(MultiplyNumberByTwo())
```

In this code, we passed a list of strings to a DoFn that's clearly intended for processing
integers. Luckily, this code will throw an error during pipeline construction because
the inferred output type of `beam.Create(['1', '2'])` is `str` which is incompatible with
the declared input type hint of `MultiplyNumberByTwo.process` which is `int`.

However, what if we turned this pipeline type check off using the `no_pipeline_type_check`
flag? Or more realistically, what if the input data to MultiplyNumberByTwo is coming
from a database, preventing inference of the output data type?
In either case, no error would be thrown during pipeline construction.
Even at runtime, this code works. Each string would be multiplied by 2,
yielding a result of `['11', '22']`, but that's certainly not the outcome we want.

So how do you debug this breed of "hidden" errors? More broadly speaking, how do you
debug any error in Beam with a complex or confusing error message?

The answer is runtime type-checking.

# Runtime Type Checking
This feature works by checking that actual input and output values satisfy the declared
type constraints during pipeline execution. If you ran the code from before with RTC on,
you would receive the following error message:

```
Type hint violation for 'ParDo(MultiplyByTwo)': requires <class 'int'> but got <class 'str'> for element
```

This is an actionable error message - it tells you that either your code has a bug
or that your declared type hints are incorrect. Sounds simple enough, so what's the catch?

_It is soooo slowwwwww._ See for yourself.

| Element Size | Normal Pipeline | Runtime Type Checking Pipeline
| ------------ | --------------- | ------------------------------
| 1 | 5.3 sec | 5.6 sec
| 2,001 | 9.4 sec | 57.2 sec
| 10,001 | 24.5 sec | 259.8 sec
| 18,001 | 38.7 sec | 450.5 sec

In this micro-benchmark, the pipeline with runtime type checking was over 10x slower,
with the gap only increasing as our input PCollection increased in size.

So, is there any production-friendly alternative for runtime type-checking?

# Performance Runtime Type Check
There is! We developed a new flag called `performance_runtime_type_check` that achieves
blazingly fast speeds using a combination of
- efficient Cython code,
- smart sampling techniques, and
- optimized mega type-hints.

So what do the new numbers look like?
+ +| Element Size | Normal | RTC | Performance RTC +| ----------- | --------- | ---------- | --------------- +| 1 | 5.3 sec | 5.6 sec | 5.4 sec +| 2,001 | 9.4 sec | 57.2 sec | 11.2 sec +| 10,001 | 24.5 sec | 259.8 sec | 25.5 sec +| 18,001 | 38.7 sec | 450.5 sec | 39.4 sec + +On average, the new performance runtime type check is 4.4% slower than a +normal pipeline whereas the old runtime type check is over 900% slower! + +## How does it work? +There are three key factors responsible for this upgrade in performance. + +First, sampling. + +Second, Cython. + +Finally, we use a single mega type hint to type-check only the output values of transforms +instead of type-checking the input and output values separately. The set of constraints that +form this mega typehint are the producer transform's output type constraints along with +all producer transforms' input type constraints. Using this mega type hint allows us to reduce +overhead while simultaneously allowing us to throw _more actionable errors_. + +# Next Steps +Play around with the new `performance_runtime_type_check` feature! + +It's in an experimental state so please +[let us know](https://beam.apache.org/community/contact-us/) +if you encounter any issues. 
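The "smart sampling" mentioned above can be sketched as follows — a hypothetical illustration of a decaying sampling rate, not Beam's actual implementation:

```python
import random

class SamplingTypeChecker:
    """Hypothetical sketch: type check a shrinking fraction of elements."""

    def __init__(self, expected_type, decay=0.9, min_rate=0.05):
        self.expected_type = expected_type
        self.rate = 1.0           # start by checking every element
        self.decay = decay        # multiplicative decay per successful check
        self.min_rate = min_rate  # never drop below a fixed floor
        self.checked = 0

    def maybe_check(self, element):
        if random.random() < self.rate:
            self.checked += 1
            if not isinstance(element, self.expected_type):
                raise TypeError(
                    f"expected {self.expected_type.__name__}, got "
                    f"{type(element).__name__} for element {element!r}")
            # Growing confidence: decay the sampling rate toward the floor.
            self.rate = max(self.min_rate, self.rate * self.decay)
        return element

random.seed(0)
checker = SamplingTypeChecker(int)
for n in range(1000):
    checker.maybe_check(n)
# Only a small fraction of the 1000 elements paid for a type check.
assert 0 < checker.checked < 200
```

With the decay above, the expected number of checks settles near the floor rate, which is how the amortized cost stays small on large PCollections.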
From f9973aafc15a5a0b396ceb22abbcc715e2b0d962 Mon Sep 17 00:00:00 2001 From: Saavan Nanavati Date: Fri, 21 Aug 2020 01:51:46 -0500 Subject: [PATCH 4/6] Finish blog post #2 --- ...ython-performance-runtime-type-checking.md | 82 +++++++++++++------ 1 file changed, 57 insertions(+), 25 deletions(-) diff --git a/website/www/site/content/en/blog/python-performance-runtime-type-checking.md b/website/www/site/content/en/blog/python-performance-runtime-type-checking.md index 1e513fe6af65..9477ccc19ece 100644 --- a/website/www/site/content/en/blog/python-performance-runtime-type-checking.md +++ b/website/www/site/content/en/blog/python-performance-runtime-type-checking.md @@ -23,10 +23,12 @@ See the License for the specific language governing permissions and limitations under the License. --> -In this blog post, we are announcing a new, opt-in performance-driven runtime type check -for an upcoming release of Beam's Python SDK. +In this blog post, we're announcing the upcoming release of a new, opt-in +runtime type checking system for Beam's Python SDK that's optimized for performance +in both development and production environments. -But let's take a step back - why do we even care about runtime type-checking? Let's look at an example. +But let's take a step back - why do we even care about runtime type checking +in the first place? Let's look at an example. ``` class MultiplyNumberByTwo(beam.DoFn): @@ -37,28 +39,28 @@ p = Pipeline() p | beam.Create(['1', '2'] | beam.ParDo(MultiplyNumberByTwo()) ``` -In this code, we passed a list of strings to a DoFn that's clearly intended for processing +In this code, we passed a list of strings to a DoFn that's clearly intended for use with integers. Luckily, this code will throw an error during pipeline construction because the inferred output type of `beam.Create(['1', '2'])` is `str` which is incompatible with the declared input type hint of `MultiplyNumberByTwo.process` which is `int`. 
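Conceptually, the construction-time check compares the upstream transform's inferred output type against the downstream transform's declared input type before any data flows — a simplified sketch, not Beam's actual typehints machinery:

```python
def check_construction(inferred_output: type, declared_input: type) -> None:
    """Simplified stand-in for Beam's pipeline-construction type check."""
    if not issubclass(inferred_output, declared_input):
        raise TypeError(
            "requires {} but got {} for element".format(
                declared_input, inferred_output))

# beam.Create(['1', '2']) infers str; MultiplyNumberByTwo declares int.
try:
    check_construction(str, int)
    raised = False
except TypeError:
    raised = True
assert raised

# A compatible pair passes silently (bool is a subclass of int).
check_construction(bool, int)
```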
-However, what if we turned this pipeline type check off using the `no_pipeline_type_check` -flag? Or more realistically, what if the input data to MultiplyNumberByTwo is coming +However, what if we turned the pipeline type check off using the `no_pipeline_type_check` +flag? Or more realistically, what if the input PCollection to MultiplyNumberByTwo came from a database, preventing inference of the output data type? In either case, no error would be thrown during pipeline construction. -Even at runtime, this code works. Each string would be multiplied by 2, +And even at runtime, this code works. Each string would be multiplied by 2, yielding a result of `['11', '22']`, but that's certainly not the outcome we want. So how do you debug this breed of "hidden" errors? More broadly speaking, how do you -debug any error in Beam with a complex or confusing error message? +debug any error message in Beam that's complex or confusing (e.g. serialization errors)? -The answer is runtime type-checking. +The answer is to use runtime type checking. -# Runtime Type Checking +# Runtime Type Checking (RTC) This feature works by checking that actual input and output values satisfy the declared -type constraints during pipeline execution. If you ran the code from before with RTC on, -you would receive the following error message: +type constraints during pipeline execution. If you ran the code from before with +`runtime_type_check` on, you would receive the following error message: ``` Type hint violation for 'ParDo(MultiplyByTwo)': requires but got for element @@ -80,11 +82,11 @@ _It is soooo slowwwwww._ See for yourself. In this micro-benchmark, the pipeline with runtime type checking was over 10x slower, with the gap only increasing as our input PCollection increased in size. -So, is there any production-friendly alternative for runtime type-checking? +So, is there any production-friendly alternative? # Performance Runtime Type Check -There is! 
We developed a new flag called `performance_runtime_type_check` that achieves -blazingly fast speeds using a combination of +There is! We developed a new flag called `performance_runtime_type_check` that +minimizes its footprint on the pipeline's time complexity using a combination of - efficient Cython code, - smart sampling techniques, and - optimized mega type-hints. @@ -98,24 +100,54 @@ So what do the new numbers look like? | 10,001 | 24.5 sec | 259.8 sec | 25.5 sec | 18,001 | 38.7 sec | 450.5 sec | 39.4 sec -On average, the new performance runtime type check is 4.4% slower than a -normal pipeline whereas the old runtime type check is over 900% slower! +On average, the new Performance RTC is 4.4% slower than a normal pipeline whereas the old RTC +is over 900% slower! Additionally, as the size of the input PCollection increases, the fixed cost +of setting up the Performance RTC system is spread across each element, decreasing the relative +impact on the overall pipeline. With 18,001 elements, the difference is less than 1 second. ## How does it work? There are three key factors responsible for this upgrade in performance. -First, sampling. +1. Instead of type checking all values, we only type check a subset of values, known as +a sample in statistics. Initially, we sample a substantial number of elements, but as our +confidence that the element type won't change over time increases, we reduce our +sampling rate (up to a fixed minimum). + +2. Whereas the old RTC system used heavy decorators to perform the type check, the new RTC system +moves the type check to a Cython-optimized, non-decorated portion of the codebase. For reference, +Cython is a programming language that gives C-like performance to Python code. + +3. Finally, we use a single mega type hint to type-check only the output values of transforms +instead of type-checking both the input and output values separately. 
This mega typehint is composed of
+the original transform's output type constraints along with all consumer transforms' input type
+constraints. Using this mega type hint allows us to reduce overhead while simultaneously allowing
+us to throw _more actionable errors_. For instance, consider the following error (which was
+generated from the old RTC system):
+```
+Runtime type violation detected within ParDo(DownstreamDoFn): Type-hint for argument: 'element' violated. Expected an instance of <class 'str'>, instead found 9, an instance of <class 'int'>.
+```
+
+This error tells us that the `DownstreamDoFn` received an `int` when it was expecting a `str`, but doesn't tell us
+who created that `int` in the first place. Who is the offending upstream transform that's responsible for
+this `int`? Presumably, _that_ transform's output type hints were too expansive (e.g. `any`) or otherwise non-existent because
+no error was thrown during the runtime type check of its output.

-Second, Cython.
+The problem here boils down to a lack of context. If we knew who our consumers were when type
+checking our output, we could simultaneously type check our output value against our output type
+constraints and every consumers' input type constraints to know whether there is _any_ possibility
+for a mismatch. This is exactly what the mega type hint does, and it allows us to throw errors
+at the point of declaration rather than the point of exception, saving you valuable time
+while providing higher quality error messages.
+
+So what would the same error look like using Performance RTC? It's the exact same string but with one additional line:
+```
+[while running 'ParDo(UpstreamDoFn)']
+```

-Finally, we use a single mega type hint to type-check only the output values of transforms
-instead of type-checking the input and output values separately. The set of constraints that
-form this mega typehint are the producer transform's output type constraints along with
-all producer transforms' input type constraints.
Using this mega type hint allows us to reduce -overhead while simultaneously allowing us to throw _more actionable errors_. +And that's much more actionable for an investigation :) # Next Steps -Play around with the new `performance_runtime_type_check` feature! +Go play with the new `performance_runtime_type_check` feature! It's in an experimental state so please [let us know](https://beam.apache.org/community/contact-us/) From 587a20ab0b7a2097399ad4cd336c0ab1514096da Mon Sep 17 00:00:00 2001 From: Saavan Nanavati Date: Sat, 22 Aug 2020 14:23:14 -0500 Subject: [PATCH 5/6] Remove white space --- .../en/blog/python-improved-annotations.md | 40 ++++++++-------- ...ython-performance-runtime-type-checking.md | 48 +++++++++---------- 2 files changed, 44 insertions(+), 44 deletions(-) diff --git a/website/www/site/content/en/blog/python-improved-annotations.md b/website/www/site/content/en/blog/python-improved-annotations.md index fdf25e7a082a..3f7e077534ac 100644 --- a/website/www/site/content/en/blog/python-improved-annotations.md +++ b/website/www/site/content/en/blog/python-improved-annotations.md @@ -3,8 +3,8 @@ layout: post title: "Improved Annotation Support for the Python SDK" date: 2020-08-21 00:00:01 -0800 categories: - - blog - - python + - blog + - python - typing authors: - saavan @@ -23,15 +23,15 @@ See the License for the specific language governing permissions and limitations under the License. --> -The importance of static type checking in a dynamically -typed language like Python is not up for debate. Type hints +The importance of static type checking in a dynamically +typed language like Python is not up for debate. Type hints allow developers to leverage a strong typing system to: - - write better code, - - self-document ambiguous programming logic, and + - write better code, + - self-document ambiguous programming logic, and - inform intelligent code completion in IDEs like PyCharm. 
-This is why we're excited to announce upcoming improvements to -the `typehints` module of Beam's Python SDK, including support +This is why we're excited to announce upcoming improvements to +the `typehints` module of Beam's Python SDK, including support for typed PCollections and Python 3 style annotations on PTransforms. # Improved Annotations @@ -58,12 +58,12 @@ class IntToStr(beam.PTransform): strings = numbers | beam.ParDo(IntToStr()).with_input_types(int).with_output_types(str) ``` -Both methods have problems. Class decorators are syntax-heavy, -requiring two additional lines of code, whereas inline functions provide type hints -that aren't reusable across other instances of the same transform. Additionally, both +Both methods have problems. Class decorators are syntax-heavy, +requiring two additional lines of code, whereas inline functions provide type hints +that aren't reusable across other instances of the same transform. Additionally, both methods are incompatible with static type checkers like MyPy. -With Python 3 annotations however, we can subvert these problems to provide a +With Python 3 annotations however, we can subvert these problems to provide a clean and reusable type hint experience. Our previous transform now looks like this: ``` class IntToStr(beam.PTransform): @@ -74,17 +74,17 @@ strings = numbers | beam.ParDo(IntToStr()) ``` These type hints will actively hook into the internal Beam typing system to -play a role in pipeline type checking, and runtime type checking. +play a role in pipeline type checking, and runtime type checking. So how does this work? ## Typed PCollections -You guessed it! The PCollection class inherits from `typing.Generic`, allowing it to be -parameterized with either zero types (denoted `PCollection`) or one type (denoted `PCollection[T]`). +You guessed it! 
The PCollection class inherits from `typing.Generic`, allowing it to be +parameterized with either zero types (denoted `PCollection`) or one type (denoted `PCollection[T]`). - A PCollection with zero types is implicitly converted to `PCollection[any]`. - A PCollection with one type can have any nested type (e.g. `Union[int, str]`). -Internally, Beam's typing system makes these annotations compatible with other +Internally, Beam's typing system makes these annotations compatible with other type hints by removing the outer PCollection container. ## PBegin, PDone, None @@ -102,8 +102,8 @@ class SaveResults(beam.PTransform): What are you waiting for.. start using annotations on your transforms! For more background on type hints in Python, see: -[Ensuring Python Type Safety](https://beam.apache.org/documentation/sdks/python-type-safety/). +[Ensuring Python Type Safety](https://beam.apache.org/documentation/sdks/python-type-safety/). -Finally, please -[let us know](https://beam.apache.org/community/contact-us/) -if you encounter any issues. +Finally, please +[let us know](https://beam.apache.org/community/contact-us/) +if you encounter any issues. diff --git a/website/www/site/content/en/blog/python-performance-runtime-type-checking.md b/website/www/site/content/en/blog/python-performance-runtime-type-checking.md index 9477ccc19ece..ef52691b9725 100644 --- a/website/www/site/content/en/blog/python-performance-runtime-type-checking.md +++ b/website/www/site/content/en/blog/python-performance-runtime-type-checking.md @@ -3,8 +3,8 @@ layout: post title: "Performance-Driven Runtime Type Checking for the Python SDK" date: 2020-08-21 00:00:01 -0800 categories: - - blog - - python + - blog + - python - typing authors: - saavan @@ -23,11 +23,11 @@ See the License for the specific language governing permissions and limitations under the License. 
--> -In this blog post, we're announcing the upcoming release of a new, opt-in -runtime type checking system for Beam's Python SDK that's optimized for performance +In this blog post, we're announcing the upcoming release of a new, opt-in +runtime type checking system for Beam's Python SDK that's optimized for performance in both development and production environments. -But let's take a step back - why do we even care about runtime type checking +But let's take a step back - why do we even care about runtime type checking in the first place? Let's look at an example. ``` @@ -44,12 +44,12 @@ integers. Luckily, this code will throw an error during pipeline construction be the inferred output type of `beam.Create(['1', '2'])` is `str` which is incompatible with the declared input type hint of `MultiplyNumberByTwo.process` which is `int`. -However, what if we turned the pipeline type check off using the `no_pipeline_type_check` -flag? Or more realistically, what if the input PCollection to MultiplyNumberByTwo came +However, what if we turned the pipeline type check off using the `no_pipeline_type_check` +flag? Or more realistically, what if the input PCollection to MultiplyNumberByTwo came from a database, preventing inference of the output data type? -In either case, no error would be thrown during pipeline construction. -And even at runtime, this code works. Each string would be multiplied by 2, +In either case, no error would be thrown during pipeline construction. +And even at runtime, this code works. Each string would be multiplied by 2, yielding a result of `['11', '22']`, but that's certainly not the outcome we want. So how do you debug this breed of "hidden" errors? More broadly speaking, how do you @@ -59,14 +59,14 @@ The answer is to use runtime type checking. # Runtime Type Checking (RTC) This feature works by checking that actual input and output values satisfy the declared -type constraints during pipeline execution. 
If you ran the code from before with +type constraints during pipeline execution. If you ran the code from before with `runtime_type_check` on, you would receive the following error message: ``` Type hint violation for 'ParDo(MultiplyByTwo)': requires but got for element ``` -This is an actionable error message - it tells you that either your code has a bug +This is an actionable error message - it tells you that either your code has a bug or that your declared type hints are incorrect. Sounds simple enough, so what's the catch? _It is soooo slowwwwww._ See for yourself. @@ -79,7 +79,7 @@ _It is soooo slowwwwww._ See for yourself. | 10,001 | 24.5 sec | 259.8 sec | 18,001 | 38.7 sec | 450.5 sec -In this micro-benchmark, the pipeline with runtime type checking was over 10x slower, +In this micro-benchmark, the pipeline with runtime type checking was over 10x slower, with the gap only increasing as our input PCollection increased in size. So, is there any production-friendly alternative? @@ -109,19 +109,19 @@ impact on the overall pipeline. With 18,001 elements, the difference is less tha There are three key factors responsible for this upgrade in performance. 1. Instead of type checking all values, we only type check a subset of values, known as -a sample in statistics. Initially, we sample a substantial number of elements, but as our -confidence that the element type won't change over time increases, we reduce our +a sample in statistics. Initially, we sample a substantial number of elements, but as our +confidence that the element type won't change over time increases, we reduce our sampling rate (up to a fixed minimum). 2. Whereas the old RTC system used heavy decorators to perform the type check, the new RTC system -moves the type check to a Cython-optimized, non-decorated portion of the codebase. For reference, +moves the type check to a Cython-optimized, non-decorated portion of the codebase. 
For reference, Cython is a programming language that gives C-like performance to Python code. 3. Finally, we use a single mega type hint to type-check only the output values of transforms instead of type-checking both the input and output values separately. This mega typehint is composed of -the original transform's output type constraints along with all consumer transforms' input type +the original transform's output type constraints along with all consumer transforms' input type constraints. Using this mega type hint allows us to reduce overhead while simultaneously allowing -us to throw _more actionable errors_. For instance, consider the following error (which was +us to throw _more actionable errors_. For instance, consider the following error (which was generated from the old RTC system): ``` Runtime type violation detected within ParDo(DownstreamDoFn): Type-hint for argument: 'element' violated. Expected an instance of , instead found 9, an instance of . @@ -130,18 +130,18 @@ Runtime type violation detected within ParDo(DownstreamDoFn): Type-hint for argu This error tells us that the `DownstreamDoFn` received an `int` when it was expecting a `str`, but doesn't tell us who created that `int` in the first place. Who is the offending upstream transform that's responsible for this `int`? Presumably, _that_ transform's output type hints were too expansive (e.g. `any`) or otherwise non-existent because -no error was thrown during the runtime type check of its output. +no error was thrown during the runtime type check of its output. The problem here boils down to a lack of context. If we knew who our consumers were when type checking our output, we could simultaneously type check our output value against our output type constraints and every consumers' input type constraints to know whether there is _any_ possibility -for a mismatch. 
This is exactly what the mega type hint does, and it allows us to throw errors -at the point of declaration rather than the point of exception, saving you valuable time +for a mismatch. This is exactly what the mega type hint does, and it allows us to throw errors +at the point of declaration rather than the point of exception, saving you valuable time while providing higher quality error messages. So what would the same error look like using Performance RTC? It's the exact same string but with one additional line: ``` -[while running 'ParDo(UpstreamDoFn)'] +[while running 'ParDo(UpstreamDoFn)'] ``` And that's much more actionable for an investigation :) @@ -149,6 +149,6 @@ And that's much more actionable for an investigation :) # Next Steps Go play with the new `performance_runtime_type_check` feature! -It's in an experimental state so please -[let us know](https://beam.apache.org/community/contact-us/) -if you encounter any issues. +It's in an experimental state so please +[let us know](https://beam.apache.org/community/contact-us/) +if you encounter any issues. From 58bbad910c12bbb2285a0963d0fc5510f03c1455 Mon Sep 17 00:00:00 2001 From: Saavan Nanavati Date: Sat, 22 Aug 2020 14:39:56 -0500 Subject: [PATCH 6/6] Resolve PR comments --- .../en/blog/python-improved-annotations.md | 5 +++-- .../python-performance-runtime-type-checking.md | 16 ++++++++-------- 2 files changed, 11 insertions(+), 10 deletions(-) diff --git a/website/www/site/content/en/blog/python-improved-annotations.md b/website/www/site/content/en/blog/python-improved-annotations.md index 3f7e077534ac..775c5009264c 100644 --- a/website/www/site/content/en/blog/python-improved-annotations.md +++ b/website/www/site/content/en/blog/python-improved-annotations.md @@ -81,7 +81,7 @@ So how does this work? ## Typed PCollections You guessed it! 
The PCollection class inherits from `typing.Generic`, allowing it to be parameterized with either zero types (denoted `PCollection`) or one type (denoted `PCollection[T]`). -- A PCollection with zero types is implicitly converted to `PCollection[any]`. +- A PCollection with zero types is implicitly converted to `PCollection[Any]`. - A PCollection with one type can have any nested type (e.g. `Union[int, str]`). Internally, Beam's typing system makes these annotations compatible with other @@ -89,9 +89,10 @@ type hints by removing the outer PCollection container. ## PBegin, PDone, None Finally, besides PCollection, a valid annotation on the `expand(...)` method of a PTransform is -`PBegin`, `PDone`, and `None`. These are generally used for I/O operations. +`PBegin` or `None`. These are generally used for PTransforms that begin or end with an I/O operation. For instance, when saving data, your transform's output type should be `None`. + ``` class SaveResults(beam.PTransform): def expand(self, pcoll: PCollection[str]) -> None: diff --git a/website/www/site/content/en/blog/python-performance-runtime-type-checking.md b/website/www/site/content/en/blog/python-performance-runtime-type-checking.md index ef52691b9725..d9124909e3b3 100644 --- a/website/www/site/content/en/blog/python-performance-runtime-type-checking.md +++ b/website/www/site/content/en/blog/python-performance-runtime-type-checking.md @@ -42,18 +42,18 @@ p | beam.Create(['1', '2'] | beam.ParDo(MultiplyNumberByTwo()) In this code, we passed a list of strings to a DoFn that's clearly intended for use with integers. Luckily, this code will throw an error during pipeline construction because the inferred output type of `beam.Create(['1', '2'])` is `str` which is incompatible with -the declared input type hint of `MultiplyNumberByTwo.process` which is `int`. +the declared input type of `MultiplyNumberByTwo.process` which is `int`. 
-However, what if we turned the pipeline type check off using the `no_pipeline_type_check` -flag? Or more realistically, what if the input PCollection to MultiplyNumberByTwo came -from a database, preventing inference of the output data type? +However, what if we turned pipeline type checking off using the `no_pipeline_type_check` +flag? Or more realistically, what if the input PCollection to `MultiplyNumberByTwo` arrived +from a database, meaning that the output data type can only be known at runtime? In either case, no error would be thrown during pipeline construction. And even at runtime, this code works. Each string would be multiplied by 2, yielding a result of `['11', '22']`, but that's certainly not the outcome we want. -So how do you debug this breed of "hidden" errors? More broadly speaking, how do you -debug any error message in Beam that's complex or confusing (e.g. serialization errors)? +So how do you debug this breed of "hidden" errors? More broadly speaking, how do you debug +any typing or serialization error in Beam? The answer is to use runtime type checking. @@ -113,7 +113,7 @@ a sample in statistics. Initially, we sample a substantial number of elements, b confidence that the element type won't change over time increases, we reduce our sampling rate (up to a fixed minimum). -2. Whereas the old RTC system used heavy decorators to perform the type check, the new RTC system +2. Whereas the old RTC system used heavy wrappers to perform the type check, the new RTC system moves the type check to a Cython-optimized, non-decorated portion of the codebase. For reference, Cython is a programming language that gives C-like performance to Python code. @@ -129,7 +129,7 @@ Runtime type violation detected within ParDo(DownstreamDoFn): Type-hint for argu This error tells us that the `DownstreamDoFn` received an `int` when it was expecting a `str`, but doesn't tell us who created that `int` in the first place. 
Who is the offending upstream transform that's responsible for -this `int`? Presumably, _that_ transform's output type hints were too expansive (e.g. `any`) or otherwise non-existent because +this `int`? Presumably, _that_ transform's output type hints were too expansive (e.g. `Any`) or otherwise non-existent because no error was thrown during the runtime type check of its output. The problem here boils down to a lack of context. If we knew who our consumers were when type