Merged
2 changes: 1 addition & 1 deletion website/www/site/content/en/documentation/io/connectors.md
@@ -61,7 +61,7 @@ This table provides a consolidated, at-a-glance overview of the available built-
<td class="present">✔</td>
<td class="present">
-<a href="https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/AvroIO.html">native</a>
+<a href="https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/extensions/avro/io/AvroIO.html">native</a>
</td>
<td class="present">
@@ -75,7 +75,7 @@ multiple worker instances in parallel. As such, the code you provide for
can use `SourceTestUtils` to increase your implementation's test coverage
using a wide range of inputs with relatively few lines of code. For
examples that use `SourceTestUtils`, see the
-[AvroSourceTest](https://github.com/apache/beam/blob/master/sdks/java/core/src/test/java/org/apache/beam/sdk/io/AvroSourceTest.java) and
+[AvroSourceTest](https://github.com/apache/beam/blob/master/sdks/java/extensions/avro/src/test/java/org/apache/beam/sdk/extensions/avro/io/AvroSourceTest.java) and
[TextIOReadTest](https://github.com/apache/beam/blob/master/sdks/java/core/src/test/java/org/apache/beam/sdk/io/TextIOReadTest.java)
source code.

@@ -344,7 +344,7 @@ sinks that interact with files, including:
implementations for examples:

* [TextSink](https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextSink.java) and
-* [AvroSink](https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroSink.java).
+* [AvroSink](https://github.com/apache/beam/blob/master/sdks/java/extensions/avro/src/main/java/org/apache/beam/sdk/extensions/avro/io/AvroSink.java).


## PTransform wrappers {#ptransform-wrappers}
@@ -1213,7 +1213,7 @@ When possible, unit tests are favored over integration tests due to faster execu
<p>Tests that the source/sink populates display data correctly.
</td>
<td>
-<p><a href="https://github.com/apache/beam/blob/c57c983c8ae7d84926f9cf42f7c40af8eaf60545/sdks/java/core/src/test/java/org/apache/beam/sdk/io/AvroIOTest.java#L167">AvroIOTest.testReadDisplayData</a>
+<p><a href="https://github.com/apache/beam/blob/8bda63bc8ea0c1de9ec29d0da080df1769c65a2b/sdks/java/extensions/avro/src/test/java/org/apache/beam/sdk/extensions/avro/io/AvroIOTest.java#L174">AvroIOTest.testReadDisplayData</a>
<p><a href="https://github.com/apache/beam/blob/f9ae6d53e2e6ad8346cee955d646f7198dbb6502/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/datastore/DatastoreV1Test.java#L220">DatastoreV1Test.testReadDisplayData</a>
<p><a href="https://github.com/apache/beam/blob/4012a46d3aa7b2a4c628f1352c8b579733c71b41/sdks/python/apache_beam/io/gcp/bigquery_test.py#L187">bigquery_test.TestBigQuerySourcetest_table_reference_display_data</a>
</td>
@@ -121,7 +121,7 @@ Because KFP provides the input and output arguments as command-line arguments, a
{{< code_sample "sdks/python/apache_beam/examples/ml-orchestration/kfp/components/preprocessing/src/preprocess.py" preprocess_component_argparse >}}
{{< /highlight >}}

-The implementation of the `preprocess_dataset` function contains the Apache Beam pipeline code and the Beam pipeline options that select the runner. The executed preprocessing involves downloading the image bytes from their URL, converting them to a Torch Tensor, and resizing to the desired size. The caption undergoes a series of string manipulations to ensure that our model receives uniform image descriptions. Tokenization is not done here, but could be included here if the vocabulary is known. Finally, each element is serialized and written to [Avro](https://avro.apache.org/docs/1.2.0/) files. You can use alternative files formats, such as TFRecords.
+The implementation of the `preprocess_dataset` function contains the Apache Beam pipeline code and the Beam pipeline options that select the runner. The executed preprocessing involves downloading the image bytes from their URL, converting them to a Torch Tensor, and resizing to the desired size. The caption undergoes a series of string manipulations to ensure that our model receives uniform image descriptions. Tokenization is not done here, but could be included here if the vocabulary is known. Finally, each element is serialized and written to [Avro](https://avro.apache.org/docs/) files. You can use alternative files formats, such as TFRecords.


{{< highlight file="sdks/python/apache_beam/examples/ml-orchestration/kfp/components/preprocessing/src/preprocess.py" >}}