-
Notifications
You must be signed in to change notification settings - Fork 4.5k
Open
Description
What happened?
I tried running a few different pipelines using ios based on filebasedsink. They all failed on Dataflow and/or locally with various errors when the --streaming flag is used. Adding a few here:
Writing to parquet:
with beam.Pipeline(options=beam_options) as pipeline:
elements = pipeline | "Create elements" >> beam.Create(["Hello", "World!"])
import pyarrow
elements | output | beam.io.WriteToParquet(
"gs://path",
schema=pyarrow.schema(
[("name", pyarrow.binary()), ("age", pyarrow.int64())]
),
)
Writing to textio:
with beam.Pipeline(options=beam_options) as pipeline:
elements = pipeline | "Create elements" >> beam.Create(["Hello", "World!"])
elements | beam.io.WriteToText("./test")
This works locally, but fails on Dataflow before launch (unclear if it is the fault of transform or some translation layer, seems like it comes from this particular AsSingleton call -
beam/sdks/python/apache_beam/io/iobase.py
Line 1170 in 18849de
| AsSingleton(pre_finalize_coll)).with_output_types(str) |
Writing to textio with periodic impulse source gives a different error:
with beam.Pipeline(options=beam_options) as pipeline:
elements = pipeline | "Create elements" >> PeriodicImpulse() | 'ApplyWindowing' >> beam.WindowInto(
beam.transforms.window.FixedWindows(20))
elements | beam.io.WriteToText("./test")
ValueError: GroupByKey cannot be applied to an unbounded PCollection with global windowing and a default trigger
This isn't correct since this is clearly windowed before the write.
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
- Component: Python SDK
- Component: Java SDK
- Component: Go SDK
- Component: Typescript SDK
- Component: IO connector
- Component: Beam YAML
- Component: Beam examples
- Component: Beam playground
- Component: Beam katas
- Component: Website
- Component: Infrastructure
- Component: Spark Runner
- Component: Flink Runner
- Component: Samza Runner
- Component: Twister2 Runner
- Component: Hazelcast Jet Runner
- Component: Google Cloud Dataflow Runner