Skip to content

[Bug]: FileBasedSink seems to not work on streaming #32113

@damccorm

Description

@damccorm

What happened?

I tried running a few different pipelines using ios based on filebasedsink. They all failed on Dataflow and/or locally with various errors when the --streaming flag is used. Adding a few here:

Writing to parquet:

    with beam.Pipeline(options=beam_options) as pipeline:
        elements = pipeline | "Create elements" >> beam.Create(["Hello", "World!"])
        import pyarrow
        elements | output | beam.io.WriteToParquet(
        "gs://path",
        schema=pyarrow.schema(
            [("name", pyarrow.binary()), ("age", pyarrow.int64())]
        ),
    )

Writing to textio:

    with beam.Pipeline(options=beam_options) as pipeline:
        elements = pipeline | "Create elements" >> beam.Create(["Hello", "World!"])
        elements | beam.io.WriteToText("./test")

This works locally, but fails on Dataflow before launch (unclear if it is the fault of transform or some translation layer, seems like it comes from this particular AsSingleton call -

AsSingleton(pre_finalize_coll)).with_output_types(str)

Writing to textio with periodic impulse source gives a different error:

    with beam.Pipeline(options=beam_options) as pipeline:
        elements = pipeline | "Create elements" >> PeriodicImpulse()  | 'ApplyWindowing' >> beam.WindowInto(
          beam.transforms.window.FixedWindows(20))
        elements | beam.io.WriteToText("./test")
ValueError: GroupByKey cannot be applied to an unbounded PCollection with global windowing and a default trigger

This isn't correct since this is clearly windowed before the write.

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions