@moiseenkov
Contributor

@moiseenkov moiseenkov commented Jul 5, 2023

This PR fixes the following case.

The goal is to copy source/foo.txt to dest/foo.txt within a single GCS bucket.

  1. Create a GCS bucket and upload two files to the source directory like this:
gs://my-bucket/source/foo.txt
gs://my-bucket/source/foo.txt.abc
  1. Upload the following DAG to a Cloud Composer environment:
from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator
from datetime import datetime

with DAG(
    dag_id="gcs_to_gcs_fail_example",
    schedule_interval=None,
    catchup=False,
    start_date=datetime(2021, 1, 1),
) as dag:
    copy_file = GCSToGCSOperator(
        task_id="copy_file",
        source_bucket="my-bucket",
        source_object="source/foo.txt",
        destination_object="dest/foo.txt",
        exact_match=True,
    )
    copy_file
  1. Run the DAG

Expected bucket state:

gs://my-bucket/source/foo.txt
gs://my-bucket/source/foo.txt.abc
gs://my-bucket/dest/foo.txt

Actual (incorrect) bucket state:

gs://my-bucket/source/foo.txt
gs://my-bucket/source/foo.txt.abc
gs://my-bucket/dest/foo.txt/source/foo.txt

======================================================

The bug was caused by missing handling of exact_match=True when an object is copied without a wildcard. This PR fixes that.

======================================================

However, if the flag is set to its default value exact_match=False, then the operator's result is different:

gs://my-bucket/source/foo.txt
gs://my-bucket/source/foo.txt.abc
gs://my-bucket/dest/foo.txt/source/foo.txt
gs://my-bucket/dest/foo.txt/source/foo.txt.abc

This is actually correct, because in general source_object="path/to/the/file.txt" is treated not as a file path but as an object name prefix (doc). That's why the prefix source_object="source/foo.txt" matches both objects:

gs://my-bucket/source/foo.txt
gs://my-bucket/source/foo.txt.abc
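
To make the prefix behaviour concrete, here is a minimal sketch in plain Python. It is not the operator's code: the helper `select_source_objects` is hypothetical and only mimics the selection rule described in this comment, assuming object names are compared as plain strings.

```python
def select_source_objects(object_names, source_object, exact_match=False):
    """Return the object names a copy would pick up (illustrative only)."""
    if exact_match:
        # Only the object whose full name equals source_object is selected.
        return [name for name in object_names if name == source_object]
    # Default behaviour: source_object acts as an object name prefix.
    return [name for name in object_names if name.startswith(source_object)]


# Object names from the example bucket in this PR.
bucket_objects = ["source/foo.txt", "source/foo.txt.abc"]

prefix_matches = select_source_objects(bucket_objects, "source/foo.txt")
exact_matches = select_source_objects(
    bucket_objects, "source/foo.txt", exact_match=True
)

print(prefix_matches)  # both objects share the prefix
print(exact_matches)   # only source/foo.txt
```

With the default flag both objects are selected, which is why both end up under the destination prefix.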

If destination_object is set, each destination object name is built by appending the source object name to the destination prefix, as the listing above shows. GCS makes no distinction between copying a file and copying a folder: both are simply objects.
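
The naming rule can be sketched the same way. The function `build_destination_name` below is a hypothetical helper, not part of the Airflow provider; it only reproduces the bucket listings shown in this PR description, assuming a "/" separator between the destination prefix and the source object name.

```python
def build_destination_name(source_name, destination_object, exact_match=False):
    """Return the destination object name for one copied object (illustrative only)."""
    if exact_match:
        # Expected behaviour after this fix: destination_object is used verbatim.
        return destination_object
    # Prefix mode: the source object name is appended to the destination prefix.
    return f"{destination_object}/{source_name}"


for name in ["source/foo.txt", "source/foo.txt.abc"]:
    print(build_destination_name(name, "dest/foo.txt"))

print(build_destination_name("source/foo.txt", "dest/foo.txt", exact_match=True))
```

In prefix mode this yields the two dest/foo.txt/source/... names listed for exact_match=False, while exact_match=True gives the single expected dest/foo.txt.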

Perhaps it makes sense to implement more "human-friendly" logic, so that the operator treats its inputs as files and folders. I think that should be a separate operator, though, because GCSToGCSOperator's current implementation has become too complicated for major changes. These are just my thoughts; I'm not insisting.

@boring-cyborg boring-cyborg bot added area:providers provider:google Google (including GCP) related issues labels Jul 5, 2023
@eladkal
Contributor

eladkal commented Jul 5, 2023

Bugfix for the following case (b/289486604).

What is that?

@moiseenkov moiseenkov force-pushed the gcs_to_gcs_bugfix branch from 9beabfb to 2f68ae1 Compare July 5, 2023 14:47
@moiseenkov
Contributor Author

Bugfix for the following case (b/289486604).

What is that?

Reference to our internal ticket. Removed it.

@eladkal
Contributor

eladkal commented Jul 5, 2023

Reference to our internal ticket. Removed it.

Can you please edit title and amend comment to a meaningful title?

@moiseenkov moiseenkov changed the title Bugfix for GCSToGCSOperator (b/289486604) Bugfix GCSToGCSOperator when copy object without wildcard and exact_match=True Jul 5, 2023
@moiseenkov moiseenkov changed the title Bugfix GCSToGCSOperator when copy object without wildcard and exact_match=True Bugfix GCSToGCSOperator when copy an object without wildcard and exact_match=True Jul 5, 2023
@moiseenkov
Contributor Author

Reference to our internal ticket. Removed it.

Can you please edit title and amend comment to a meaningful title?

Sure! Done.

@moiseenkov moiseenkov force-pushed the gcs_to_gcs_bugfix branch from 2f68ae1 to b396840 Compare July 6, 2023 06:52
@moiseenkov moiseenkov force-pushed the gcs_to_gcs_bugfix branch from b396840 to 83f21d1 Compare July 6, 2023 07:04