
Dataflow (Apache Beam) with Python 3.5

In this little tutorial I will demo how to run a simple pipeline that reads records from CSV files stored in Google Cloud Storage and uploads them to Google BigQuery using Apache Beam (Google Dataflow) with Python 3.5.
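
For orientation, a minimal Beam pipeline of this shape could look like the sketch below. This is not the repo's preprocess.py: the table name sample_table, the comma delimiter, and the parsing logic are illustrative assumptions.

# Minimal sketch: read CSV lines from GCS, parse them into dicts, write them to BigQuery.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_line(line):
    # Assumes the three-column, comma-separated layout shown in the Usage section.
    col1, col2, col3 = line.split(',')
    return {'COL1': col1, 'COL2': col2, 'COL3': col3}

def run():
    options = PipelineOptions()  # --project, --runner, --temp_location, ... come from the CLI
    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         | 'ReadCSV' >> beam.io.ReadFromText('gs://sample_bucket/sample_data/*.csv',
                                             skip_header_lines=1)
         | 'ParseLine' >> beam.Map(parse_line)
         | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
               'sample_project:dataflow_tutorial.sample_table',
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

if __name__ == '__main__':
    run()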

Installation

git clone https://github.com/CalogeroZarbo/apache-beam-py3.5

cd apache-beam-py3.5

conda create -n apache-beam-py3.5 python=3.5

conda activate apache-beam-py3.5

pip install -r requirements.txt

Usage

In order to run this simple tutorial you will need:

  1. A Google Cloud Platform account
  2. A Google Cloud Storage folder to hold all the CSV files you would like to read
    1. Create a GCS bucket and call it sample_bucket (see the commands after this list)
  3. A BigQuery project called sample_project
    1. A BigQuery dataset under that project called dataflow_tutorial
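
If you prefer to set these up from the command line, the bucket and the dataset can be created roughly as follows (this assumes the Google Cloud SDK is installed and you are authenticated; the names match the ones above):

gcloud config set project sample_project

gsutil mb -l europe-west1 gs://sample_bucket/

bq mk --dataset sample_project:dataflow_tutorial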

The CSV table format (as per dataflow_tutorial/bigquery_table_specs.py) should look like:

COL1 COL2 COL3
val1 val2 val3

You can change the format, the name of the BigQuery project and the table name by modifying the file dataflow_tutorial/bigquery_table_specs.py.
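
The exact contents of that file live in the repo; as a rough idea, a spec for the layout above could be expressed like this (the constant names and the sample_table name are illustrative, not taken from the repo):

# Illustrative only -- see dataflow_tutorial/bigquery_table_specs.py for the real spec.
PROJECT = 'sample_project'
DATASET = 'dataflow_tutorial'
TABLE = 'sample_table'  # assumed table name

# Beam's WriteToBigQuery accepts a simple 'name:type' schema string like this one.
SCHEMA = 'COL1:STRING,COL2:STRING,COL3:STRING'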

export PROJECT=sample_project

export WORK_DIR=gs://sample_bucket/sample_data/

python preprocess.py --project $PROJECT --runner DataflowRunner --temp_location $WORK_DIR/beam-temp --setup_file ./setup.py --work-dir $WORK_DIR --region europe-west1
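
If you want to smoke-test the pipeline before submitting it to Dataflow, Apache Beam's DirectRunner runs the same code on your local machine; this is a generic Beam feature rather than something this repo documents, so treat it as a hedged suggestion:

python preprocess.py --project $PROJECT --runner DirectRunner --temp_location $WORK_DIR/beam-temp --setup_file ./setup.py --work-dir $WORK_DIR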

Brief Explanation of the files

  • setup.py packages the pipeline code so it can be installed on the distributed Apache Beam workers
  • preprocess.py is the main file where the Apache Beam pipeline is defined
  • dataflow_tutorial/ is the folder with the pipeline files needed to run the preprocessing properly
  • bigquery_table_specs.py contains the specifications for the tables on BigQuery
  • pipeline_utils.py contains the classes that read the CSV files and hand the different chunks to different machines
  • record_utils.py contains the definitions of the processing steps to perform on the records that have been read from the CSV (a sketch of such a step follows this list)
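
As an idea of what a processing step in record_utils.py looks like, here is a hypothetical Beam DoFn (not copied from the repo) that cleans a parsed record:

# Hypothetical per-record step, illustrative of what record_utils.py defines.
import apache_beam as beam

class CleanRecord(beam.DoFn):
    """Strip surrounding whitespace from every value of a parsed CSV record."""
    def process(self, record):
        yield {key: value.strip() for key, value in record.items()}

# Inside the pipeline it would be applied as:
#     ... | 'CleanRecord' >> beam.ParDo(CleanRecord())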

Disclaimer

This tutorial is openly inspired by the official Google resource at: https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/molecules.

Please refer to it for more information on how to extend the pipeline and attach CMLE (Cloud ML Engine) to it to train and serve ML models.

About

This repo is meant to demo a simple pipeline using Python 3.5 and Apache Beam through the Google Dataflow service.
