In this short tutorial I will show how to run a simple pipeline that reads records from CSV files stored in Google Cloud Storage and uploads them to Google BigQuery, using Apache Beam (Google Dataflow) with Python 3.5.
```sh
git clone https://github.com/CalogeroZarbo/apache-beam-py3.5
cd apache-beam-py3.5
conda create -n apache-beam-py3.5 python=3.5
conda activate apache-beam-py3.5
pip install -r requirements.txt
```
In order to run this tutorial you will need:
- a Google Cloud Platform account
- a Google Cloud Storage bucket called `sample_bucket`, where you put all the CSV files you would like to read
- a BigQuery project called `sample_project`
- a BigQuery dataset under that project called `dataflow_tutorial`
The CSV table format (as per `dataflow_tutorial/bigquery_table_specs.py`) should look like:
| COL1 | COL2 | COL3 |
|---|---|---|
| val1 | val2 | val3 |
You can change the format, the BigQuery project name, and the table name by modifying the file `dataflow_tutorial/bigquery_table_specs.py`.
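For reference, here is a minimal sketch of what a table specification module like `dataflow_tutorial/bigquery_table_specs.py` could contain. The table name `sample_table` and the `STRING` column types are assumptions for illustration, not taken from the repository:

```python
# Hypothetical table specification -- adjust the names and types to your data.
TABLE_NAME = 'sample_table'  # assumed table name, not from the repository
TABLE_SPEC = 'sample_project:dataflow_tutorial.{}'.format(TABLE_NAME)

# BigQuery schema in the compact 'name:TYPE,name:TYPE,...' string form
# accepted by Beam's BigQuery sink.
TABLE_SCHEMA = 'COL1:STRING,COL2:STRING,COL3:STRING'
```

Editing these constants is enough to point the pipeline at a different project, dataset, or table layout.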
```sh
export PROJECT=sample_project
export WORK_DIR=gs://sample_bucket/sample_data/
python preprocess.py \
  --project $PROJECT \
  --runner DataflowRunner \
  --temp_location $WORK_DIR/beam-temp \
  --setup_file ./setup.py \
  --work-dir $WORK_DIR \
  --region europe-west1
```
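To give an idea of what such a pipeline looks like, here is a minimal sketch of the same read-parse-write flow. This is not the repository's `preprocess.py`: the helper name `parse_csv_line`, the wildcard input path, the table name `sample_table`, and the schema string are all assumptions, and it presumes `apache-beam[gcp]` is installed:

```python
def parse_csv_line(line, fieldnames=('COL1', 'COL2', 'COL3')):
    """Split one CSV line into a dict keyed by the BigQuery column names."""
    import csv
    import io
    reader = csv.reader(io.StringIO(line))
    return dict(zip(fieldnames, next(reader)))


def run(argv=None):
    """Sketch of a Beam pipeline: read CSV lines from GCS, write to BigQuery.

    Assumes apache-beam[gcp] is installed; paths and table names are
    placeholders matching the tutorial's sample setup."""
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as p:
        (p
         | 'ReadCSV' >> beam.io.ReadFromText(
               'gs://sample_bucket/sample_data/*.csv', skip_header_lines=1)
         | 'ParseLines' >> beam.Map(parse_csv_line)
         | 'WriteToBQ' >> beam.io.WriteToBigQuery(
               'sample_project:dataflow_tutorial.sample_table',
               schema='COL1:STRING,COL2:STRING,COL3:STRING',
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```

With `--runner DataflowRunner`, Beam packages the code (via `setup.py`) and executes the same graph on Google-managed workers instead of locally.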
- `setup.py` installs the package with the pipeline on the Apache Beam distributed machines
- `preprocess.py` is the main file where the Apache Beam pipeline is defined
- `dataflow_tutorial/` is the folder with the pipeline files needed to run the preprocessing properly
- `bigquery_table_specs.py` contains the specifications for the tables on BigQuery
- `pipeline_utils.py` contains the classes to read the CSV files and handle the different chunks on different machines
- `record_utils.py` contains the definitions of the processing steps to perform on the records that have been read from the CSV files
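To illustrate the kind of per-record step `record_utils.py` defines, here is a hypothetical cleaning function; the name and logic are mine for illustration, not the repository's:

```python
def clean_record(record):
    """Strip surrounding whitespace and drop empty fields from one record.

    Illustrative sketch only -- the actual processing steps live in
    dataflow_tutorial/record_utils.py."""
    return {key: value.strip()
            for key, value in record.items()
            if value and value.strip()}
```

Steps like this are plain functions or `beam.DoFn` subclasses, which Beam applies to each record independently so the work parallelizes across workers.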
This tutorial is openly inspired by the official Google sample at https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/molecules.
Please refer to it for more information on how to extend the pipeline and attach Cloud ML Engine (CMLE) to it to train and serve ML models.