Samples migrated
================

New location: https://github.com/GoogleCloudPlatform/python-docs-samples/tree/main/dataproc/snippets

# Cloud Dataproc API Examples

[![Open in Cloud Shell][shell_img]][shell_link]

[shell_img]: http://gstatic.com/cloudssh/images/open-btn.png
[shell_link]: https://console.cloud.google.com/cloudshell/open?git_repo=https://github.com/GoogleCloudPlatform/python-docs-samples&page=editor&open_in_editor=dataproc/README.md

Sample command-line programs for interacting with the Cloud Dataproc API.

See [the tutorial on using the Dataproc API with the Python client library](https://cloud.google.com/dataproc/docs/tutorials/python-library-example) for a walkthrough you can run to try out the Cloud Dataproc API sample code.

Note that while these samples demonstrate interacting with Dataproc via the API, the same functionality could also be accomplished using the Cloud Console or the gcloud CLI.

`list_clusters.py` is a simple command-line program to demonstrate connecting to the Cloud Dataproc API and listing the clusters in a region.
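
In outline, that listing step looks roughly like the following with the `google-cloud-dataproc` client library. This is an illustrative sketch, not the sample itself; the function name is made up, and the regional endpoint format is an assumption based on the current client library:

    # Sketch only: list Dataproc clusters in one region.
    # Assumes `pip install google-cloud-dataproc` (v2+ API surface).
    from google.cloud import dataproc_v1

    def list_clusters(project_id, region):
        # Dataproc is regional, so the client must target the
        # matching regional endpoint.
        client = dataproc_v1.ClusterControllerClient(
            client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
        )
        for cluster in client.list_clusters(
            request={"project_id": project_id, "region": region}
        ):
            print(cluster.cluster_name)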

`submit_job_to_cluster.py` demonstrates how to create a cluster, submit the `pyspark_sort.py` job, download the output from Google Cloud Storage, and print the result.
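
For a sense of what the job-submission step involves, here is a hedged sketch using the `google-cloud-dataproc` client library (the function name and placeholder values are mine, not from the sample):

    # Sketch only: submit a PySpark job to an existing cluster and
    # wait for it to finish. Placeholder values throughout.
    from google.cloud import dataproc_v1

    def submit_pyspark_job(project_id, region, cluster_name, main_python_file_uri):
        client = dataproc_v1.JobControllerClient(
            client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
        )
        job = {
            "placement": {"cluster_name": cluster_name},
            "pyspark_job": {"main_python_file_uri": main_python_file_uri},
        }
        operation = client.submit_job_as_operation(
            request={"project_id": project_id, "region": region, "job": job}
        )
        finished = operation.result()  # blocks until the job completes
        print(f"Job finished with state: {finished.status.state.name}")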

`single_job_workflow.py` uses the Cloud Dataproc InstantiateInlineWorkflowTemplate API to create an ephemeral cluster, run a job, then delete the cluster with one API request.
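
The inline-workflow call bundles cluster creation, the job, and cluster teardown into a single request; roughly like this sketch (the managed-cluster name, config, and step id are placeholders):

    # Sketch only: ephemeral cluster + one job + teardown in a
    # single InstantiateInlineWorkflowTemplate request.
    from google.cloud import dataproc_v1

    def run_inline_workflow(project_id, region, main_python_file_uri):
        client = dataproc_v1.WorkflowTemplateServiceClient(
            client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
        )
        template = {
            "placement": {
                "managed_cluster": {
                    "cluster_name": "ephemeral-cluster",  # placeholder
                    "config": {},  # default cluster shape
                }
            },
            "jobs": [
                {
                    "step_id": "sort-job",  # placeholder
                    "pyspark_job": {"main_python_file_uri": main_python_file_uri},
                }
            ],
        }
        operation = client.instantiate_inline_workflow_template(
            request={
                "parent": f"projects/{project_id}/regions/{region}",
                "template": template,
            }
        )
        operation.result()  # cluster created, job run, cluster deleted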

`pyspark_sort_gcs.py` is the same as `pyspark_sort.py` but demonstrates reading from a GCS bucket.
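
Reading from GCS inside a PySpark job is just a matter of using a `gs://` URI, along the lines of this sketch (the bucket path is a placeholder):

    # Sketch only: sort lines read from a GCS object. On Dataproc
    # the GCS connector makes gs:// paths readable directly.
    import pyspark

    sc = pyspark.SparkContext()
    rdd = sc.textFile("gs://your-staging-bucket/input.txt")  # placeholder path
    print(sorted(rdd.collect()))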

## Prerequisites to run locally

* [pip](https://pypi.python.org/pypi/pip)

Go to the [Google Cloud Console](https://console.cloud.google.com).

Under API Manager, search for the Google Cloud Dataproc API and enable it.
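
Alternatively, if you have the Cloud SDK installed and initialized, you can enable the API from the command line:

    gcloud services enable dataproc.googleapis.com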

## Set Up Your Local Dev Environment

To install dependencies, run the following command. If you want to use [virtualenv](https://virtualenv.readthedocs.org/en/latest/) (recommended), run the command within a virtualenv:

    pip install -r requirements.txt

## Authentication

Please see the [Google Cloud authentication guide](https://cloud.google.com/docs/authentication/).
The recommended approach for running these samples is to use a Service Account with a JSON key.
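
After pointing `GOOGLE_APPLICATION_CREDENTIALS` at the JSON key, a quick way to confirm the credentials resolve (an illustrative snippet, not part of the samples):

    # Sketch only: verify Application Default Credentials resolve.
    # Assumes, e.g.: export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
    import google.auth

    credentials, project = google.auth.default()
    print(f"Authenticated; default project: {project}")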

## Environment Variables

Set the following environment variables:

    GOOGLE_CLOUD_PROJECT=your-project-id
    REGION=us-central1 # or your region
    CLUSTER_NAME=your-cluster-name
    ZONE=us-central1-b

## Running the samples

To run `list_clusters.py`:

    python list_clusters.py $GOOGLE_CLOUD_PROJECT --region=$REGION

`submit_job_to_cluster.py` can create the Dataproc cluster or use an existing cluster. To create a cluster before running the code, you can use the [Cloud Console](https://console.cloud.google.com) or run:

    gcloud dataproc clusters create your-cluster-name

To run `submit_job_to_cluster.py`, first create a GCS bucket (used by Cloud Dataproc to stage files) from the Cloud Console or with gsutil:

    gsutil mb gs://<your-staging-bucket-name>
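
If you would rather create the bucket from Python, the `google-cloud-storage` library can do it; a sketch (the bucket name is a placeholder, and bucket names must be globally unique):

    # Sketch only: create the staging bucket with google-cloud-storage
    # (pip install google-cloud-storage).
    from google.cloud import storage

    client = storage.Client()
    bucket = client.create_bucket("your-staging-bucket-name")  # placeholder
    print(f"Created bucket: {bucket.name}")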

Next, set the following environment variables:

    BUCKET=your-staging-bucket
    CLUSTER=your-cluster-name

Then, if you want to use an existing cluster, run:

    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=us-central1-b --cluster_name=$CLUSTER --gcs_bucket=$BUCKET

Alternatively, to create a new cluster, which will be deleted at the end of the job, run:

    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=us-central1-b --cluster_name=$CLUSTER --gcs_bucket=$BUCKET --create_new_cluster

The script will set up a cluster, upload the PySpark file, submit the job, print the result, and then, if it created the cluster, delete it.
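
The download step in that sequence amounts to reading the driver-output object back from the staging bucket, roughly as follows (the object path is a placeholder; the real one comes from the submitted job's `driver_output_resource_uri`):

    # Sketch only: fetch a job's driver output from GCS.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("your-staging-bucket")  # placeholder
    blob = bucket.blob("path/to/driveroutput.000000000")  # placeholder
    print(blob.download_as_bytes().decode("utf-8"))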

Optionally, you can add the `--pyspark_file` argument to replace the default `pyspark_sort.py` with a script of your own.