diff --git a/CHANGES.md b/CHANGES.md
index 1dd3908bf33a..4d51702caced 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -70,6 +70,7 @@
## New Features / Improvements
* X feature added (Java/Python) ([#X](https://github.com/apache/beam/issues/X)).
+* OpenAI text embeddings notebook added (Python) ([#37344](https://github.com/apache/beam/issues/37344)).
## Breaking Changes
diff --git a/examples/notebooks/beam-ml/data_preprocessing/open_ai_text_embeddings.ipynb b/examples/notebooks/beam-ml/data_preprocessing/open_ai_text_embeddings.ipynb
new file mode 100644
index 000000000000..bdd138422d88
--- /dev/null
+++ b/examples/notebooks/beam-ml/data_preprocessing/open_ai_text_embeddings.ipynb
@@ -0,0 +1,411 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 47,
+ "metadata": {
+ "id": "UmEFwsNs1OES"
+ },
+ "outputs": [],
+ "source": [
+ "# @title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZUSiAR62SgO8"
+ },
+ "source": [
+ "# Generate text embeddings by using OpenAI models"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "yvVIEhF01ZWq"
+ },
+ "source": [
+ "\n",
+ "Use text embeddings to represent text as numerical vectors. This process lets computers understand and process text data, which is essential for many natural language processing (NLP) tasks.\n",
+ "\n",
+ "The following NLP tasks use embeddings:\n",
+ "\n",
+ "* **Semantic search:** Find documents or passages that are relevant to a query when the query doesn't use the exact same words as the documents.\n",
+ "* **Text classification:** Categorize text data into different classes, such as spam and not spam, or positive sentiment and negative sentiment.\n",
+ "* **Machine translation:** Translate text from one language to another and preserve the meaning.\n",
+ "* **Text summarization:** Create shorter summaries of text.\n",
+ "\n",
+ "This notebook uses Apache Beam's `MLTransform` to generate embeddings from text data using OpenAI's embedding models.\n",
+ "\n",
+ "OpenAI provides powerful embedding models like `text-embedding-3-small` and `text-embedding-3-large` that can generate high-quality text embeddings. These models support configurable dimensions, allowing you to balance between embedding quality and storage/computation costs.\n",
+ "\n",
+ "To generate text embeddings using OpenAI models with `MLTransform`, use the `OpenAITextEmbeddings` class to specify the model configuration.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "jqYXaBJ821Zs"
+ },
+ "source": [
+ "## Install dependencies\n",
+ "\n",
+ "Install Apache Beam and the dependencies needed to work with OpenAI embeddings."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 48,
+ "metadata": {
+ "id": "shzCUrZI1XhF"
+ },
+ "outputs": [],
+ "source": [
+ "# pip reports that apache-beam 2.72.0.dev0 does not provide an 'openai' extra,\n",
+ "# so install the openai client package separately.\n",
+ "! pip install 'apache_beam[interactive]>=2.71.0' openai --quiet"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 49,
+ "metadata": {
+ "id": "jVxSi2jS3M3b"
+ },
+ "outputs": [],
+ "source": [
+ "import logging\n",
+ "logging.getLogger().setLevel(logging.ERROR)\n",
+ "\n",
+ "import tempfile\n",
+ "import apache_beam as beam\n",
+ "from apache_beam.ml.transforms.base import MLTransform\n",
+ "from apache_beam.ml.transforms.embeddings.open_ai import OpenAITextEmbeddings"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "kXDM8C7d3nPW"
+ },
+ "source": [
+ "## Set up your OpenAI API key\n",
+ "\n",
+ "To use OpenAI's embedding models, you need an API key. You can get one from [OpenAI's platform](https://platform.openai.com/api-keys).\n",
+ "\n",
+ "Set the `OPENAI_API_KEY` environment variable before running this notebook."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 50,
+ "metadata": {
+ "id": "auth_setup"
+ },
+ "outputs": [],
+ "source": [
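+ "import os\n",
+ "\n",
+ "# Fail early with a clear message if the key is missing.\n",
+ "OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')\n",
+ "if not OPENAI_API_KEY:\n",
+ "  raise ValueError('Set the OPENAI_API_KEY environment variable before running this notebook.')"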
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "kXDM8C7d3nPV"
+ },
+ "source": [
+ "## Process the data\n",
+ "\n",
+ "`MLTransform` is a `PTransform` that you can use for data preparation, including generating text embeddings.\n",
+ "\n",
+ "### Use MLTransform in write mode\n",
+ "\n",
+ "In `write` mode, `MLTransform` saves the transforms and their attributes to an artifact location. Then, when you run `MLTransform` in `read` mode, these transforms are used. This process ensures that you're applying the same preprocessing steps when you train your model and when you serve the model in production or test its accuracy.\n",
+ "\n",
+ "For more information about using `MLTransform`, see [Preprocess data with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in the Apache Beam documentation."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Dbkmu3HP6Kql"
+ },
+ "source": [
+ "### Get the data\n",
+ "\n",
+ "The following text inputs come from the Hugging Face blog [Getting Started With Embeddings](https://huggingface.co/blog/getting-started-with-embeddings).\n",
+ "\n",
+ "\n",
+ "`MLTransform` operates on dictionaries of data. To generate embeddings for specific columns, provide the column names as input to the `columns` argument in the `OpenAITextEmbeddings` class."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 51,
+ "metadata": {
+ "id": "LCTUs8F73iDg"
+ },
+ "outputs": [],
+ "source": [
+ "content = [\n",
+ " {'x': 'Apache Beam is an open source unified programming model.'},\n",
+ " {'x': 'It allows you to define both batch and streaming data pipelines.'},\n",
+ " {'x': 'Beam provides a portable API layer for building pipelines.'},\n",
+ " {'x': 'You can run Beam pipelines on multiple execution engines.'},\n",
+ " {'x': 'Runners execute pipelines on distributed processing backends.'},\n",
+ "]\n",
+ "\n",
+ "# Using text-embedding-3-small model - a cost-effective option.\n",
+ "# Other options: text-embedding-3-large, text-embedding-ada-002\n",
+ "text_embedding_model_name = 'text-embedding-3-small'\n",
+ "\n",
+ "# Helper function that returns a new dict containing only the first\n",
+ "# ten elements of each generated embedding. It builds a new dict because\n",
+ "# Beam transforms must not mutate their input elements.\n",
+ "def truncate_embeddings(d):\n",
+ "  return {key: value[:10] for key, value in d.items()}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "SApMmlRLRv_e"
+ },
+ "source": [
+ "\n",
+ "### Generate text embeddings\n",
+ "This example uses the model `text-embedding-3-small` to generate text embeddings. For more information about OpenAI embedding models, see [OpenAI's embeddings documentation](https://platform.openai.com/docs/guides/embeddings).\n",
+ "\n",
+ "The `text-embedding-3-small` model produces 1536-dimensional embeddings by default, but you can use the `dimensions` parameter to reduce this."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 52,
+ "metadata": {
+ "id": "SF6izkN134sf"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Embedding shape: 1536\n",
+ "{'x': [-0.009753434918820858, 0.0038477969355881214, 0.03195132687687874, -0.003576868213713169, 0.024280086159706116, -0.003038055030629039, -0.009223753586411476, -0.004340948071330786, 0.018362272530794144, -0.05050842463970184]}\n",
+ "Embedding shape: 1536\n",
+ "{'x': [0.012837349437177181, -0.024922415614128113, 0.04935948923230171, -0.029872925952076912, 0.007735170423984528, -0.03977394476532936, 0.010453097522258759, 0.015591677278280258, -0.031911369413137436, 0.016720101237297058]}\n",
+ "Embedding shape: 1536\n",
+ "{'x': [0.00837371964007616, -0.02793511003255844, 0.034448761492967606, -0.007256315555423498, 0.01797385886311531, -0.014417242258787155, -0.026722317561507225, -0.022675134241580963, 0.024800928309559822, 0.0005595539114437997]}\n",
+ "Embedding shape: 1536\n",
+ "{'x': [0.018137292936444283, 0.0439058393239975, 0.08075538277626038, -0.06779270619153976, -0.004390583839267492, -0.0013206052826717496, -0.01129007339477539, 0.009728540666401386, -0.036117780953645706, -0.013060680590569973]}\n",
+ "Embedding shape: 1536\n",
+ "{'x': [0.019719384610652924, 0.03634314239025116, 0.0876525416970253, -0.036635641008615494, 0.03017626516520977, -0.02396063692867756, 0.03229689225554466, 0.016367821022868156, 0.01803750917315483, 0.019719384610652924]}\n"
+ ]
+ }
+ ],
+ "source": [
+ "artifact_location = tempfile.mkdtemp(prefix='openai_')\n",
+ "embedding_transform = OpenAITextEmbeddings(\n",
+ " model_name=text_embedding_model_name,\n",
+ " columns=['x'],\n",
+ " api_key=OPENAI_API_KEY)\n",
+ "\n",
+ "with beam.Pipeline() as pipeline:\n",
+ " data_pcoll = (\n",
+ " pipeline\n",
+ " | \"CreateData\" >> beam.Create(content))\n",
+ " transformed_pcoll = (\n",
+ " data_pcoll\n",
+ " | \"MLTransform\" >> MLTransform(write_artifact_location=artifact_location).with_transform(embedding_transform))\n",
+ "\n",
+ " transformed_pcoll | beam.Map(truncate_embeddings) | 'LogOutput' >> beam.Map(print)\n",
+ "\n",
+ " transformed_pcoll | \"PrintEmbeddingShape\" >> beam.Map(lambda x: print(f\"Embedding shape: {len(x['x'])}\"))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "1MFom0PW_vRv"
+ },
+ "source": [
+ "### Configure embedding dimensions\n",
+ "\n",
+ "OpenAI's newer embedding models (`text-embedding-3-small` and `text-embedding-3-large`) support the `dimensions` parameter, which allows you to shorten embeddings without losing too much accuracy. This can help reduce storage costs and improve computation speed.\n",
+ "\n",
+ "For example, you can reduce the dimensions from 1536 to 256:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 53,
+ "metadata": {
+ "id": "xyezKuzY_uLD"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Embedding shape: 256\n",
+ "{'x': [-0.020514478906989098, 0.00809310283511877, 0.06720348447561264, -0.00752325588837266, 0.0510685034096241, -0.006389965768903494, -0.019400397315621376, -0.00913035124540329, 0.03862151503562927, -0.10623478144407272]}\n",
+ "Embedding shape: 256\n",
+ "{'x': [0.025633791461586952, -0.049765415489673615, 0.09856168925762177, -0.05965065583586693, 0.015445691533386707, -0.07942114025354385, 0.02087288349866867, 0.03113366849720478, -0.06372105330228806, 0.033386923372745514]}\n",
+ "Embedding shape: 256\n",
+ "{'x': [0.017627887427806854, -0.058807432651519775, 0.07251960784196854, -0.015275589190423489, 0.03783756121993065, -0.030350372195243835, -0.05625433102250099, -0.047734424471855164, 0.0522095263004303, 0.0011779415654018521]}\n",
+ "Embedding shape: 256\n",
+ "{'x': [0.03754354268312454, 0.09088350087404251, 0.1671607345342636, -0.14032845199108124, -0.009088350459933281, -0.0027336054481565952, -0.0233700443059206, 0.02013772912323475, -0.0747625008225441, -0.02703513763844967]}\n",
+ "Embedding shape: 256\n",
+ "{'x': [0.03932701796293259, 0.07248032838106155, 0.1748083531856537, -0.0730636715888977, 0.06018151715397835, -0.04778548702597618, 0.0644107535481453, 0.03264288231730461, 0.03597279638051987, 0.03932701796293259]}\n"
+ ]
+ }
+ ],
+ "source": [
+ "artifact_location_with_dims = tempfile.mkdtemp(prefix='openai_dims_')\n",
+ "\n",
+ "embedding_transform_with_dims = OpenAITextEmbeddings(\n",
+ " model_name=text_embedding_model_name,\n",
+ " columns=['x'],\n",
+ " api_key=OPENAI_API_KEY,\n",
+ " dimensions=256 # Reduce embedding dimensions\n",
+ " )\n",
+ "\n",
+ "with beam.Pipeline() as pipeline:\n",
+ " data_pcoll = (\n",
+ " pipeline\n",
+ " | \"CreateData\" >> beam.Create(content))\n",
+ " transformed_pcoll = (\n",
+ " data_pcoll\n",
+ " | \"MLTransform\" >> MLTransform(write_artifact_location=artifact_location_with_dims).with_transform(embedding_transform_with_dims))\n",
+ "\n",
+ " transformed_pcoll | beam.Map(truncate_embeddings) | 'LogOutput' >> beam.Map(print)\n",
+ "\n",
+ " transformed_pcoll | \"PrintEmbeddingShape\" >> beam.Map(lambda x: print(f\"Embedding shape: {len(x['x'])}\"))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "aPIQzCoF_EBj"
+ },
+ "source": [
+ "### Use MLTransform in read mode\n",
+ "\n",
+ "In `read` mode, `MLTransform` uses the artifacts generated during `write` mode. In this case, the `OpenAITextEmbeddings` transform and its attributes are loaded from the saved artifacts, so you don't need to specify the transform again. You only provide the artifact location.\n",
+ "\n",
+ "In this way, `MLTransform` provides consistent preprocessing steps for training and inference workloads."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 54,
+ "metadata": {
+ "id": "RCqYeUd3_F3C"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'x': [0.00047562166582793, 0.02407737262547016, 0.10070080310106277, -0.030040575191378593, 0.011932645924389362, -0.030789095908403397, 0.04226639121770859, 0.033109504729509354, 0.021158145740628242, -0.01302423607558012]}\n",
+ "{'x': [0.007936494424939156, -0.01662219502031803, 0.03890787065029144, -0.03299042209982872, -0.011180933564901352, -0.041066598147153854, 0.04319992661476135, -0.009568237699568272, -0.03382851555943489, 0.03431105241179466]}\n",
+ "{'x': [0.022309930995106697, -0.03803618252277374, 0.04467129707336426, 0.023711536079645157, 0.014851856976747513, 0.0012834640219807625, -0.00548747181892395, -0.005561409518122673, -0.023698676377534866, -0.018503742292523384]}\n"
+ ]
+ }
+ ],
+ "source": [
+ "test_content = [\n",
+ " {'x': 'What runners does Apache Beam support?'},\n",
+ " {'x': 'How do I create a streaming pipeline?'},\n",
+ " {'x': 'A PCollection represents a distributed dataset.'},\n",
+ "]\n",
+ "\n",
+ "# Uses the saved artifacts to generate text embeddings.\n",
+ "with beam.Pipeline() as pipeline:\n",
+ " data_pcoll = (\n",
+ " pipeline\n",
+ " | \"CreateData\" >> beam.Create(test_content))\n",
+ " transformed_pcoll = (\n",
+ " data_pcoll\n",
+ " | \"MLTransform\" >> MLTransform(read_artifact_location=artifact_location))\n",
+ "\n",
+ " transformed_pcoll | beam.Map(truncate_embeddings) | 'LogOutput' >> beam.Map(print)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "l31V3Q0Uo41z"
+ },
+ "source": [
+ "## Next steps\n",
+ "\n",
+ "Now that you've generated embeddings, you can use `MLTransform` and sinks to ingest your data into a vector database. For this and more advanced concepts, see the following notebooks:\n",
+ "\n",
+ "- [Vector Embedding Ingestion with Apache Beam and AlloyDB](https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/alloydb_product_catalog_embeddings.ipynb)\n",
+ "- [Embedding Ingestion and Vector Search with Apache Beam and BigQuery](https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/bigquery_vector_ingestion_and_search.ipynb)\n",
+ "- [Vector Embedding Ingestion with Apache Beam and CloudSQL Postgres](https://colab.sandbox.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/cloudsql_postgres_product_catalog_embeddings.ipynb)"
+ ]
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.13.11"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/website/www/site/content/en/documentation/ml/preprocess-data.md b/website/www/site/content/en/documentation/ml/preprocess-data.md
index fb21cc8928b2..25b52af6740c 100644
--- a/website/www/site/content/en/documentation/ml/preprocess-data.md
+++ b/website/www/site/content/en/documentation/ml/preprocess-data.md
@@ -55,10 +55,11 @@ You can use `MLTransform` to generate text embeddings and to perform various dat
You can use `MLTranform` to generate embeddings that you can use to push data into vector databases or to run inference.
{{< table >}}
-| Transform name | Description |
-| ------- | ---------------|
-| SentenceTransformerEmbeddings | Uses the Hugging Face [`sentence-transformers`](https://huggingface.co/sentence-transformers) models to generate text embeddings.
-| VertexAITextEmbeddings | Uses models from the [the Vertex AI text-embeddings API](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings) to generate text embeddings.
+| Transform name | Description | Notebook |
+| ------- | ---------------| -------- |
+| SentenceTransformerEmbeddings | Uses the Hugging Face [`sentence-transformers`](https://huggingface.co/sentence-transformers) models to generate text embeddings. | [Hugging Face Text Embeddings](https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/huggingface_text_embeddings.ipynb) |
+| VertexAITextEmbeddings | Uses models from [the Vertex AI text-embeddings API](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings) to generate text embeddings. | [Vertex AI Text Embeddings](https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/vertex_ai_text_embeddings.ipynb) |
+| OpenAITextEmbeddings | Uses [OpenAI's embedding models](https://platform.openai.com/docs/guides/embeddings) to generate text embeddings. | [OpenAI Text Embeddings](https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/beam-ml/data_preprocessing/open_ai_text_embeddings.ipynb) |
{{< /table >}}