diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 239f70dce8..01595f8ca8 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -30,6 +30,14 @@ repos:
     language: python
     additional_dependencies:
     - https://files.pythonhosted.org/packages/cc/20/ff623b09d963f88bfde16306a54e12ee5ea43e9b597108672ff3a408aad6/pathspec-0.12.1-py3-none-any.whl
+
+  - id: no-markdown-in-docs-source
+    name: Prevent markdown files in docs/source directories
+    entry: bash -c
+    args: ['if find . -path "*/docs/source/*.md" -not -path "./docs/README.md" | grep -q .; then echo "ERROR: Markdown files found in docs/source/ directories. Use reST (.rst) instead."; exit 1; fi']
+    language: system
+    pass_filenames: false
+    always_run: true
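+    # Run this hook manually with: pre-commit run no-markdown-in-docs-source --all-files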
 
 - repo: https://github.com/PyCQA/bandit
   rev: 2d0b675b04c80ae42277e10500db06a0a37bae17  # frozen: 1.8.6
diff --git a/cuda_bindings/docs/source/_static/logo-dark-mode.png b/cuda_bindings/docs/source/_static/logo-dark-mode.png
deleted file mode 100644
index 6b005a283b..0000000000
Binary files a/cuda_bindings/docs/source/_static/logo-dark-mode.png and /dev/null differ
diff --git a/cuda_bindings/docs/source/_static/logo-light-mode.png b/cuda_bindings/docs/source/_static/logo-light-mode.png
deleted file mode 100644
index c07d6848c9..0000000000
Binary files a/cuda_bindings/docs/source/_static/logo-light-mode.png and /dev/null differ
diff --git a/cuda_core/docs/source/conduct.md b/cuda_bindings/docs/source/conduct.rst
similarity index 82%
rename from cuda_core/docs/source/conduct.md
rename to cuda_bindings/docs/source/conduct.rst
index ccc1e4a43a..b70d9dd7ce 100644
--- a/cuda_core/docs/source/conduct.md
+++ b/cuda_bindings/docs/source/conduct.rst
@@ -1,10 +1,16 @@
-# Code of Conduct
+.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE
 
-## Overview
+Code of Conduct
+===============
 
-Define the code of conduct followed and enforced for the `cuda.core` project.
+Overview
+--------
 
-## Our Pledge
+Define the code of conduct followed and enforced for the ``cuda.bindings`` project.
+
+Our Pledge
+----------
 
 In the interest of fostering an open and welcoming environment, we as
 contributors and maintainers pledge to making participation in our project and
@@ -13,7 +19,8 @@ size, disability, ethnicity, sex characteristics, gender identity and expression
 level of experience, education, socio-economic status, nationality, personal
 appearance, race, religion, or sexual identity and orientation.
 
-## Our Standards
+Our Standards
+-------------
 
 Examples of behavior that contributes to creating a positive environment
 include:
@@ -35,7 +42,8 @@ Examples of unacceptable behavior by participants include:
 * Other conduct which could reasonably be considered inappropriate in a
   professional setting
 
-## Our Responsibilities
+Our Responsibilities
+--------------------
 
 Project maintainers are responsible for clarifying the standards of acceptable
 behavior and are expected to take appropriate and fair corrective action in
@@ -47,7 +55,8 @@ that are not aligned to this Code of Conduct, or to ban temporarily or
 permanently any contributor for other behaviors that they deem inappropriate,
 threatening, offensive, or harmful.
 
-## Scope
+Scope
+-----
 
 This Code of Conduct applies both within project spaces and in public spaces
 when an individual is representing the project or its community. Examples of
@@ -56,11 +65,12 @@ address, posting via an official social media account, or acting as an appointed
 representative at an online or offline event. Representation of a project may be
 further defined and clarified by project maintainers.
 
-## Enforcement
+Enforcement
+-----------
 
 Instances of abusive, harassing, or otherwise unacceptable behavior may be
 reported by contacting the project team at
-[cuda-python-conduct@nvidia.com](mailto:cuda-python-conduct@nvidia.com) All
+`cuda-python-conduct@nvidia.com <mailto:cuda-python-conduct@nvidia.com>`_. All
 complaints will be reviewed and investigated and will result in a response that
 is deemed necessary and appropriate to the circumstances. The project team is
 obligated to maintain confidentiality with regard to the reporter of an
@@ -71,12 +81,11 @@ Project maintainers who do not follow or enforce the Code of Conduct in good
 faith may face temporary or permanent repercussions as determined by other
 members of the project's leadership.
 
-## Attribution
+Attribution
+-----------
 
-This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
+This Code of Conduct is adapted from the `Contributor Covenant <https://www.contributor-covenant.org>`_, version 1.4,
 available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
 
-[homepage]: https://www.contributor-covenant.org
-
 For answers to common questions about this code of conduct, see
 https://www.contributor-covenant.org/faq
diff --git a/cuda_bindings/docs/source/contribute.md b/cuda_bindings/docs/source/contribute.md
deleted file mode 100644
index d26f117234..0000000000
--- a/cuda_bindings/docs/source/contribute.md
+++ /dev/null
@@ -1,12 +0,0 @@
-# Contributing
-
-Thank you for your interest in contributing to `cuda-bindings`! Based on the type of contribution, it will fall into two categories:
-
-1. You want to report a bug, feature request, or documentation issue
-   - File an [issue](https://github.com/NVIDIA/cuda-python/issues/new/choose)
-     describing what you encountered or what you want to see changed.
-   - The NVIDIA team will evaluate the issues and triage them, scheduling
-     them for a release. If you believe the issue needs priority attention
-     comment on the issue to notify the team.
-2. You want to implement a feature, improvement, or bug fix:
-   - At this time we do not accept code contributions.
diff --git a/cuda_bindings/docs/source/contribute.rst b/cuda_bindings/docs/source/contribute.rst
new file mode 100644
index 0000000000..20c7f51bc9
--- /dev/null
+++ b/cuda_bindings/docs/source/contribute.rst
@@ -0,0 +1,15 @@
+.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE
+
+Contributing
+============
+
+Thank you for your interest in contributing to ``cuda-bindings``! Based on the type of contribution, it will fall into two categories:
+
+1. You want to report a bug, feature request, or documentation issue
+
+   - File an `issue <https://github.com/NVIDIA/cuda-python/issues/new/choose>`_ describing what you encountered or what you want to see changed.
+   - The NVIDIA team will evaluate the issues and triage them, scheduling
+     them for a release. If you believe the issue needs priority attention,
+     comment on the issue to notify the team.
+
+2. You want to implement a feature, improvement, or bug fix:
+
+   - At this time we do not accept code contributions.
diff --git a/cuda_bindings/docs/source/environment_variables.md b/cuda_bindings/docs/source/environment_variables.md
deleted file mode 100644
index 7329e582cf..0000000000
--- a/cuda_bindings/docs/source/environment_variables.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Environment Variables
-
-## Build-Time Environment Variables
-
-- `CUDA_HOME` or `CUDA_PATH`: Specifies the location of the CUDA Toolkit.
-
-- `CUDA_PYTHON_PARSER_CACHING` : bool, toggles the caching of parsed header files during the cuda-bindings build process. If caching is enabled (`CUDA_PYTHON_PARSER_CACHING` is True), the cache path is set to ./cache_, where is derived from the cuda toolkit libraries used to build cuda-bindings.
-
-- `CUDA_PYTHON_PARALLEL_LEVEL` (previously `PARALLEL_LEVEL`) : int, sets the number of threads used in the compilation of extension modules. Not setting it or setting it to 0 would disable parallel builds.
-
-## Runtime Environment Variables
-
-- `CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM` : When set to 1, the default stream is the per-thread default stream. When set to 0, the default stream is the legacy default stream. This defaults to 0, for the legacy default stream. See [Stream Synchronization Behavior](https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html) for an explanation of the legacy and per-thread default streams.
diff --git a/cuda_bindings/docs/source/environment_variables.rst b/cuda_bindings/docs/source/environment_variables.rst
new file mode 100644
index 0000000000..c582fe57b3
--- /dev/null
+++ b/cuda_bindings/docs/source/environment_variables.rst
@@ -0,0 +1,21 @@
+.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE
+
+Environment Variables
+=====================
+
+Runtime Environment Variables
+-----------------------------
+
+- ``CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM`` : When set to 1, the default stream is the per-thread default stream. When set to 0, the default stream is the legacy default stream. This defaults to 0, i.e. the legacy default stream. See `Stream Synchronization Behavior <https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html>`_ for an explanation of the legacy and per-thread default streams.
+
+
+Build-Time Environment Variables
+--------------------------------
+
+- ``CUDA_HOME`` or ``CUDA_PATH``: Specifies the location of the CUDA Toolkit.
+
+- ``CUDA_PYTHON_PARSER_CACHING`` : bool, toggles the caching of parsed header files during the cuda-bindings build process. If caching is enabled (``CUDA_PYTHON_PARSER_CACHING`` is True), the cache path is set to a ``cache_``-prefixed directory whose name is derived from the CUDA Toolkit libraries used to build cuda-bindings.
+
+- ``CUDA_PYTHON_PARALLEL_LEVEL`` (previously ``PARALLEL_LEVEL``) : int, sets the number of threads used to compile the extension modules. Leaving it unset, or setting it to 0, disables parallel builds.
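+
+For example, a source build that enables parser caching and eight compile threads might be invoked as follows (a sketch; the values are illustrative):
+
+.. code-block:: console
+
+   $ export CUDA_PYTHON_PARSER_CACHING=1
+   $ export CUDA_PYTHON_PARALLEL_LEVEL=8
+   $ pip install -v .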
+
diff --git a/cuda_bindings/docs/source/index.rst b/cuda_bindings/docs/source/index.rst
index 5fc9418514..3501b26a5c 100644
--- a/cuda_bindings/docs/source/index.rst
+++ b/cuda_bindings/docs/source/index.rst
@@ -9,15 +9,15 @@
    :caption: Contents:
 
    release
-   install.md
-   overview.md
-   motivation.md
-   environment_variables.md
+   install
+   overview
+   motivation
+   environment_variables
    api
    tips_and_tricks
    support
-   contribute.md
-   conduct.md
+   contribute
+   conduct
    license
diff --git a/cuda_bindings/docs/source/install.md b/cuda_bindings/docs/source/install.md
deleted file mode 100644
index b7c693b9c2..0000000000
--- a/cuda_bindings/docs/source/install.md
+++ /dev/null
@@ -1,88 +0,0 @@
-# Installation
-
-## Runtime Requirements
-
-`cuda.bindings` supports the same platforms as CUDA. Runtime dependencies are:
-
-* Linux (x86-64, arm64) and Windows (x86-64)
-* Python 3.9 - 3.13
-* Driver: Linux (580.65.06 or later) Windows (580.88 or later)
-* Optionally, NVRTC, nvJitLink, NVVM, and cuFile from CUDA Toolkit 13.x
-
-```{note}
-The optional CUDA Toolkit components are now installed via the `cuda-toolkit` metapackage from PyPI for improved dependency resolution. Components can also be installed via Conda, OS-specific package managers, or local installers (as described in the CUDA Toolkit [Windows](https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html) and [Linux](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html) Installation Guides).
-```
-
-Starting from v12.8.0, `cuda-python` becomes a meta package which currently depends only on `cuda-bindings`; in the future more sub-packages will be added to `cuda-python`. In the instructions below, we still use `cuda-python` as example to serve existing users, but everything is applicable to `cuda-bindings` as well.
-
-
-## Installing from PyPI
-
-```console
-$ pip install -U cuda-python
-```
-
-Install all optional dependencies with:
-```{code-block} shell
-pip install -U cuda-python[all]
-```
-
-Where the optional dependencies include:
-
-* `nvidia-cuda-nvrtc` (NVRTC runtime compilation library)
-* `nvidia-nvjitlink` (nvJitLink library)
-* `nvidia-nvvm` (NVVM library)
-* `nvidia-cufile` (cuFile library, Linux only)
-
-These are now installed through the `cuda-toolkit` metapackage for improved dependency resolution.
-
-
-## Installing from Conda
-
-```console
-$ conda install -c conda-forge cuda-python
-```
-
-```{note}
-When using conda, the `cuda-version` metapackage can be used to control the versions of CUDA Toolkit components that are installed to the conda environment.
-```
-
-For example:
-```console
-$ conda install -c conda-forge cuda-python cuda-version=13
-```
-
-
-## Installing from Source
-
-### Requirements
-
-* CUDA Toolkit headers[^1]
-* CUDA Runtime static library[^2]
-
-[^1]: User projects that `cimport` CUDA symbols in Cython must also use CUDA Toolkit (CTK) types as provided by the `cuda.bindings` major.minor version. This results in CTK headers becoming a transitive dependency of downstream projects through CUDA Python.
-
-[^2]: The CUDA Runtime static library (`libcudart_static.a` on Linux, `cudart_static.lib` on Windows) is part of the CUDA Toolkit. If using conda packages, it is contained in the `cuda-cudart-static` package.
-
-Source builds require that the provided CUDA headers are of the same major.minor version as the `cuda.bindings` you're trying to build. Despite this requirement, note that the minor version compatibility is still maintained. Use the `CUDA_HOME` (or `CUDA_PATH`) environment variable to specify the location of your headers. For example, if your headers are located in `/usr/local/cuda/include`, then you should set `CUDA_HOME` with:
-
-```console
-$ export CUDA_HOME=/usr/local/cuda
-```
-
-See [Environment Variables](environment_variables.md) for a description of other build-time environment variables.
-
-```{note}
-Only `cydriver`, `cyruntime` and `cynvrtc` are impacted by the header requirement.
-```
-
-
-### Editable Install
-
-You can use
-
-```console
-$ pip install -v -e .
-```
-
-to install the module as editable in your current Python environment (e.g. for testing of porting other libraries to use the binding).
diff --git a/cuda_bindings/docs/source/install.rst b/cuda_bindings/docs/source/install.rst
new file mode 100644
index 0000000000..b9335b4876
--- /dev/null
+++ b/cuda_bindings/docs/source/install.rst
@@ -0,0 +1,96 @@
+.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE
+
+Installation
+============
+
+Runtime Requirements
+--------------------
+
+``cuda.bindings`` supports the same platforms as CUDA. Runtime dependencies are:
+
+* Linux (x86-64, arm64) and Windows (x86-64)
+* Python 3.9 - 3.13
+* Driver: Linux (580.65.06 or later) Windows (580.88 or later)
+* Optionally, NVRTC, nvJitLink, NVVM, and cuFile from CUDA Toolkit 13.x
+
+.. note::
+
+   The optional CUDA Toolkit components are now installed via the ``cuda-toolkit`` metapackage from PyPI for improved dependency resolution. Components can also be installed via Conda, OS-specific package managers, or local installers (as described in the CUDA Toolkit `Windows <https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html>`_ and `Linux <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ Installation Guides).
+
+Starting from v12.8.0, ``cuda-python`` becomes a meta package which currently depends only on ``cuda-bindings``; in the future more sub-packages will be added to ``cuda-python``. In the instructions below, we still use ``cuda-python`` as an example to serve existing users, but everything is applicable to ``cuda-bindings`` as well.
+
+Installing from PyPI
+--------------------
+
+.. code-block:: console
+
+   $ pip install -U cuda-python
+
+Install all optional dependencies with:
+
+.. code-block:: console
+
+   $ pip install -U cuda-python[all]
+
+Where the optional dependencies include:
+
+* ``nvidia-cuda-nvrtc`` (NVRTC runtime compilation library)
+* ``nvidia-nvjitlink`` (nvJitLink library)
+* ``nvidia-nvvm`` (NVVM library)
+* ``nvidia-cufile`` (cuFile library, Linux only)
+
+These are now installed through the ``cuda-toolkit`` metapackage for improved dependency resolution.
+
+Installing from Conda
+---------------------
+
+.. code-block:: console
+
+   $ conda install -c conda-forge cuda-python
+
+.. note::
+
+   When using conda, the ``cuda-version`` metapackage can be used to control the versions of CUDA Toolkit components that are installed to the conda environment.
+
+For example:
+
+.. code-block:: console
+
+   $ conda install -c conda-forge cuda-python cuda-version=13
+
+Installing from Source
+----------------------
+
+Requirements
+^^^^^^^^^^^^
+
+* CUDA Toolkit headers [#f1]_
+* CUDA Runtime static library [#f2]_
+
+.. [#f1] User projects that ``cimport`` CUDA symbols in Cython must also use CUDA Toolkit (CTK) types as provided by the ``cuda.bindings`` major.minor version. This results in CTK headers becoming a transitive dependency of downstream projects through CUDA Python.
+
+.. [#f2] The CUDA Runtime static library (``libcudart_static.a`` on Linux, ``cudart_static.lib`` on Windows) is part of the CUDA Toolkit. If using conda packages, it is contained in the ``cuda-cudart-static`` package.
+
+Source builds require that the provided CUDA headers are of the same major.minor version as the ``cuda.bindings`` you're trying to build. Despite this requirement, note that minor version compatibility is still maintained. Use the ``CUDA_HOME`` (or ``CUDA_PATH``) environment variable to specify the location of your headers. For example, if your headers are located in ``/usr/local/cuda/include``, then you should set ``CUDA_HOME`` with:
+
+.. code-block:: console
+
+   $ export CUDA_HOME=/usr/local/cuda
+
+See :doc:`Environment Variables <environment_variables>` for a description of other build-time environment variables.
+
+.. note::
+
+   Only ``cydriver``, ``cyruntime`` and ``cynvrtc`` are impacted by the header requirement.
+
+Editable Install
+^^^^^^^^^^^^^^^^
+
+You can use:
+
+.. code-block:: console
+
+   $ pip install -v -e .
+
+to install the module as editable in your current Python environment (e.g. for testing of porting other libraries to use the binding).
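+
+As a quick smoke test (illustrative, and assuming a supported driver is present), importing the bindings and initializing the driver should succeed:
+
+.. code-block:: console
+
+   $ python -c "from cuda.bindings import driver; assert driver.cuInit(0)[0] == driver.CUresult.CUDA_SUCCESS"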
diff --git a/cuda_bindings/docs/source/motivation.md b/cuda_bindings/docs/source/motivation.rst
similarity index 73%
rename from cuda_bindings/docs/source/motivation.md
rename to cuda_bindings/docs/source/motivation.rst
index 5b8879f2bc..afbd3412d4 100644
--- a/cuda_bindings/docs/source/motivation.md
+++ b/cuda_bindings/docs/source/motivation.rst
@@ -1,7 +1,12 @@
-# Motivation
-## What is CUDA Python?
+.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE
 
-NVIDIA’s CUDA Python provides [Cython](https://cython.org/) bindings and Python
+Motivation
+==========
+What is CUDA Python?
+--------------------
+
+NVIDIA’s CUDA Python provides `Cython <https://cython.org/>`_ bindings and Python
 wrappers for the driver and runtime API for existing toolkits and libraries to
 simplify GPU-based accelerated processing. Python is one of the most popular
 programming languages for science, engineering, data analytics, and deep
@@ -9,14 +14,15 @@ learning applications. The goal of CUDA Python is to unify the Python ecosystem
 with a single set of interfaces that provide full coverage of and access to the
 CUDA host APIs from Python.
 
-## Why CUDA Python?
+Why CUDA Python?
+----------------
 
 CUDA Python provides uniform APIs and bindings for inclusion into existing
 toolkits and libraries to simplify GPU-based parallel processing for HPC, data
 science, and AI.
 
-[Numba](https://numba.pydata.org/), a Python compiler from
-[Anaconda](https://www.anaconda.com/) that can compile Python code for execution
+`Numba <https://numba.pydata.org/>`_, a Python compiler from
+`Anaconda <https://www.anaconda.com/>`_ that can compile Python code for execution
 on CUDA-capable GPUs, provides Python developers with an easy entry into
 GPU-accelerated computing and a path for using increasingly sophisticated CUDA
 code with a minimum of new syntax and jargon. Numba has its own CUDA driver API
@@ -24,9 +30,9 @@ bindings that can now be replaced with CUDA Python. With CUDA Python and Numba,
 you get the best of both worlds: rapid iterative development with Python and
 the speed of a compiled language targeting both CPUs and NVIDIA GPUs.
 
-[CuPy](https://cupy.dev/) is a
-[NumPy](https://numpy.org/)/[SciPy](https://www.scipy.org/) compatible Array
-library, from [Preferred Networks](https://www.preferred.jp/en/), for
+`CuPy <https://cupy.dev/>`_ is a
+`NumPy <https://numpy.org/>`_/`SciPy <https://www.scipy.org/>`_ compatible Array
+library, from `Preferred Networks <https://www.preferred.jp/en/>`_, for
 GPU-accelerated computing with Python. CUDA Python simplifies the CuPy build
 and allows for a faster and smaller memory footprint when importing the CuPy
 Python module. In the future, when more CUDA Toolkit libraries are supported,
diff --git a/cuda_bindings/docs/source/overview.md b/cuda_bindings/docs/source/overview.md
deleted file mode 100644
index 1168d926f5..0000000000
--- a/cuda_bindings/docs/source/overview.md
+++ /dev/null
@@ -1,558 +0,0 @@
-# Overview
-
-Python plays a key role within the science, engineering, data analytics, and
-deep learning application ecosystem. NVIDIA has long been committed to helping
-the Python ecosystem leverage the accelerated massively parallel performance of
-GPUs to deliver standardized libraries, tools, and applications. Today, we're
-introducing another step towards simplification of the developer experience with
-improved Python code portability and compatibility.
-
-Our goal is to help unify the Python CUDA ecosystem with a single standard set
-of low-level interfaces, providing full coverage and access to the CUDA host
-APIs from Python. We want to provide an ecosystem foundation to allow
-interoperability among different accelerated libraries. Most importantly, it
-should be easy for Python developers to use NVIDIA GPUs.
-
-## `cuda.bindings` workflow
-
-Because Python is an interpreted language, you need a way to compile the device
-code into
-[PTX](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html) and
-then extract the function to be called at a later point in the application. You
-construct your device code in the form of a string and compile it with
-[NVRTC](http://docs.nvidia.com/cuda/nvrtc/index.html), a runtime compilation
-library for CUDA C++. Using the NVIDIA [Driver
-API](http://docs.nvidia.com/cuda/cuda-driver-api/index.html), manually create a
-CUDA context and all required resources on the GPU, then launch the compiled
-CUDA C++ code and retrieve the results from the GPU. Now that you have an
-overview, jump into a commonly used example for parallel programming:
-[SAXPY](https://developer.nvidia.com/blog/six-ways-saxpy/).
-
-The first thing to do is import the [Driver
-API](https://docs.nvidia.com/cuda/cuda-driver-api/index.html) and
-[NVRTC](https://docs.nvidia.com/cuda/nvrtc/index.html) modules from the `cuda.bindings`
-package. Next, we consider how to store host data and pass it to the device. Different
-approaches can be used to accomplish this and are described in [Preparing kernel
-arguments](https://nvidia.github.io/cuda-python/cuda-bindings/latest/overview.html#preparing-kernel-arguments).
-In this example, we will use NumPy to store host data and pass it to the device, so let's
-import this dependency as well.
-
-```python
-from cuda.bindings import driver, nvrtc
-import numpy as np
-```
-
-Error checking is a fundamental best practice when working with low-level interfaces.
-The following code snippet lets us validate each API call and raise exceptions in case of error.
-
-```python
-def _cudaGetErrorEnum(error):
-    if isinstance(error, driver.CUresult):
-        err, name = driver.cuGetErrorName(error)
-        return name if err == driver.CUresult.CUDA_SUCCESS else "<unknown>"
-    elif isinstance(error, nvrtc.nvrtcResult):
-        return nvrtc.nvrtcGetErrorString(error)[1]
-    else:
-        raise RuntimeError('Unknown error type: {}'.format(error))
-
-def checkCudaErrors(result):
-    if result[0].value:
-        raise RuntimeError("CUDA error code={}({})".format(result[0].value, _cudaGetErrorEnum(result[0])))
-    if len(result) == 1:
-        return None
-    elif len(result) == 2:
-        return result[1]
-    else:
-        return result[1:]
-```
-
-It's common practice to write CUDA kernels near the top of a translation unit,
-so write it next. The entire kernel is wrapped in triple quotes to form a
-string. The string is compiled later using NVRTC. This is the only part of CUDA
-Python that requires some understanding of CUDA C++. For more information, see
-[An Even Easier Introduction to
-CUDA](https://developer.nvidia.com/blog/even-easier-introduction-cuda/).
-
-```python
-saxpy = """\
-extern "C" __global__
-void saxpy(float a, float *x, float *y, float *out, size_t n)
-{
-    size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
-    if (tid < n) {
-        out[tid] = a * x[tid] + y[tid];
-    }
-}
-"""
-```
-
-Go ahead and compile the kernel into PTX. Remember that this is executed at runtime using NVRTC. There are three basic steps to NVRTC:
-
-- Create a program from the string.
-- Compile the program.
-- Extract PTX from the compiled program.
-
-In the following code example, the Driver API is initialized so that the NVIDIA driver
-and GPU are accessible. Next, the GPU is queried for their compute capability. Finally,
-the program is compiled to target our local compute capability architecture with FMAD disabled.
-
-```python
-# Initialize CUDA Driver API
-checkCudaErrors(driver.cuInit(0))
-
-# Retrieve handle for device 0
-cuDevice = checkCudaErrors(driver.cuDeviceGet(0))
-
-# Derive target architecture for device 0
-major = checkCudaErrors(driver.cuDeviceGetAttribute(driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, cuDevice))
-minor = checkCudaErrors(driver.cuDeviceGetAttribute(driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, cuDevice))
-arch_arg = bytes(f'--gpu-architecture=compute_{major}{minor}', 'ascii')
-
-# Create program
-prog = checkCudaErrors(nvrtc.nvrtcCreateProgram(str.encode(saxpy), b"saxpy.cu", 0, [], []))
-
-# Compile program
-opts = [b"--fmad=false", arch_arg]
-checkCudaErrors(nvrtc.nvrtcCompileProgram(prog, 2, opts))
-
-# Get PTX from compilation
-ptxSize = checkCudaErrors(nvrtc.nvrtcGetPTXSize(prog))
-ptx = b" " * ptxSize
-checkCudaErrors(nvrtc.nvrtcGetPTX(prog, ptx))
-```
-
-Before you can use the PTX or do any work on the GPU, you must create a CUDA
-context. CUDA contexts are analogous to host processes for the device. In the
-following code example, a handle for compute device 0 is passed to
-`cuCtxCreate` to designate that GPU for context creation.
-
-```python
-# Create context
-context = checkCudaErrors(driver.cuCtxCreate(0, cuDevice))
-```
-
-With a CUDA context created on device 0, load the PTX generated earlier into a
-module. A module is analogous to dynamically loaded libraries for the device.
-After loading into the module, extract a specific kernel with
-`cuModuleGetFunction`. It is not uncommon for multiple kernels to reside in PTX.
-
-```python
-# Load PTX as module data and retrieve function
-ptx = np.char.array(ptx)
-# Note: Incompatible --gpu-architecture would be detected here
-module = checkCudaErrors(driver.cuModuleLoadData(ptx.ctypes.data))
-kernel = checkCudaErrors(driver.cuModuleGetFunction(module, b"saxpy"))
-```
-
-Next, get all your data prepared and transferred to the GPU. For increased
-application performance, you can input data on the device to eliminate data
-transfers. For completeness, this example shows how you would transfer data to
-and from the device.
-
-```python
-NUM_THREADS = 512  # Threads per block
-NUM_BLOCKS = 32768  # Blocks per grid
-
-a = np.array([2.0], dtype=np.float32)
-n = np.array(NUM_THREADS * NUM_BLOCKS, dtype=np.uint32)
-bufferSize = n * a.itemsize
-
-hX = np.random.rand(n).astype(dtype=np.float32)
-hY = np.random.rand(n).astype(dtype=np.float32)
-hOut = np.zeros(n).astype(dtype=np.float32)
-```
-
-With the input data `a`, `x`, and `y` created for the SAXPY transform device,
-resources must be allocated to store the data using `cuMemAlloc`. To allow for
-more overlap between compute and data movement, use the asynchronous function
-`cuMemcpyHtoDAsync`. It returns control to the CPU immediately following command
-execution.
-
-Python doesn't have a natural concept of pointers, yet `cuMemcpyHtoDAsync` expects
-`void*`. This is where we leverage NumPy's data types to retrieve each host data pointer
-by calling `XX.ctypes.data` for the associated XX.
-
-```python
-dXclass = checkCudaErrors(driver.cuMemAlloc(bufferSize))
-dYclass = checkCudaErrors(driver.cuMemAlloc(bufferSize))
-dOutclass = checkCudaErrors(driver.cuMemAlloc(bufferSize))
-
-stream = checkCudaErrors(driver.cuStreamCreate(0))
-
-checkCudaErrors(driver.cuMemcpyHtoDAsync(
-    dXclass, hX.ctypes.data, bufferSize, stream
-))
-checkCudaErrors(driver.cuMemcpyHtoDAsync(
-    dYclass, hY.ctypes.data, bufferSize, stream
-))
-```
-
-With data prep and resources allocation finished, the kernel is ready to be
-launched. To pass the location of the data on the device to the kernel execution
-configuration, you must retrieve the device pointer. In the following code
-example, we call `int(XXclass)` to retrieve the device pointer value for the
-associated XXclass as a Python `int` and wrap it in a `np.array` type.
-
-```python
-dX = np.array([int(dXclass)], dtype=np.uint64)
-dY = np.array([int(dYclass)], dtype=np.uint64)
-dOut = np.array([int(dOutclass)], dtype=np.uint64)
-```
-
-The launch API `cuLaunchKernel` also expects a pointer input for the argument list
-but this time it's of type `void**`. What this means is that our argument list needs to
-be a contiguous array of `void*` elements, where each element is the pointer to a kernel
-argument on either host or device. Since we already prepared each of our arguments into a `np.array` type, the
-construction of our final contiguous array is done by retrieving the `XX.ctypes.data`
-of each kernel argument.
-
-```python
-args = [a, dX, dY, dOut, n]
-args = np.array([arg.ctypes.data for arg in args], dtype=np.uint64)
-```
-
-Now the kernel can be launched:
-
-```python
-checkCudaErrors(driver.cuLaunchKernel(
-    kernel,
-    NUM_BLOCKS,  # grid x dim
-    1,  # grid y dim
-    1,  # grid z dim
-    NUM_THREADS,  # block x dim
-    1,  # block y dim
-    1,  # block z dim
-    0,  # dynamic shared memory
-    stream,  # stream
-    args.ctypes.data,  # kernel arguments
-    0,  # extra (ignore)
-))
-
-checkCudaErrors(driver.cuMemcpyDtoHAsync(
-    hOut.ctypes.data, dOutclass, bufferSize, stream
-))
-checkCudaErrors(driver.cuStreamSynchronize(stream))
-```
-
-The `cuLaunchKernel` function takes the compiled module kernel and execution
-configuration parameters. The device code is launched in the same stream as the
-data transfers. That ensures that the kernel's compute is performed only after
-the data has finished transfer, as all API calls and kernel launches within a
-stream are serialized. After the call to transfer data back to the host is
-executed, `cuStreamSynchronize` is used to halt CPU execution until all operations
-in the designated stream are finished.
-
-```python
-# Assert values are same after running kernel
-hZ = a * hX + hY
-if not np.allclose(hOut, hZ):
-    raise ValueError("Error outside tolerance for host-device vectors")
-```
-
-Perform verification of the data to ensure correctness and finish the code with
-memory clean up.
-
-```python
-checkCudaErrors(driver.cuStreamDestroy(stream))
-checkCudaErrors(driver.cuMemFree(dXclass))
-checkCudaErrors(driver.cuMemFree(dYclass))
-checkCudaErrors(driver.cuMemFree(dOutclass))
-checkCudaErrors(driver.cuModuleUnload(module))
-checkCudaErrors(driver.cuCtxDestroy(context))
-```
-
-## Performance
-
-Performance is a primary driver in targeting GPUs in your application. So, how
-does the above code compare to its C++ version? Table 1 shows that the results
-are nearly identical. [NVIDIA NSight
-Systems](https://developer.nvidia.com/nsight-systems) was used to retrieve
-kernel performance and [CUDA
-Events](https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/)
-was used for application performance.
-
-The following command was used to profile the applications:
-
-```{code-block} shell
-nsys profile -s none -t cuda --stats=true <executable>
-```
-
-```{list-table} Kernel and application performance comparison.
-:header-rows: 1
-
-* -
-  - C++
-  - Python
-* - Kernel execution
-  - 352µs
-  - 352µs
-* - Application execution
-  - 1076ms
-  - 1080ms
-```
-
-`cuda.bindings` is also compatible with [NVIDIA Nsight
-Compute](https://developer.nvidia.com/nsight-compute), which is an
-interactive kernel profiler for CUDA applications. It allows you to have
-detailed insights into kernel performance. This is useful when you're trying to
-maximize performance ({numref}`Figure 1`).
-
-```{figure} _static/images/Nsight-Compute-CLI-625x473.png
-:name: Figure 1
-
-Screenshot of Nsight Compute CLI output of `cuda.bindings` example.
-```
-
-## Preparing kernel arguments
-
-The `cuLaunchKernel` API bindings retain low-level CUDA argument preparation requirements:
-
-* Each kernel argument is a `void*` (i.e. pointer to the argument)
-* `kernelParams` is a `void**` (i.e. pointer to a list of kernel arguments)
-* `kernelParams` arguments are in contiguous memory
-
-These requirements can be met with two different approaches, using either NumPy or ctypes.
-
-### Using NumPy
-
-NumPy [Array objects](https://numpy.org/doc/stable/reference/arrays.html) can be used to fulfill each of these conditions directly.
-
-Let's use the following kernel definition as an example:
-```python
-kernel_string = """\
-typedef struct {
-    int value;
-} testStruct;
-
-extern "C" __global__
-void testkernel(int i, int *pi,
-                float f, float *pf,
-                testStruct s, testStruct *ps)
-{
-    *pi = i;
-    *pf = f;
-    ps->value = s.value;
-}
-"""
-```
-
-The first step is to create array objects with types corresponding to your kernel arguments. Primitive NumPy types have the following corresponding kernel types:
-
-```{list-table} Correspondence between NumPy types and kernel types.
-:header-rows: 1
-
-* - NumPy type
-  - Corresponding kernel types
-  - itemsize (bytes)
-* - bool
-  - bool
-  - 1
-* - int8
-  - char, signed char, int8_t
-  - 1
-* - int16
-  - short, signed short, int16_t
-  - 2
-* - int32
-  - int, signed int, int32_t
-  - 4
-* - int64
-  - long long, signed long long, int64_t
-  - 8
-* - uint8
-  - unsigned char, uint8_t
-  - 1
-* - uint16
-  - unsigned short, uint16_t
-  - 2
-* - uint32
-  - unsigned int, uint32_t
-  - 4
-* - uint64
-  - unsigned long long, uint64_t
-  - 8
-* - float16
-  - half
-  - 2
-* - float32
-  - float
-  - 4
-* - float64
-  - double
-  - 8
-* - complex64
-  - float2, cuFloatComplex, complex<float>
-  - 8
-* - complex128
-  - double2, cuDoubleComplex, complex<double>
-  - 16
-```
-
-Furthermore, custom NumPy types can be used to support both platform-dependent types and user-defined structures as kernel arguments.
-
-This example uses the following types:
-* `int` is `np.uint32`
-* `float` is `np.float32`
-* `int*`, `float*` and `testStruct*` are `np.intp`
-* `testStruct` is a custom user type `np.dtype([("value", np.int32)], align=True)`
-
-Note how all three pointers are `np.intp` since the pointer values are always a representation of an address space.
-
-Putting it all together:
-```python
-# Define a custom type
-testStruct = np.dtype([("value", np.int32)], align=True)
-
-# Allocate device memory
-pInt = checkCudaErrors(cudart.cudaMalloc(np.dtype(np.int32).itemsize))
-pFloat = checkCudaErrors(cudart.cudaMalloc(np.dtype(np.float32).itemsize))
-pStruct = checkCudaErrors(cudart.cudaMalloc(testStruct.itemsize))
-
-# Collect all input kernel arguments into a single tuple for further processing
-kernelValues = (
-    np.array(1, dtype=np.uint32),
-    np.array([pInt], dtype=np.intp),
-    np.array(123.456, dtype=np.float32),
-    np.array([pFloat], dtype=np.intp),
-    np.array([5], testStruct),
-    np.array([pStruct], dtype=np.intp),
-)
-```
-
-The final step is to construct a `kernelParams` argument that fulfills all of the launch API conditions. This is made easy because each array object comes
-with a [ctypes](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.ctypes.html#numpy.ndarray.ctypes) data attribute that returns the underlying `void*` pointer value.
-
-By having the final array object contain all pointers, we fulfill the contiguous array requirement:
-
-```python
-kernelParams = np.array([arg.ctypes.data for arg in kernelValues], dtype=np.intp)
-```
-
-The launch API supports [Buffer Protocol](https://docs.python.org/3/c-api/buffer.html) objects, therefore we can pass the array object directly.
-
-```python
-checkCudaErrors(cuda.cuLaunchKernel(
-    kernel,
-    1, 1, 1,  # grid dim
-    1, 1, 1,  # block dim
-    0, stream,  # shared mem and stream
-    kernelParams=kernelParams,
-    extra=0,
-))
-```
-
-### Using ctypes
-
-The [ctypes](https://docs.python.org/3/library/ctypes.html) approach relaxes the parameter preparation requirement by delegating the contiguous memory requirement to the API launch call.
-
-Let's use the same kernel definition as the previous section for the example.
-
-The ctypes approach treats the `kernelParams` argument as a pair of two tuples: `kernel_values` and `kernel_types`.
-
-* `kernel_values` contain Python values to be used as an input to your kernel
-* `kernel_types` contain the data types that your kernel_values should be converted into
-
-The ctypes [fundamental data types](https://docs.python.org/3/library/ctypes.html#fundamental-data-types) documentation describes the compatibility between different Python types and C types.
-Furthermore, [custom data types](https://docs.python.org/3/library/ctypes.html#calling-functions-with-your-own-custom-data-types) can be used to support kernels with custom types.
-
-For this example the result becomes:
-
-```python
-# Define a custom type
-class testStruct(ctypes.Structure):
-    _fields_ = [("value", ctypes.c_int)]
-
-# Allocate device memory
-pInt = checkCudaErrors(cudart.cudaMalloc(ctypes.sizeof(ctypes.c_int)))
-pFloat = checkCudaErrors(cudart.cudaMalloc(ctypes.sizeof(ctypes.c_float)))
-pStruct = checkCudaErrors(cudart.cudaMalloc(ctypes.sizeof(testStruct)))
-
-# Collect all input kernel arguments into a single tuple for further processing
-kernelValues = (
-    1,
-    pInt,
-    123.456,
-    pFloat,
-    testStruct(5),
-    pStruct,
-)
-kernelTypes = (
-    ctypes.c_int,
-    ctypes.c_void_p,
-    ctypes.c_float,
-    ctypes.c_void_p,
-    None,
-    ctypes.c_void_p,
-)
-```
-
-Values that are set to `None` have a special meaning:
-
-1. The value supports a callable `getPtr` that returns the pointer address of the underlining C object address (e.g. all CUDA C types that are exposed to Python as Python classes)
-2. The value is an instance of `ctypes.Structure`
-3. The value is an `Enum`
-
-In all three cases, the API call will fetch the underlying pointer value and construct a contiguous array with other kernel parameters.
-
-With the setup complete, the kernel can be launched:
-
-```python
-checkCudaErrors(cuda.cuLaunchKernel(
-    kernel,
-    1, 1, 1,  # grid dim
-    1, 1, 1,  # block dim
-    0, stream,  # shared mem and stream
-    kernelParams=(kernelValues, kernelTypes),
-    extra=0,
-))
-```
-
-### CUDA objects
-
-Certain CUDA kernels use native CUDA types as their parameters such as `cudaTextureObject_t`. These types require special handling since they're neither a primitive ctype nor a custom user type. Since `cuda.bindings` exposes each of them as Python classes, they each implement `getPtr()` and `__int__()`. These two callables used to support the NumPy and ctypes approach. The difference between each call is further described under [Tips and Tricks](https://nvidia.github.io/cuda-python/cuda-bindings/latest/tips_and_tricks.html#).
-
-For this example, lets use the `transformKernel` from [examples/0_Introduction/simpleCubemapTexture_test.py](https://github.com/NVIDIA/cuda-python/blob/main/cuda_bindings/examples/0_Introduction/simpleCubemapTexture_test.py):
-
-```python
-simpleCubemapTexture = """\
-extern "C"
-__global__ void transformKernel(float *g_odata, int width, cudaTextureObject_t tex)
-{
-    ...
-}
-"""
-
-def main():
-    ...
-    d_data = checkCudaErrors(cudart.cudaMalloc(size))
-    width = 64
-    tex = checkCudaErrors(cudart.cudaCreateTextureObject(texRes, texDescr, None))
-    ...
-```
-
-For NumPy, we can convert these CUDA types by leveraging the `__int__()` call to fetch the address of the underlying `cudaTextureObject_t` C object and wrapping it in a NumPy object array of type `np.intp`:
-
-```python
-kernelValues = (
-    np.array([d_data], dtype=np.intp),
-    np.array(width, dtype=np.uint32),
-    np.array([int(tex)], dtype=np.intp),
-)
-kernelArgs = np.array([arg.ctypes.data for arg in kernelValues], dtype=np.intp)
-```
-
-For ctypes, we leverage the special handling of `None` type since each Python class already implements `getPtr()`:
-
-```python
-kernelValues = (
-    d_data,
-    width,
-    tex,
-)
-kernelTypes = (
-    ctypes.c_void_p,
-    ctypes.c_int,
-    None,
-)
-kernelArgs = (kernelValues, kernelTypes)
-```
diff --git a/cuda_bindings/docs/source/overview.rst b/cuda_bindings/docs/source/overview.rst
new file mode 100644
index 0000000000..0f32032528
--- /dev/null
+++ b/cuda_bindings/docs/source/overview.rst
@@ -0,0 +1,568 @@
+.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE
+
+Overview
+========
+
+Python plays a key role within the science, engineering, data analytics, and
+deep learning application ecosystem. NVIDIA has long been committed to helping
+the Python ecosystem leverage the accelerated massively parallel performance of
+GPUs to deliver standardized libraries, tools, and applications. Today, we're
+introducing another step towards simplification of the developer experience with
+improved Python code portability and compatibility.
+
+Our goal is to help unify the Python CUDA ecosystem with a single standard set
+of low-level interfaces, providing full coverage and access to the CUDA host
+APIs from Python. We want to provide an ecosystem foundation to allow
+interoperability among different accelerated libraries. Most importantly, it
+should be easy for Python developers to use NVIDIA GPUs.
+
+``cuda.bindings`` workflow
+---------------------------
+
+Because Python is an interpreted language, you need a way to compile the device
+code into
+`PTX <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html>`_ and
+then extract the function to be called at a later point in the application. You
+construct your device code in the form of a string and compile it with
+`NVRTC <http://docs.nvidia.com/cuda/nvrtc/index.html>`_, a runtime compilation
+library for CUDA C++. Using the NVIDIA `Driver
+API <http://docs.nvidia.com/cuda/cuda-driver-api/index.html>`_, manually create a
+CUDA context and all required resources on the GPU, then launch the compiled
+CUDA C++ code and retrieve the results from the GPU. Now that you have an
+overview, jump into a commonly used example for parallel programming:
+`SAXPY <https://developer.nvidia.com/blog/six-ways-saxpy/>`_.
+
+The first thing to do is import the `Driver
+API <https://docs.nvidia.com/cuda/cuda-driver-api/index.html>`_ and
+`NVRTC <https://docs.nvidia.com/cuda/nvrtc/index.html>`_ modules from the ``cuda.bindings``
+package. Next, we consider how to store host data and pass it to the device. Different
+approaches can be used to accomplish this and are described in `Preparing kernel
+arguments <https://nvidia.github.io/cuda-python/cuda-bindings/latest/overview.html#preparing-kernel-arguments>`_.
+In this example, we will use NumPy to store host data and pass it to the device, so let's
+import this dependency as well.
+
+.. code-block:: python
+
+    from cuda.bindings import driver, nvrtc
+    import numpy as np
+
+Error checking is a fundamental best practice when working with low-level interfaces.
+The following code snippet lets us validate each API call and raise exceptions in case of error:
+
+.. code-block:: python
+
+    def _cudaGetErrorEnum(error):
+        if isinstance(error, driver.CUresult):
+            err, name = driver.cuGetErrorName(error)
+            return name if err == driver.CUresult.CUDA_SUCCESS else "<unknown>"
+        elif isinstance(error, nvrtc.nvrtcResult):
+            return nvrtc.nvrtcGetErrorString(error)[1]
+        else:
+            raise RuntimeError('Unknown error type: {}'.format(error))
+
+    def checkCudaErrors(result):
+        if result[0].value:
+            raise RuntimeError("CUDA error code={}({})".format(result[0].value, _cudaGetErrorEnum(result[0])))
+        if len(result) == 1:
+            return None
+        elif len(result) == 2:
+            return result[1]
+        else:
+            return result[1:]
+
+It's common practice to write CUDA kernels near the top of a translation unit,
+so write it next. The entire kernel is wrapped in triple quotes to form a
+string. The string is compiled later using NVRTC. This is the only part of CUDA
+Python that requires some understanding of CUDA C++. For more information, see
+`An Even Easier Introduction to
+CUDA <https://developer.nvidia.com/blog/even-easier-introduction-cuda/>`_.
+
+.. code-block:: python
+
+    saxpy = """\
+    extern "C" __global__
+    void saxpy(float a, float *x, float *y, float *out, size_t n)
+    {
+        size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
+        if (tid < n) {
+            out[tid] = a * x[tid] + y[tid];
+        }
+    }
+    """
+
+Go ahead and compile the kernel into PTX. Remember that this is executed at runtime using NVRTC. There are three basic steps to NVRTC:
+
+- Create a program from the string.
+- Compile the program.
+- Extract PTX from the compiled program.
+
+In the following code example, the Driver API is initialized so that the NVIDIA driver
+and GPU are accessible. Next, the GPU is queried for its compute capability. Finally,
+the program is compiled to target our local compute capability architecture with FMAD disabled:
+
+.. code-block:: python
+
+    # Initialize CUDA Driver API
+    checkCudaErrors(driver.cuInit(0))
+
+    # Retrieve handle for device 0
+    cuDevice = checkCudaErrors(driver.cuDeviceGet(0))
+
+    # Derive target architecture for device 0
+    major = checkCudaErrors(driver.cuDeviceGetAttribute(driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, cuDevice))
+    minor = checkCudaErrors(driver.cuDeviceGetAttribute(driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, cuDevice))
+    arch_arg = bytes(f'--gpu-architecture=compute_{major}{minor}', 'ascii')
+
+    # Create program
+    prog = checkCudaErrors(nvrtc.nvrtcCreateProgram(str.encode(saxpy), b"saxpy.cu", 0, [], []))
+
+    # Compile program
+    opts = [b"--fmad=false", arch_arg]
+    checkCudaErrors(nvrtc.nvrtcCompileProgram(prog, 2, opts))
+
+    # Get PTX from compilation
+    ptxSize = checkCudaErrors(nvrtc.nvrtcGetPTXSize(prog))
+    ptx = b" " * ptxSize
+    checkCudaErrors(nvrtc.nvrtcGetPTX(prog, ptx))
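+
+If compilation fails, ``nvrtcCompileProgram`` returns an error code and the diagnostics end up in the program log. A minimal sketch for retrieving it, using the same buffer pattern as the PTX retrieval above, is:
+
+.. code-block:: python
+
+    # Sketch: fetch the NVRTC program log to diagnose compilation failures.
+    # Call these before checkCudaErrors() raises if you need the diagnostics.
+    logSize = checkCudaErrors(nvrtc.nvrtcGetProgramLogSize(prog))
+    log = b" " * logSize
+    checkCudaErrors(nvrtc.nvrtcGetProgramLog(prog, log))
+    print(log.decode())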
+
+Before you can use the PTX or do any work on the GPU, you must create a CUDA
+context. CUDA contexts are analogous to host processes for the device. In the
+following code example, a handle for compute device 0 is passed to
+``cuCtxCreate`` to designate that GPU for context creation:
+
+.. code-block:: python
+
+    # Create context
+    context = checkCudaErrors(driver.cuCtxCreate(0, cuDevice))
+
+With a CUDA context created on device 0, load the PTX generated earlier into a
+module. A module is analogous to dynamically loaded libraries for the device.
+After loading into the module, extract a specific kernel with
+``cuModuleGetFunction``. It is not uncommon for multiple kernels to reside in PTX:
+
+.. code-block:: python
+
+    # Load PTX as module data and retrieve function
+    ptx = np.char.array(ptx)
+    # Note: Incompatible --gpu-architecture would be detected here
+    module = checkCudaErrors(driver.cuModuleLoadData(ptx.ctypes.data))
+    kernel = checkCudaErrors(driver.cuModuleGetFunction(module, b"saxpy"))
+
+Next, get all your data prepared and transferred to the GPU. For increased
+application performance, you can input data on the device to eliminate data
+transfers. For completeness, this example shows how you would transfer data to
+and from the device:
+
+.. code-block:: python
+
+    NUM_THREADS = 512  # Threads per block
+    NUM_BLOCKS = 32768  # Blocks per grid
+
+    a = np.array([2.0], dtype=np.float32)
+    n = np.array(NUM_THREADS * NUM_BLOCKS, dtype=np.uint32)
+    bufferSize = n * a.itemsize
+
+    hX = np.random.rand(n).astype(dtype=np.float32)
+    hY = np.random.rand(n).astype(dtype=np.float32)
+    hOut = np.zeros(n).astype(dtype=np.float32)
+
+With the input data ``a``, ``x``, and ``y`` created for the SAXPY transform, device
+resources must be allocated to store the data using ``cuMemAlloc``. To allow for
+more overlap between compute and data movement, use the asynchronous function
+``cuMemcpyHtoDAsync``. It returns control to the CPU immediately following command
+execution.
+
+Python doesn't have a natural concept of pointers, yet ``cuMemcpyHtoDAsync`` expects
+``void*``. This is where we leverage NumPy's data types to retrieve each host data pointer
+by calling ``XX.ctypes.data`` for the associated XX:
+
+.. code-block:: python
+
+    dXclass = checkCudaErrors(driver.cuMemAlloc(bufferSize))
+    dYclass = checkCudaErrors(driver.cuMemAlloc(bufferSize))
+    dOutclass = checkCudaErrors(driver.cuMemAlloc(bufferSize))
+
+    stream = checkCudaErrors(driver.cuStreamCreate(0))
+
+    checkCudaErrors(driver.cuMemcpyHtoDAsync(
+        dXclass, hX.ctypes.data, bufferSize, stream
+    ))
+    checkCudaErrors(driver.cuMemcpyHtoDAsync(
+        dYclass, hY.ctypes.data, bufferSize, stream
+    ))
+
+With data preparation and resource allocation finished, the kernel is ready to be
+launched. To pass the location of the data on the device to the kernel execution
+configuration, you must retrieve the device pointer. In the following code
+example, we call ``int(XXclass)`` to retrieve the device pointer value for the
+associated XXclass as a Python ``int`` and wrap it in a ``np.array`` type:
+
+.. code-block:: python
+
+    dX = np.array([int(dXclass)], dtype=np.uint64)
+    dY = np.array([int(dYclass)], dtype=np.uint64)
+    dOut = np.array([int(dOutclass)], dtype=np.uint64)
+
+The launch API ``cuLaunchKernel`` also expects a pointer input for the argument list
+but this time it's of type ``void**``. What this means is that our argument list needs to
+be a contiguous array of ``void*`` elements, where each element is the pointer to a kernel
+argument on either host or device. Since we already prepared each of our arguments into a ``np.array`` type, the
+construction of our final contiguous array is done by retrieving the ``XX.ctypes.data``
+of each kernel argument:
+
+.. code-block:: python
+
+    args = [a, dX, dY, dOut, n]
+    args = np.array([arg.ctypes.data for arg in args], dtype=np.uint64)
+
+Now the kernel can be launched:
+
+.. code-block:: python
+
+    checkCudaErrors(driver.cuLaunchKernel(
+        kernel,
+        NUM_BLOCKS,  # grid x dim
+        1,  # grid y dim
+        1,  # grid z dim
+        NUM_THREADS,  # block x dim
+        1,  # block y dim
+        1,  # block z dim
+        0,  # dynamic shared memory
+        stream,  # stream
+        args.ctypes.data,  # kernel arguments
+        0,  # extra (ignore)
+    ))
+
+    checkCudaErrors(driver.cuMemcpyDtoHAsync(
+        hOut.ctypes.data, dOutclass, bufferSize, stream
+    ))
+    checkCudaErrors(driver.cuStreamSynchronize(stream))
+
+The ``cuLaunchKernel`` function takes the compiled module kernel and execution
+configuration parameters. The device code is launched in the same stream as the
+data transfers. That ensures that the kernel's compute is performed only after
+the data has finished transfer, as all API calls and kernel launches within a
+stream are serialized. After the call to transfer data back to the host is
+executed, ``cuStreamSynchronize`` is used to halt CPU execution until all operations
+in the designated stream are finished:
+
+.. code-block:: python
+
+    # Assert values are same after running kernel
+    hZ = a * hX + hY
+    if not np.allclose(hOut, hZ):
+        raise ValueError("Error outside tolerance for host-device vectors")
+
+Perform verification of the data to ensure correctness and finish the code with
+memory clean up:
+
+.. code-block:: python
+
+    checkCudaErrors(driver.cuStreamDestroy(stream))
+    checkCudaErrors(driver.cuMemFree(dXclass))
+    checkCudaErrors(driver.cuMemFree(dYclass))
+    checkCudaErrors(driver.cuMemFree(dOutclass))
+    checkCudaErrors(driver.cuModuleUnload(module))
+    checkCudaErrors(driver.cuCtxDestroy(context))
+
+Performance
+-----------
+
+Performance is a primary driver in targeting GPUs in your application. So, how
+does the above code compare to its C++ version? Table 1 shows that the results
+are nearly identical. `NVIDIA Nsight
+Systems <https://developer.nvidia.com/nsight-systems>`_ was used to retrieve
+kernel performance and `CUDA
+Events <https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/>`_
+was used for application performance.
+
+The following command was used to profile the applications:
+
+.. code-block:: shell
+
+   nsys profile -s none -t cuda --stats=true <executable>
+
+.. list-table:: Kernel and application performance comparison.
+   :header-rows: 1
+
+   * -
+     - C++
+     - Python
+   * - Kernel execution
+     - 352µs
+     - 352µs
+   * - Application execution
+     - 1076ms
+     - 1080ms
+
+``cuda.bindings`` is also compatible with `NVIDIA Nsight
+Compute <https://developer.nvidia.com/nsight-compute>`_, which is an
+interactive kernel profiler for CUDA applications. It allows you to have
+detailed insights into kernel performance. This is useful when you're trying to
+maximize performance (:numref:`Figure 1`).
+
+.. figure:: _static/images/Nsight-Compute-CLI-625x473.png
+   :name: Figure 1
+
+   Screenshot of Nsight Compute CLI output of ``cuda.bindings`` example.
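+
+The CUDA event timing mentioned above can also be reproduced directly with the driver API. A minimal sketch, reusing ``stream`` and the launch from the walkthrough, might look like:
+
+.. code-block:: python
+
+    # Sketch: time the kernel launch with CUDA events.
+    evStart = checkCudaErrors(driver.cuEventCreate(0))
+    evStop = checkCudaErrors(driver.cuEventCreate(0))
+
+    checkCudaErrors(driver.cuEventRecord(evStart, stream))
+    # ... the cuLaunchKernel call shown earlier goes here ...
+    checkCudaErrors(driver.cuEventRecord(evStop, stream))
+    checkCudaErrors(driver.cuEventSynchronize(evStop))
+
+    ms = checkCudaErrors(driver.cuEventElapsedTime(evStart, evStop))
+    checkCudaErrors(driver.cuEventDestroy(evStart))
+    checkCudaErrors(driver.cuEventDestroy(evStop))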
+
+Preparing kernel arguments
+--------------------------
+
+The ``cuLaunchKernel`` API bindings retain low-level CUDA argument preparation requirements:
+
+* Each kernel argument is a ``void*`` (i.e. pointer to the argument)
+* ``kernelParams`` is a ``void**`` (i.e. pointer to a list of kernel arguments)
+* ``kernelParams`` arguments are in contiguous memory
+
+These requirements can be met with two different approaches, using either NumPy or ctypes.
+
+Using NumPy
+^^^^^^^^^^^
+
+NumPy `Array objects <https://numpy.org/doc/stable/reference/arrays.html>`_ can be used to fulfill each of these conditions directly.
+
+Let's use the following kernel definition as an example:
+
+.. code-block:: python
+
+    kernel_string = """\
+    typedef struct {
+        int value;
+    } testStruct;
+
+    extern "C" __global__
+    void testkernel(int i, int *pi,
+                    float f, float *pf,
+                    testStruct s, testStruct *ps)
+    {
+        *pi = i;
+        *pf = f;
+        ps->value = s.value;
+    }
+    """
+
+The first step is to create array objects with types corresponding to your kernel arguments. Primitive NumPy types have the following corresponding kernel types:
+
+.. list-table:: Correspondence between NumPy types and kernel types.
+   :header-rows: 1
+
+   * - NumPy type
+     - Corresponding kernel types
+     - itemsize (bytes)
+   * - bool
+     - bool
+     - 1
+   * - int8
+     - char, signed char, int8_t
+     - 1
+   * - int16
+     - short, signed short, int16_t
+     - 2
+   * - int32
+     - int, signed int, int32_t
+     - 4
+   * - int64
+     - long long, signed long long, int64_t
+     - 8
+   * - uint8
+     - unsigned char, uint8_t
+     - 1
+   * - uint16
+     - unsigned short, uint16_t
+     - 2
+   * - uint32
+     - unsigned int, uint32_t
+     - 4
+   * - uint64
+     - unsigned long long, uint64_t
+     - 8
+   * - float16
+     - half
+     - 2
+   * - float32
+     - float
+     - 4
+   * - float64
+     - double
+     - 8
+   * - complex64
+     - float2, cuFloatComplex, complex<float>
+     - 8
+   * - complex128
+     - double2, cuDoubleComplex, complex<double>
+     - 16
+
+Furthermore, custom NumPy types can be used to support both platform-dependent types and user-defined structures as kernel arguments.
+
+This example uses the following types:
+
+* ``int`` is ``np.uint32``
+* ``float`` is ``np.float32``
+* ``int*``, ``float*`` and ``testStruct*`` are ``np.intp``
+* ``testStruct`` is a custom user type ``np.dtype([("value", np.int32)], align=True)``
+
+Note how all three pointers are ``np.intp`` since the pointer values are always a representation of an address space.
+
+Putting it all together:
+
+.. code-block:: python
+
+    # Define a custom type
+    testStruct = np.dtype([("value", np.int32)], align=True)
+
+    # Allocate device memory
+    pInt = checkCudaErrors(cudart.cudaMalloc(np.dtype(np.int32).itemsize))
+    pFloat = checkCudaErrors(cudart.cudaMalloc(np.dtype(np.float32).itemsize))
+    pStruct = checkCudaErrors(cudart.cudaMalloc(testStruct.itemsize))
+
+    # Collect all input kernel arguments into a single tuple for further processing
+    kernelValues = (
+        np.array(1, dtype=np.uint32),
+        np.array([pInt], dtype=np.intp),
+        np.array(123.456, dtype=np.float32),
+        np.array([pFloat], dtype=np.intp),
+        np.array([5], testStruct),
+        np.array([pStruct], dtype=np.intp),
+    )
+
+The final step is to construct a ``kernelParams`` argument that fulfills all of the launch API conditions. This is made easy because each array object comes
+with a `ctypes <https://numpy.org/doc/stable/reference/generated/numpy.ndarray.ctypes.html#numpy.ndarray.ctypes>`_ data attribute that returns the underlying ``void*`` pointer value.
+
+By having the final array object contain all pointers, we fulfill the contiguous array requirement:
+
+.. code-block:: python
+
+    kernelParams = np.array([arg.ctypes.data for arg in kernelValues], dtype=np.intp)
+
+The launch API supports `Buffer Protocol <https://docs.python.org/3/c-api/buffer.html>`_ objects, therefore we can pass the array object directly:
+
+.. code-block:: python
+
+    checkCudaErrors(cuda.cuLaunchKernel(
+        kernel,
+        1, 1, 1,  # grid dim
+        1, 1, 1,  # block dim
+        0, stream,  # shared mem and stream
+        kernelParams=kernelParams,
+        extra=0,
+    ))
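+
+To confirm the kernel ran, the outputs can be copied back with a blocking ``cudaMemcpy``. A sketch, assuming the helpers defined earlier:
+
+.. code-block:: python
+
+    # Sketch: read back one value written by testkernel and check it.
+    checkCudaErrors(cudart.cudaDeviceSynchronize())
+    hInt = np.zeros(1, dtype=np.int32)
+    checkCudaErrors(cudart.cudaMemcpy(
+        hInt.ctypes.data, pInt, hInt.nbytes,
+        cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost
+    ))
+    assert hInt[0] == 1  # the value passed as kernel argument i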
+
+In practice you do not build this array yourself: the ctypes approach treats the ``kernelParams`` argument as a pair of tuples, ``kernel_values`` and ``kernel_types``.
+
+* ``kernel_values`` contains the Python values to be used as inputs to your kernel
+* ``kernel_types`` contains the data types that the ``kernel_values`` should be converted to
+
+The ctypes `fundamental data types <https://docs.python.org/3/library/ctypes.html#fundamental-data-types>`_ documentation describes the compatibility between different Python types and C types.
+Furthermore, `custom data types <https://docs.python.org/3/library/ctypes.html#structures-and-unions>`_ can be used to support kernels with custom types.
+
+For this example the result becomes:
+
+.. code-block:: python
+
+    # Define a custom type
+    class testStruct(ctypes.Structure):
+        _fields_ = [("value", ctypes.c_int)]
+
+    # Allocate device memory
+    pInt = checkCudaErrors(cudart.cudaMalloc(ctypes.sizeof(ctypes.c_int)))
+    pFloat = checkCudaErrors(cudart.cudaMalloc(ctypes.sizeof(ctypes.c_float)))
+    pStruct = checkCudaErrors(cudart.cudaMalloc(ctypes.sizeof(testStruct)))
+
+    # Collect all input kernel arguments into a single tuple for further processing
+    kernelValues = (
+        1,
+        pInt,
+        123.456,
+        pFloat,
+        testStruct(5),
+        pStruct,
+    )
+    kernelTypes = (
+        ctypes.c_int,
+        ctypes.c_void_p,
+        ctypes.c_float,
+        ctypes.c_void_p,
+        None,
+        ctypes.c_void_p,
+    )
+
+A type entry of ``None`` has a special meaning: it tells the API that the corresponding value falls into one of three cases:
+
+1. The value supports a callable ``getPtr`` that returns the pointer address of the underlying C object (e.g. all CUDA C types that are exposed to Python as Python classes)
+2. The value is an instance of ``ctypes.Structure``
+3. The value is an ``Enum``
+
+In all three cases, the API call will fetch the underlying pointer value and construct a contiguous array with the other kernel parameters.
+
+With the setup complete, the kernel can be launched:
+
+.. code-block:: python
+
+    checkCudaErrors(cuda.cuLaunchKernel(
+        kernel,
+        1, 1, 1,  # grid dim
+        1, 1, 1,  # block dim
+        0, stream,  # shared mem and stream
+        kernelParams=(kernelValues, kernelTypes),
+        extra=0,
+    ))
+
+CUDA objects
+^^^^^^^^^^^^
+
+Certain CUDA kernels use native CUDA types as their parameters, such as ``cudaTextureObject_t``. These types require special handling since they're neither a primitive ctype nor a custom user type. ``cuda.bindings`` exposes each of them as a Python class implementing both ``getPtr()`` and ``__int__()``; these two callables support the ctypes and NumPy approaches, respectively. The difference between the two calls is further described under :doc:`Tips and Tricks <tips_and_tricks>`.
+
+For this example, let's use the ``transformKernel`` from ``examples/0_Introduction/simpleCubemapTexture_test.py``:
+
+.. code-block:: python
+
+    simpleCubemapTexture = """\
+    extern "C"
+    __global__ void transformKernel(float *g_odata, int width, cudaTextureObject_t tex)
+    {
+        ...
+    }
+    """
+
+    def main():
+        ...
+        d_data = checkCudaErrors(cudart.cudaMalloc(size))
+        width = 64
+        tex = checkCudaErrors(cudart.cudaCreateTextureObject(texRes, texDescr, None))
+        ...
+
+For NumPy, we can convert these CUDA types by leveraging the ``__int__()`` call to fetch the address of the underlying ``cudaTextureObject_t`` C object and wrapping it in a NumPy array of type ``np.intp``:
+
+.. code-block:: python
+
+    kernelValues = (
+        np.array([d_data], dtype=np.intp),
+        np.array(width, dtype=np.uint32),
+        np.array([int(tex)], dtype=np.intp),
+    )
+    kernelArgs = np.array([arg.ctypes.data for arg in kernelValues], dtype=np.intp)
+
+For ctypes, we leverage the special handling of the ``None`` type since each Python class already implements ``getPtr()``:
+
+..
code-block:: python + + kernelValues = ( + d_data, + width, + tex, + ) + kernelTypes = ( + ctypes.c_void_p, + ctypes.c_int, + None, + ) + kernelArgs = (kernelValues, kernelTypes) + diff --git a/cuda_bindings/docs/source/release.rst b/cuda_bindings/docs/source/release.rst index 3f0323ccd6..7082d2b708 100644 --- a/cuda_bindings/docs/source/release.rst +++ b/cuda_bindings/docs/source/release.rst @@ -14,28 +14,28 @@ Release Notes 12.9.2 12.9.1 12.9.0 - 12.8.0 - 12.6.2 - 12.6.1 - 12.6.0 - 12.5.0 - 12.4.0 - 12.3.0 - 12.2.1 - 12.2.0 - 12.1.0 - 12.0.0 + 12.8.0 + 12.6.2 + 12.6.1 + 12.6.0 + 12.5.0 + 12.4.0 + 12.3.0 + 12.2.1 + 12.2.0 + 12.1.0 + 12.0.0 11.8.7 - 11.8.6 - 11.8.5 - 11.8.4 - 11.8.3 - 11.8.2 - 11.8.1 - 11.8.0 - 11.7.1 - 11.7.0 - 11.6.1 - 11.6.0 - 11.5.0 - 11.4.0 + 11.8.6 + 11.8.5 + 11.8.4 + 11.8.3 + 11.8.2 + 11.8.1 + 11.8.0 + 11.7.1 + 11.7.0 + 11.6.1 + 11.6.0 + 11.5.0 + 11.4.0 diff --git a/cuda_bindings/docs/source/release/11.4.0-notes.md b/cuda_bindings/docs/source/release/11.4.0-notes.rst similarity index 73% rename from cuda_bindings/docs/source/release/11.4.0-notes.md rename to cuda_bindings/docs/source/release/11.4.0-notes.rst index 9eaa4eff0a..c019aedd95 100644 --- a/cuda_bindings/docs/source/release/11.4.0-notes.md +++ b/cuda_bindings/docs/source/release/11.4.0-notes.rst @@ -1,18 +1,25 @@ -# CUDA Python 11.4.0 Release notes +.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE + +CUDA Python 11.4.0 Release notes +================================ Released on August 16, 2021 -## Highlights +Highlights +---------- - Initial EA release for CUDA Python - Supports all platforms that CUDA is supported - Supports all CUDA 11.x releases - Low-level CUDA Cython bindings and Python wrappers -## Limitations +Limitations +----------- - Source code release only; Python packages coming in a future release. -### CUDA Functions Not Supported in this Release +CUDA Functions Not Supported in this Release +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - cudaGetTextureReference - cudaGetSurfaceReference diff --git a/cuda_bindings/docs/source/release/11.5.0-notes.md b/cuda_bindings/docs/source/release/11.5.0-notes.rst similarity index 89% rename from cuda_bindings/docs/source/release/11.5.0-notes.md rename to cuda_bindings/docs/source/release/11.5.0-notes.rst index 130cb17d07..17cb02e0ca 100644 --- a/cuda_bindings/docs/source/release/11.5.0-notes.md +++ b/cuda_bindings/docs/source/release/11.5.0-notes.rst @@ -1,8 +1,13 @@ -# CUDA Python 11.5.0 Release notes +.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. 
SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE + +CUDA Python 11.5.0 Release notes +================================ Released on October 18, 2021 -## Highlights +Highlights +---------- - PyPi support - Conda support - GA release for CUDA Python @@ -10,11 +15,13 @@ Released on October 18, 2021 - Supports all CUDA 11.x releases - Low-level CUDA Cython bindings and Python wrappers -## Limitations +Limitations +----------- - Changing default stream not supported; coming in future release -### CUDA Functions Not Supported in this Release +CUDA Functions Not Supported in this Release +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - cudaGetTextureReference - cudaGetSurfaceReference diff --git a/cuda_bindings/docs/source/release/11.6.0-notes.md b/cuda_bindings/docs/source/release/11.6.0-notes.md deleted file mode 100644 index 664da16249..0000000000 --- a/cuda_bindings/docs/source/release/11.6.0-notes.md +++ /dev/null @@ -1,73 +0,0 @@ -# CUDA Python 11.6.0 Release notes - -Released on Januray 12, 2022 - -## Highlights -- Support CUDA Toolkit 11.6 -- Support Profiler APIs -- Support Graphic APIs (EGL, GL, VDPAU) -- Support changing default stream -- Relaxed primitive interoperability - -### Default stream - -Changing default stream to Per-Thread-Default-Stream (PTDS) is done through environment variable before execution: - -```{code-block} shell -export CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=1 -``` - -When set to 1, the default stream is the per-thread default stream. When set to 0, the default stream is the legacy default stream. This defaults to 0, for the legacy default stream. See [Stream Synchronization Behavior](https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html) for an explanation of the legacy and per-thread default streams. - -### Primitive interoperability - -APIs accepting classes that wrap a primitive value are now interoperable with the underlining value. - -Example 1: Structure member handles interoperability. - -```{code-block} python ->>> waitParams = cuda.CUstreamMemOpWaitValueParams_st() ->>> waitParams.value64 = 1 ->>> waitParams.value64 - ->>> waitParams.value64 = cuda.cuuint64_t(2) ->>> waitParams.value64 - -``` - -Example 2: Function signature handles interoperability. - -```{code-block} python ->>> cudart.cudaStreamQuery(cudart.cudaStreamNonBlocking) -(,) ->>> cudart.cudaStreamQuery(cudart.cudaStream_t(cudart.cudaStreamNonBlocking)) -(,) -``` - -## Limitations - -### CUDA Functions Not Supported in this Release - -- Symbol APIs - - cudaGraphExecMemcpyNodeSetParamsFromSymbol - - cudaGraphExecMemcpyNodeSetParamsToSymbol - - cudaGraphAddMemcpyNodeToSymbol - - cudaGraphAddMemcpyNodeFromSymbol - - cudaGraphMemcpyNodeSetParamsToSymbol - - cudaGraphMemcpyNodeSetParamsFromSymbol - - cudaMemcpyToSymbol - - cudaMemcpyFromSymbol - - cudaMemcpyToSymbolAsync - - cudaMemcpyFromSymbolAsync - - cudaGetSymbolAddress - - cudaGetSymbolSize - - cudaGetFuncBySymbol -- Launch Options - - cudaLaunchKernel - - cudaLaunchCooperativeKernel - - cudaLaunchCooperativeKernelMultiDevice -- cudaSetValidDevices -- cudaVDPAUSetVDPAUDevice - -```{note} Deprecated APIs are removed from tracking -``` diff --git a/cuda_bindings/docs/source/release/11.6.0-notes.rst b/cuda_bindings/docs/source/release/11.6.0-notes.rst new file mode 100644 index 0000000000..d7907df847 --- /dev/null +++ b/cuda_bindings/docs/source/release/11.6.0-notes.rst @@ -0,0 +1,82 @@ +.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. 
SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE + +CUDA Python 11.6.0 Release notes +================================ + +Released on Januray 12, 2022 + +Highlights +---------- +- Support CUDA Toolkit 11.6 +- Support Profiler APIs +- Support Graphic APIs (EGL, GL, VDPAU) +- Support changing default stream +- Relaxed primitive interoperability + +Default stream +^^^^^^^^^^^^^^ + +Changing default stream to Per-Thread-Default-Stream (PTDS) is done through environment variable before execution: + +.. code-block:: shell + + export CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=1 + +When set to 1, the default stream is the per-thread default stream. When set to 0, the default stream is the legacy default stream. This defaults to 0, for the legacy default stream. See `Stream Synchronization Behavior `_ for an explanation of the legacy and per-thread default streams. + +Primitive interoperability +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +APIs accepting classes that wrap a primitive value are now interoperable with the underlining value. + +Example 1: Structure member handles interoperability. + +.. code-block:: python + + >>> waitParams = cuda.CUstreamMemOpWaitValueParams_st() + >>> waitParams.value64 = 1 + >>> waitParams.value64 + + >>> waitParams.value64 = cuda.cuuint64_t(2) + >>> waitParams.value64 + + +Example 2: Function signature handles interoperability. + +.. code-block:: python + + >>> cudart.cudaStreamQuery(cudart.cudaStreamNonBlocking) + (,) + >>> cudart.cudaStreamQuery(cudart.cudaStream_t(cudart.cudaStreamNonBlocking)) + (,) + +Limitations +----------- + +CUDA Functions Not Supported in this Release +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- Symbol APIs + - cudaGraphExecMemcpyNodeSetParamsFromSymbol + - cudaGraphExecMemcpyNodeSetParamsToSymbol + - cudaGraphAddMemcpyNodeToSymbol + - cudaGraphAddMemcpyNodeFromSymbol + - cudaGraphMemcpyNodeSetParamsToSymbol + - cudaGraphMemcpyNodeSetParamsFromSymbol + - cudaMemcpyToSymbol + - cudaMemcpyFromSymbol + - cudaMemcpyToSymbolAsync + - cudaMemcpyFromSymbolAsync + - cudaGetSymbolAddress + - cudaGetSymbolSize + - cudaGetFuncBySymbol +- Launch Options + - cudaLaunchKernel + - cudaLaunchCooperativeKernel + - cudaLaunchCooperativeKernelMultiDevice +- cudaSetValidDevices +- cudaVDPAUSetVDPAUDevice + +.. note:: Deprecated APIs are removed from tracking + diff --git a/cuda_bindings/docs/source/release/11.6.1-notes.md b/cuda_bindings/docs/source/release/11.6.1-notes.rst similarity index 66% rename from cuda_bindings/docs/source/release/11.6.1-notes.md rename to cuda_bindings/docs/source/release/11.6.1-notes.rst index ddd6ff5101..f136c94224 100644 --- a/cuda_bindings/docs/source/release/11.6.1-notes.md +++ b/cuda_bindings/docs/source/release/11.6.1-notes.rst @@ -1,13 +1,20 @@ -# CUDA Python 11.6.1 Release notes +.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. 
SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE + +CUDA Python 11.6.1 Release notes +================================ Released on March 18, 2022 -## Highlights +Highlights +---------- - Fix string decomposition for WSL library load -## Limitations +Limitations +----------- -### CUDA Functions Not Supported in this Release +CUDA Functions Not Supported in this Release +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Symbol APIs - cudaGraphExecMemcpyNodeSetParamsFromSymbol diff --git a/cuda_bindings/docs/source/release/11.7.0-notes.md b/cuda_bindings/docs/source/release/11.7.0-notes.rst similarity index 65% rename from cuda_bindings/docs/source/release/11.7.0-notes.md rename to cuda_bindings/docs/source/release/11.7.0-notes.rst index 22500c7a23..1f850c4283 100644 --- a/cuda_bindings/docs/source/release/11.7.0-notes.md +++ b/cuda_bindings/docs/source/release/11.7.0-notes.rst @@ -1,13 +1,20 @@ -# CUDA Python 11.7.0 Release notes +.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE + +CUDA Python 11.7.0 Release notes +================================ Released on May 11, 2022 -## Highlights +Highlights +---------- - Support CUDA Toolkit 11.7 -## Limitations +Limitations +----------- -### CUDA Functions Not Supported in this Release +CUDA Functions Not Supported in this Release +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Symbol APIs - cudaGraphExecMemcpyNodeSetParamsFromSymbol diff --git a/cuda_bindings/docs/source/release/11.7.1-notes.md b/cuda_bindings/docs/source/release/11.7.1-notes.rst similarity index 64% rename from cuda_bindings/docs/source/release/11.7.1-notes.md rename to cuda_bindings/docs/source/release/11.7.1-notes.rst index 2997c9da56..0fbea248e3 100644 --- a/cuda_bindings/docs/source/release/11.7.1-notes.md +++ b/cuda_bindings/docs/source/release/11.7.1-notes.rst @@ -1,20 +1,27 @@ -# CUDA Python 11.7.1 Release notes +.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE + +CUDA Python 11.7.1 Release notes +================================ Released on June 29, 2022 -## Highlights +Highlights +---------- - Fix error propagation in CUDA Runtime bindings -- Resolves [issue #22](https://github.com/NVIDIA/cuda-python/issues/22) +- Resolves `issue #22 `_ -## Limitations +Limitations +----------- -### Source builds +Source builds +^^^^^^^^^^^^^ CUDA Python no longer re-declares CUDA types, instead it uses the types from CUDA C headers. As such source builds now need to access to latest CTK headers. In particular: 1. "$CUDA_HOME/include" has latest CTK headers 2. CTK headers have all types defined -(2) Certain CUDA types are not declared on mobile platforms and may face a "has not been declared" error during source builds. A temporary workaround is to use the headers found in [https://gitlab.com/nvidia/headers/cuda](https://gitlab.com/nvidia/headers/cuda). In particular CUDA Python needs the following headers and their dependencies: +(2) Certain CUDA types are not declared on mobile platforms and may face a "has not been declared" error during source builds. A temporary workaround is to use the headers found in `https://gitlab.com/nvidia/headers/cuda `_. 
In particular CUDA Python needs the following headers and their dependencies: - cuda.h - cudaProfiler.h - driver_types.h @@ -23,7 +30,8 @@ CUDA Python no longer re-declares CUDA types, instead it uses the types from CUD This a short-term limitation and will be relaxed in a future release. -### CUDA Functions Not Supported in this Release +CUDA Functions Not Supported in this Release +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Symbol APIs - cudaGraphExecMemcpyNodeSetParamsFromSymbol diff --git a/cuda_bindings/docs/source/release/11.8.0-notes.md b/cuda_bindings/docs/source/release/11.8.0-notes.rst similarity index 62% rename from cuda_bindings/docs/source/release/11.8.0-notes.md rename to cuda_bindings/docs/source/release/11.8.0-notes.rst index c5bf9f71c3..e24022142d 100644 --- a/cuda_bindings/docs/source/release/11.8.0-notes.md +++ b/cuda_bindings/docs/source/release/11.8.0-notes.rst @@ -1,22 +1,30 @@ -# CUDA Python 11.8.0 Release notes +.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE + +CUDA Python 11.8.0 Release notes +================================ Released on October 3, 2022 -## Highlights +Highlights +---------- - Support CUDA Toolkit 11.8 - Source builds allow for missing types and APIs - Resolves source builds for mobile platforms -- Resolves [issue #24](https://github.com/NVIDIA/cuda-python/issues/24) +- Resolves `issue #24 `_ -### Source Builds +Source Builds +^^^^^^^^^^^^^ -CUDA Python source builds now parse CUDA headers located in $CUDA_HOME directory, enabling/disabling types and APIs if defined. Therefore this removes the need for CTK headers to have all types defined. By allowing minor variations, previous [11.7.1 mobile platform workaround](https://nvidia.github.io/cuda-python/release/11.7.1-notes.html#source-builds) is no longer needed. +CUDA Python source builds now parse CUDA headers located in $CUDA_HOME directory, enabling/disabling types and APIs if defined. Therefore this removes the need for CTK headers to have all types defined. By allowing minor variations, previous `11.7.1 mobile platform workaround `_ is no longer needed. It's still required that source builds use the latest CTK headers (i.e. “$CUDA_HOME/include” has latest CTK headers). -## Limitations +Limitations +----------- -### CUDA Functions Not Supported in this Release +CUDA Functions Not Supported in this Release +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Symbol APIs - cudaGraphExecMemcpyNodeSetParamsFromSymbol diff --git a/cuda_bindings/docs/source/release/11.8.1-notes.md b/cuda_bindings/docs/source/release/11.8.1-notes.rst similarity index 61% rename from cuda_bindings/docs/source/release/11.8.1-notes.md rename to cuda_bindings/docs/source/release/11.8.1-notes.rst index f7c2e7d450..0df23c9299 100644 --- a/cuda_bindings/docs/source/release/11.8.1-notes.md +++ b/cuda_bindings/docs/source/release/11.8.1-notes.rst @@ -1,14 +1,21 @@ -# CUDA Python 11.8.1 Release notes +.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. 
SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE + +CUDA Python 11.8.1 Release notes +================================ Released on November 4, 2022 -## Highlights -- Resolves [issue #27](https://github.com/NVIDIA/cuda-python/issues/27) +Highlights +---------- +- Resolves `issue #27 `_ - Update install instructions to use latest CTK -## Limitations +Limitations +----------- -### CUDA Functions Not Supported in this Release +CUDA Functions Not Supported in this Release +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Symbol APIs - cudaGraphExecMemcpyNodeSetParamsFromSymbol diff --git a/cuda_bindings/docs/source/release/11.8.2-notes.md b/cuda_bindings/docs/source/release/11.8.2-notes.rst similarity index 65% rename from cuda_bindings/docs/source/release/11.8.2-notes.md rename to cuda_bindings/docs/source/release/11.8.2-notes.rst index f9d1655652..ec9f0324e5 100644 --- a/cuda_bindings/docs/source/release/11.8.2-notes.md +++ b/cuda_bindings/docs/source/release/11.8.2-notes.rst @@ -1,13 +1,20 @@ -# CUDA Python 11.8.2 Release notes +.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE + +CUDA Python 11.8.2 Release notes +================================ Released on May 18, 2023 -## Highlights +Highlights +---------- - Open libcuda.so.1 instead of libcuda.so -## Limitations +Limitations +----------- -### CUDA Functions Not Supported in this Release +CUDA Functions Not Supported in this Release +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Symbol APIs - cudaGraphExecMemcpyNodeSetParamsFromSymbol diff --git a/cuda_bindings/docs/source/release/11.8.3-notes.md b/cuda_bindings/docs/source/release/11.8.3-notes.rst similarity index 67% rename from cuda_bindings/docs/source/release/11.8.3-notes.md rename to cuda_bindings/docs/source/release/11.8.3-notes.rst index a8ff840c1e..806f5eb1b3 100644 --- a/cuda_bindings/docs/source/release/11.8.3-notes.md +++ b/cuda_bindings/docs/source/release/11.8.3-notes.rst @@ -1,15 +1,22 @@ -# CUDA Python 11.8.3 Release notes +.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. 
SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE + +CUDA Python 11.8.3 Release notes +================================ Released on October 23, 2023 -## Highlights +Highlights +---------- - Compatability with Cython 3 - New API cudart.getLocalRuntimeVersion() - Modernize build config -## Limitations +Limitations +----------- -### CUDA Functions Not Supported in this Release +CUDA Functions Not Supported in this Release +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Symbol APIs - cudaGraphExecMemcpyNodeSetParamsFromSymbol diff --git a/cuda_bindings/docs/source/release/11.8.4-notes.md b/cuda_bindings/docs/source/release/11.8.4-notes.md deleted file mode 100644 index 13767998f0..0000000000 --- a/cuda_bindings/docs/source/release/11.8.4-notes.md +++ /dev/null @@ -1,54 +0,0 @@ -# CUDA Python 11.8.4 Release notes - -Released on October 7, 2024 - -## Highlights -- Resolve [Issue #89](https://github.com/NVIDIA/cuda-python/issues/89): Fix getLocalRuntimeVersion searching for wrong libcudart version -- Resolve [Issue #90](https://github.com/NVIDIA/cuda-python/issues/90): Use new layout in preperation for cuda-python becoming a metapackage - -## CUDA namespace cleanup with a new module layout - -[Issue #75](https://github.com/NVIDIA/cuda-python/issues/75) explains in detail what the new module layout is, what problem it fixes and how it impacts the users. However for the sake of completeness, this release notes will highlight key points of this change. - -Before this change, `cuda-python` was tightly coupled to CUDA Toolkit releases and all new features would inherit this coupling regardless of their applicability. As we develop new features, this coupling was becoming overly restrictive and motivated a new solution: Convert `cuda-python` into a metapackage where we use `cuda` as a namespace with existing bindings code moved to a `cuda_bindings` subpackage. - -This patch release applies the new module layout for the bindings as follows: -- `cuda.cuda` -> `cuda.bindings.driver` -- `cuda.ccuda` -> `cuda.bindings.cydriver` -- `cuda.cudart` -> `cuda.bindings.runtime` -- `cuda.ccudart` -> `cuda.bindings.cyruntime` -- `cuda.nvrtc` -> `cuda.bindings.nvrtc` -- `cuda.cnvrtc` -> `cuda.bindings.cynvrtc` - -Deprecation warnings are turned on as a notice to switch to the new module layout. - -```{note} This is non-breaking, backwards compatible change. All old module path will continue work as they "forward" user calls towards the new layout. -``` - -## Limitations - -### Know issues -- [Issue #215](https://github.com/NVIDIA/cuda-python/issues/215) - -### CUDA Functions Not Supported in this Release - -- Symbol APIs - - cudaGraphExecMemcpyNodeSetParamsFromSymbol - - cudaGraphExecMemcpyNodeSetParamsToSymbol - - cudaGraphAddMemcpyNodeToSymbol - - cudaGraphAddMemcpyNodeFromSymbol - - cudaGraphMemcpyNodeSetParamsToSymbol - - cudaGraphMemcpyNodeSetParamsFromSymbol - - cudaMemcpyToSymbol - - cudaMemcpyFromSymbol - - cudaMemcpyToSymbolAsync - - cudaMemcpyFromSymbolAsync - - cudaGetSymbolAddress - - cudaGetSymbolSize - - cudaGetFuncBySymbol -- Launch Options - - cudaLaunchKernel - - cudaLaunchCooperativeKernel - - cudaLaunchCooperativeKernelMultiDevice -- cudaSetValidDevices -- cudaVDPAUSetVDPAUDevice diff --git a/cuda_bindings/docs/source/release/11.8.4-notes.rst b/cuda_bindings/docs/source/release/11.8.4-notes.rst new file mode 100644 index 0000000000..6bafd0b63c --- /dev/null +++ b/cuda_bindings/docs/source/release/11.8.4-notes.rst @@ -0,0 +1,62 @@ +.. 
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE + +CUDA Python 11.8.4 Release notes +================================ + +Released on October 7, 2024 + +Highlights +---------- +- Resolve `Issue #89 `_: Fix getLocalRuntimeVersion searching for wrong libcudart version +- Resolve `Issue #90 `_: Use new layout in preperation for cuda-python becoming a metapackage + +CUDA namespace cleanup with a new module layout +----------------------------------------------- + +`Issue #75 `_ explains in detail what the new module layout is, what problem it fixes and how it impacts the users. However for the sake of completeness, this release notes will highlight key points of this change. + +Before this change, ``cuda-python`` was tightly coupled to CUDA Toolkit releases and all new features would inherit this coupling regardless of their applicability. As we develop new features, this coupling was becoming overly restrictive and motivated a new solution: Convert ``cuda-python`` into a metapackage where we use ``cuda`` as a namespace with existing bindings code moved to a ``cuda_bindings`` subpackage. + +This patch release applies the new module layout for the bindings as follows: +- ``cuda.cuda`` -> ``cuda.bindings.driver`` +- ``cuda.ccuda`` -> ``cuda.bindings.cydriver`` +- ``cuda.cudart`` -> ``cuda.bindings.runtime`` +- ``cuda.ccudart`` -> ``cuda.bindings.cyruntime`` +- ``cuda.nvrtc`` -> ``cuda.bindings.nvrtc`` +- ``cuda.cnvrtc`` -> ``cuda.bindings.cynvrtc`` + +Deprecation warnings are turned on as a notice to switch to the new module layout. + +.. note:: This is non-breaking, backwards compatible change. All old module path will continue work as they "forward" user calls towards the new layout. + +Limitations +----------- + +Know issues +^^^^^^^^^^^ +- `Issue #215 `_ + +CUDA Functions Not Supported in this Release +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- Symbol APIs + - cudaGraphExecMemcpyNodeSetParamsFromSymbol + - cudaGraphExecMemcpyNodeSetParamsToSymbol + - cudaGraphAddMemcpyNodeToSymbol + - cudaGraphAddMemcpyNodeFromSymbol + - cudaGraphMemcpyNodeSetParamsToSymbol + - cudaGraphMemcpyNodeSetParamsFromSymbol + - cudaMemcpyToSymbol + - cudaMemcpyFromSymbol + - cudaMemcpyToSymbolAsync + - cudaMemcpyFromSymbolAsync + - cudaGetSymbolAddress + - cudaGetSymbolSize + - cudaGetFuncBySymbol +- Launch Options + - cudaLaunchKernel + - cudaLaunchCooperativeKernel + - cudaLaunchCooperativeKernelMultiDevice +- cudaSetValidDevices +- cudaVDPAUSetVDPAUDevice diff --git a/cuda_bindings/docs/source/release/11.8.5-notes.md b/cuda_bindings/docs/source/release/11.8.5-notes.rst similarity index 53% rename from cuda_bindings/docs/source/release/11.8.5-notes.md rename to cuda_bindings/docs/source/release/11.8.5-notes.rst index 37498b115f..7580d468b4 100644 --- a/cuda_bindings/docs/source/release/11.8.5-notes.md +++ b/cuda_bindings/docs/source/release/11.8.5-notes.rst @@ -1,15 +1,21 @@ -# CUDA Python 11.8.5 Release notes +.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE -Released on November 5, 2024. Post 1 rebuild released on November 12, 2024. 
+CUDA Python 11.8.5 Release notes
+================================
-## Highlights
-- Resolve [Issue #215](https://github.com/NVIDIA/cuda-python/issues/215): module `cuda.ccudart` has no attribute `__pyx_capi__`
-- Resolve [Issue #226](https://github.com/NVIDIA/cuda-python/issues/226): top-level Cython source files not packaged
+Released on November 5, 2024. Post 1 rebuild released on November 12, 2024.
+
+Highlights
+----------
+- Resolve `Issue #215 <https://github.com/NVIDIA/cuda-python/issues/215>`_: module ``cuda.ccudart`` has no attribute ``__pyx_capi__``
+- Resolve `Issue #226 <https://github.com/NVIDIA/cuda-python/issues/226>`_: top-level Cython source files not packaged
-## Limitations
+Limitations
+-----------
-### CUDA Functions Not Supported in this Release
+CUDA Functions Not Supported in this Release
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 - Symbol APIs
   - cudaGraphExecMemcpyNodeSetParamsFromSymbol
diff --git a/cuda_bindings/docs/source/release/11.8.6-notes.md b/cuda_bindings/docs/source/release/11.8.6-notes.md
deleted file mode 100644
index cdbc82e3d2..0000000000
--- a/cuda_bindings/docs/source/release/11.8.6-notes.md
+++ /dev/null
@@ -1,29 +0,0 @@
-# `cuda-bindings` 11.8.6 Release notes
-
-Released on January 24, 2025.
-
-
-## Highlights
-
-- Support Python 3.13
-- Add an optional dependency on the CUDA NVRTC wheel
-- Enable discovery and loading of shared libraries from CUDA wheels
-- `cuda-python` is now a meta package, currently depending only on `cuda-bindings` ([see RFC](https://github.com/NVIDIA/cuda-python/issues/105))
-
-
-## Wheels support for optional dependencies
-
-Optional dependencies are added for packages:
-
-- nvidia-cuda-nvrtc-cu12
-
-Installing these dependencies with `cuda-python` can be done using:
-```{code-block} shell
-pip install cuda-python[all]
-```
-Same applies to `cuda-bindings`.
-
-
-## Discovery and loading of shared library dependencies from wheels
-
-Shared library search paths for wheel builds are now extended to check site-packages. This allows `cuda-python`/`cuda-bindings` to seamlessly use the aforementioned CUDA Toolkit wheels installed in the user's Python environment.
diff --git a/cuda_bindings/docs/source/release/11.8.6-notes.rst b/cuda_bindings/docs/source/release/11.8.6-notes.rst
new file mode 100644
index 0000000000..9ab6db2d50
--- /dev/null
+++ b/cuda_bindings/docs/source/release/11.8.6-notes.rst
@@ -0,0 +1,35 @@
+.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE
+
+``cuda-bindings`` 11.8.6 Release notes
+======================================
+
+Released on January 24, 2025.
+
+Highlights
+----------
+
+- Support Python 3.13
+- Add an optional dependency on the CUDA NVRTC wheel
+- Enable discovery and loading of shared libraries from CUDA wheels
+- ``cuda-python`` is now a meta package, currently depending only on ``cuda-bindings`` (`see RFC <https://github.com/NVIDIA/cuda-python/issues/105>`_)
+
+Wheels support for optional dependencies
+----------------------------------------
+
+Optional dependencies are added for packages:
+
+- nvidia-cuda-nvrtc-cu12
+
+Installing these dependencies with ``cuda-python`` can be done using:
+
+.. code-block:: shell
+
+   pip install cuda-python[all]
+
+Same applies to ``cuda-bindings``.
+
+Discovery and loading of shared library dependencies from wheels
+----------------------------------------------------------------
+
+Shared library search paths for wheel builds are now extended to check site-packages.
This allows ``cuda-python``/``cuda-bindings`` to seamlessly use the aforementioned CUDA Toolkit wheels installed in the user's Python environment. diff --git a/cuda_bindings/docs/source/release/12.0.0-notes.md b/cuda_bindings/docs/source/release/12.0.0-notes.rst similarity index 57% rename from cuda_bindings/docs/source/release/12.0.0-notes.md rename to cuda_bindings/docs/source/release/12.0.0-notes.rst index 9f2ae25871..b61741a24c 100644 --- a/cuda_bindings/docs/source/release/12.0.0-notes.md +++ b/cuda_bindings/docs/source/release/12.0.0-notes.rst @@ -1,15 +1,22 @@ -# CUDA Python 12.0.0 Release notes +.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE + +CUDA Python 12.0.0 Release notes +================================ Released on December 8, 2022 -## Highlights +Highlights +---------- - Rebase to CUDA Toolkit 12.0 -- Fix example from [MR28](https://github.com/NVIDIA/cuda-python/pull/28) -- Apply [MR35](https://github.com/NVIDIA/cuda-python/pull/35) +- Fix example from `MR28 `_ +- Apply `MR35 `_ -## Limitations +Limitations +----------- -### CUDA Functions Not Supported in this Release +CUDA Functions Not Supported in this Release +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Symbol APIs - cudaGraphExecMemcpyNodeSetParamsFromSymbol diff --git a/cuda_bindings/docs/source/release/12.1.0-notes.md b/cuda_bindings/docs/source/release/12.1.0-notes.rst similarity index 51% rename from cuda_bindings/docs/source/release/12.1.0-notes.md rename to cuda_bindings/docs/source/release/12.1.0-notes.rst index 94310bb513..161b4596cb 100644 --- a/cuda_bindings/docs/source/release/12.1.0-notes.md +++ b/cuda_bindings/docs/source/release/12.1.0-notes.rst @@ -1,16 +1,23 @@ -# CUDA Python 12.1.0 Release notes +.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE + +CUDA Python 12.1.0 Release notes +================================ Released on February 28, 2023 -## Highlights +Highlights +---------- - Rebase to CUDA Toolkit 12.1 -- Resolve [Issue #41](https://github.com/NVIDIA/cuda-python/issues/41): Add support for Python 3.11 -- Resolve [Issue #42](https://github.com/NVIDIA/cuda-python/issues/42): Dropping Python 3.7 -- Resolve [Issue #43](https://github.com/NVIDIA/cuda-python/issues/43): Trim Conda package dependencies +- Resolve `Issue #41 `_: Add support for Python 3.11 +- Resolve `Issue #42 `_: Dropping Python 3.7 +- Resolve `Issue #43 `_: Trim Conda package dependencies -## Limitations +Limitations +----------- -### CUDA Functions Not Supported in this Release +CUDA Functions Not Supported in this Release +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Symbol APIs - cudaGraphExecMemcpyNodeSetParamsFromSymbol diff --git a/cuda_bindings/docs/source/release/12.2.0-notes.md b/cuda_bindings/docs/source/release/12.2.0-notes.rst similarity index 53% rename from cuda_bindings/docs/source/release/12.2.0-notes.md rename to cuda_bindings/docs/source/release/12.2.0-notes.rst index 39e37b9a8d..796aaa1e52 100644 --- a/cuda_bindings/docs/source/release/12.2.0-notes.md +++ b/cuda_bindings/docs/source/release/12.2.0-notes.rst @@ -1,15 +1,22 @@ -# CUDA Python 12.2.0 Release notes +.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. 
SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE + +CUDA Python 12.2.0 Release notes +================================ Released on June 28, 2023 -## Highlights +Highlights +---------- - Rebase to CUDA Toolkit 12.2 -- Resolve [Issue #44](https://github.com/NVIDIA/cuda-python/issues/44): nogil must be at the end of the function signature line -- Resolve [Issue #45](https://github.com/NVIDIA/cuda-python/issues/45): Error with pyparsing when no CUDA is found +- Resolve `Issue #44 `_: nogil must be at the end of the function signature line +- Resolve `Issue #45 `_: Error with pyparsing when no CUDA is found -## Limitations +Limitations +----------- -### CUDA Functions Not Supported in this Release +CUDA Functions Not Supported in this Release +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Symbol APIs - cudaGraphExecMemcpyNodeSetParamsFromSymbol diff --git a/cuda_bindings/docs/source/release/12.2.1-notes.md b/cuda_bindings/docs/source/release/12.2.1-notes.rst similarity index 65% rename from cuda_bindings/docs/source/release/12.2.1-notes.md rename to cuda_bindings/docs/source/release/12.2.1-notes.rst index 3a89af85c2..3ccacdd30e 100644 --- a/cuda_bindings/docs/source/release/12.2.1-notes.md +++ b/cuda_bindings/docs/source/release/12.2.1-notes.rst @@ -1,13 +1,20 @@ -# CUDA Python 12.2.1 Release notes +.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE + +CUDA Python 12.2.1 Release notes +================================ Released on January 8, 2024 -## Highlights +Highlights +---------- - Compatibility with Cython 3 -## Limitations +Limitations +----------- -### CUDA Functions Not Supported in this Release +CUDA Functions Not Supported in this Release +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Symbol APIs - cudaGraphExecMemcpyNodeSetParamsFromSymbol diff --git a/cuda_bindings/docs/source/release/12.3.0-notes.md b/cuda_bindings/docs/source/release/12.3.0-notes.md deleted file mode 100644 index 15bcdb9780..0000000000 --- a/cuda_bindings/docs/source/release/12.3.0-notes.md +++ /dev/null @@ -1,36 +0,0 @@ -# CUDA Python 12.3.0 Release notes - -Released on October 19, 2023 - -## Highlights -- Rebase to CUDA Toolkit 12.3 -- Resolve [Issue #16](https://github.com/NVIDIA/cuda-python/issues/16): cuda.cudart.cudaRuntimeGetVersion() hard-codes the runtime version, rather than querying the runtime - - New API cudart.getLocalRuntimeVersion() -- Resolve [Issue #48](https://github.com/NVIDIA/cuda-python/issues/48): Dropping Python 3.8 -- Resolve [Issue #51](https://github.com/NVIDIA/cuda-python/issues/51): Dropping package releases for ppc64 on PYPI and conda-nvidia channel - -## Limitations - -### CUDA Functions Not Supported in this Release - -- Symbol APIs - - cudaGraphExecMemcpyNodeSetParamsFromSymbol - - cudaGraphExecMemcpyNodeSetParamsToSymbol - - cudaGraphAddMemcpyNodeToSymbol - - cudaGraphAddMemcpyNodeFromSymbol - - cudaGraphMemcpyNodeSetParamsToSymbol - - cudaGraphMemcpyNodeSetParamsFromSymbol - - cudaMemcpyToSymbol - - cudaMemcpyFromSymbol - - cudaMemcpyToSymbolAsync - - cudaMemcpyFromSymbolAsync - - cudaGetSymbolAddress - - cudaGetSymbolSize - - cudaGetFuncBySymbol -- Launch Options - - cudaLaunchKernel - - cudaLaunchCooperativeKernel - - cudaLaunchCooperativeKernelMultiDevice -- cudaSetValidDevices -- cudaVDPAUSetVDPAUDevice -- cudaFuncGetName diff --git a/cuda_bindings/docs/source/release/12.3.0-notes.rst b/cuda_bindings/docs/source/release/12.3.0-notes.rst new file mode 
100644 index 0000000000..0a14aea9e6 --- /dev/null +++ b/cuda_bindings/docs/source/release/12.3.0-notes.rst @@ -0,0 +1,43 @@ +.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE + +CUDA Python 12.3.0 Release notes +================================ + +Released on October 19, 2023 + +Highlights +---------- +- Rebase to CUDA Toolkit 12.3 +- Resolve `Issue #16 `_: cuda.cudart.cudaRuntimeGetVersion() hard-codes the runtime version, rather than querying the runtime + - New API cudart.getLocalRuntimeVersion() +- Resolve `Issue #48 `_: Dropping Python 3.8 +- Resolve `Issue #51 `_: Dropping package releases for ppc64 on PYPI and conda-nvidia channel + +Limitations +----------- + +CUDA Functions Not Supported in this Release +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- Symbol APIs + - cudaGraphExecMemcpyNodeSetParamsFromSymbol + - cudaGraphExecMemcpyNodeSetParamsToSymbol + - cudaGraphAddMemcpyNodeToSymbol + - cudaGraphAddMemcpyNodeFromSymbol + - cudaGraphMemcpyNodeSetParamsToSymbol + - cudaGraphMemcpyNodeSetParamsFromSymbol + - cudaMemcpyToSymbol + - cudaMemcpyFromSymbol + - cudaMemcpyToSymbolAsync + - cudaMemcpyFromSymbolAsync + - cudaGetSymbolAddress + - cudaGetSymbolSize + - cudaGetFuncBySymbol +- Launch Options + - cudaLaunchKernel + - cudaLaunchCooperativeKernel + - cudaLaunchCooperativeKernelMultiDevice +- cudaSetValidDevices +- cudaVDPAUSetVDPAUDevice +- cudaFuncGetName diff --git a/cuda_bindings/docs/source/release/12.4.0-notes.md b/cuda_bindings/docs/source/release/12.4.0-notes.rst similarity index 67% rename from cuda_bindings/docs/source/release/12.4.0-notes.md rename to cuda_bindings/docs/source/release/12.4.0-notes.rst index 191ecc644e..b71a4ce7d4 100644 --- a/cuda_bindings/docs/source/release/12.4.0-notes.md +++ b/cuda_bindings/docs/source/release/12.4.0-notes.rst @@ -1,14 +1,21 @@ -# CUDA Python 12.4.0 Release notes +.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE + +CUDA Python 12.4.0 Release notes +================================ Released on March 5, 2024 -## Highlights +Highlights +---------- - Rebase to CUDA Toolkit 12.4 - Add PyPI/Conda support for Python 12 -## Limitations +Limitations +----------- -### CUDA Functions Not Supported in this Release +CUDA Functions Not Supported in this Release +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Symbol APIs - cudaGraphExecMemcpyNodeSetParamsFromSymbol diff --git a/cuda_bindings/docs/source/release/12.5.0-notes.md b/cuda_bindings/docs/source/release/12.5.0-notes.rst similarity index 60% rename from cuda_bindings/docs/source/release/12.5.0-notes.md rename to cuda_bindings/docs/source/release/12.5.0-notes.rst index b0e527a8a7..0ac6a25ee0 100644 --- a/cuda_bindings/docs/source/release/12.5.0-notes.md +++ b/cuda_bindings/docs/source/release/12.5.0-notes.rst @@ -1,14 +1,21 @@ -# CUDA Python 12.5.0 Release notes +.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. 
SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE + +CUDA Python 12.5.0 Release notes +================================ Released on May 21, 2024 -## Highlights +Highlights +---------- - Rebase to CUDA Toolkit 12.5 -- Resolve [Issue #58](https://github.com/NVIDIA/cuda-python/issues/58): Interop between CUdeviceptr and Runtime +- Resolve `Issue #58 `_: Interop between CUdeviceptr and Runtime -## Limitations +Limitations +----------- -### CUDA Functions Not Supported in this Release +CUDA Functions Not Supported in this Release +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Symbol APIs - cudaGraphExecMemcpyNodeSetParamsFromSymbol diff --git a/cuda_bindings/docs/source/release/12.6.0-notes.md b/cuda_bindings/docs/source/release/12.6.0-notes.rst similarity index 50% rename from cuda_bindings/docs/source/release/12.6.0-notes.md rename to cuda_bindings/docs/source/release/12.6.0-notes.rst index 466e2eec11..9cd5bbff59 100644 --- a/cuda_bindings/docs/source/release/12.6.0-notes.md +++ b/cuda_bindings/docs/source/release/12.6.0-notes.rst @@ -1,16 +1,23 @@ -# CUDA Python 12.6.0 Release notes +.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE + +CUDA Python 12.6.0 Release notes +================================ Released on August 1, 2024 -## Highlights +Highlights +---------- - Rebase to CUDA Toolkit 12.6 -- Resolve [Issue #32](https://github.com/NVIDIA/cuda-python/issues/32): Add 'pywin32' as Windows requirement -- Resolve [Issue #72](https://github.com/NVIDIA/cuda-python/issues/72): Allow both lists and tuples as parameter -- Resolve [Issue #73](https://github.com/NVIDIA/cuda-python/issues/73): Fix 'cuLibraryLoadData' processing of parameters +- Resolve `Issue #32 `_: Add 'pywin32' as Windows requirement +- Resolve `Issue #72 `_: Allow both lists and tuples as parameter +- Resolve `Issue #73 `_: Fix 'cuLibraryLoadData' processing of parameters -## Limitations +Limitations +----------- -### CUDA Functions Not Supported in this Release +CUDA Functions Not Supported in this Release +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Symbol APIs - cudaGraphExecMemcpyNodeSetParamsFromSymbol diff --git a/cuda_bindings/docs/source/release/12.6.1-notes.md b/cuda_bindings/docs/source/release/12.6.1-notes.md deleted file mode 100644 index 360047125e..0000000000 --- a/cuda_bindings/docs/source/release/12.6.1-notes.md +++ /dev/null @@ -1,56 +0,0 @@ -# CUDA Python 12.6.1 Release notes - -Released on October 7, 2024 - -## Highlights -- Resolve [Issue #90](https://github.com/NVIDIA/cuda-python/issues/90): Use new layout in preparation for cuda-python becoming a metapackage -- Resolve [Issue #75](https://github.com/NVIDIA/cuda-python/issues/75): CUDA namespace cleanup - -## CUDA namespace cleanup with a new module layout - -[Issue #75](https://github.com/NVIDIA/cuda-python/issues/75) explains in detail what the new module layout is, what problem it fixes and how it impacts the users. However for the sake of completeness, this release notes will highlight key points of this change. - -Before this change, `cuda-python` was tightly coupled to CUDA Toolkit releases and all new features would inherit this coupling regardless of their applicability. As we develop new features, this coupling was becoming overly restrictive and motivated a new solution: Convert `cuda-python` into a metapackage where we use `cuda` as a namespace with existing bindings code moved to a `cuda_bindings` subpackage. 
- -This patch release applies the new module layout for the bindings as follows: -- `cuda.cuda` -> `cuda.bindings.driver` -- `cuda.ccuda` -> `cuda.bindings.cydriver` -- `cuda.cudart` -> `cuda.bindings.runtime` -- `cuda.ccudart` -> `cuda.bindings.cyruntime` -- `cuda.nvrtc` -> `cuda.bindings.nvrtc` -- `cuda.cnvrtc` -> `cuda.bindings.cynvrtc` - -Deprecation warnings are turned on as a notice to switch to the new module layout. - -```{note} This is non-breaking, backwards compatible change. All old module path will continue work as they "forward" user calls towards the new layout. -``` - -## Limitations - -### Know issues -- [Issue #215](https://github.com/NVIDIA/cuda-python/issues/215) - -### CUDA Functions Not Supported in this Release - -- Symbol APIs - - cudaGraphExecMemcpyNodeSetParamsFromSymbol - - cudaGraphExecMemcpyNodeSetParamsToSymbol - - cudaGraphAddMemcpyNodeToSymbol - - cudaGraphAddMemcpyNodeFromSymbol - - cudaGraphMemcpyNodeSetParamsToSymbol - - cudaGraphMemcpyNodeSetParamsFromSymbol - - cudaMemcpyToSymbol - - cudaMemcpyFromSymbol - - cudaMemcpyToSymbolAsync - - cudaMemcpyFromSymbolAsync - - cudaGetSymbolAddress - - cudaGetSymbolSize - - cudaGetFuncBySymbol -- Launch Options - - cudaLaunchKernel - - cudaLaunchCooperativeKernel - - cudaLaunchCooperativeKernelMultiDevice -- cudaSetValidDevices -- cudaVDPAUSetVDPAUDevice -- cudaFuncGetName -- cudaFuncGetParamInfo diff --git a/cuda_bindings/docs/source/release/12.6.1-notes.rst b/cuda_bindings/docs/source/release/12.6.1-notes.rst new file mode 100644 index 0000000000..2571633445 --- /dev/null +++ b/cuda_bindings/docs/source/release/12.6.1-notes.rst @@ -0,0 +1,64 @@ +.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE + +CUDA Python 12.6.1 Release notes +================================ + +Released on October 7, 2024 + +Highlights +---------- +- Resolve `Issue #90 `_: Use new layout in preparation for cuda-python becoming a metapackage +- Resolve `Issue #75 `_: CUDA namespace cleanup + +CUDA namespace cleanup with a new module layout +----------------------------------------------- + +`Issue #75 `_ explains in detail what the new module layout is, what problem it fixes and how it impacts the users. However for the sake of completeness, this release notes will highlight key points of this change. + +Before this change, ``cuda-python`` was tightly coupled to CUDA Toolkit releases and all new features would inherit this coupling regardless of their applicability. As we develop new features, this coupling was becoming overly restrictive and motivated a new solution: Convert ``cuda-python`` into a metapackage where we use ``cuda`` as a namespace with existing bindings code moved to a ``cuda_bindings`` subpackage. + +This patch release applies the new module layout for the bindings as follows: +- ``cuda.cuda`` -> ``cuda.bindings.driver`` +- ``cuda.ccuda`` -> ``cuda.bindings.cydriver`` +- ``cuda.cudart`` -> ``cuda.bindings.runtime`` +- ``cuda.ccudart`` -> ``cuda.bindings.cyruntime`` +- ``cuda.nvrtc`` -> ``cuda.bindings.nvrtc`` +- ``cuda.cnvrtc`` -> ``cuda.bindings.cynvrtc`` + +Deprecation warnings are turned on as a notice to switch to the new module layout. + +.. note:: This is non-breaking, backwards compatible change. All old module path will continue work as they "forward" user calls towards the new layout. 
+ +Limitations +----------- + +Know issues +^^^^^^^^^^^ +- `Issue #215 `_ + +CUDA Functions Not Supported in this Release +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- Symbol APIs + - cudaGraphExecMemcpyNodeSetParamsFromSymbol + - cudaGraphExecMemcpyNodeSetParamsToSymbol + - cudaGraphAddMemcpyNodeToSymbol + - cudaGraphAddMemcpyNodeFromSymbol + - cudaGraphMemcpyNodeSetParamsToSymbol + - cudaGraphMemcpyNodeSetParamsFromSymbol + - cudaMemcpyToSymbol + - cudaMemcpyFromSymbol + - cudaMemcpyToSymbolAsync + - cudaMemcpyFromSymbolAsync + - cudaGetSymbolAddress + - cudaGetSymbolSize + - cudaGetFuncBySymbol +- Launch Options + - cudaLaunchKernel + - cudaLaunchCooperativeKernel + - cudaLaunchCooperativeKernelMultiDevice +- cudaSetValidDevices +- cudaVDPAUSetVDPAUDevice +- cudaFuncGetName +- cudaFuncGetParamInfo diff --git a/cuda_bindings/docs/source/release/12.6.2-notes.md b/cuda_bindings/docs/source/release/12.6.2-notes.rst similarity index 54% rename from cuda_bindings/docs/source/release/12.6.2-notes.md rename to cuda_bindings/docs/source/release/12.6.2-notes.rst index 938b9f5a61..4ce87dd8b4 100644 --- a/cuda_bindings/docs/source/release/12.6.2-notes.md +++ b/cuda_bindings/docs/source/release/12.6.2-notes.rst @@ -1,15 +1,21 @@ -# CUDA Python 12.6.2 Release notes +.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE -Released on November 5, 2024. Post 1 rebuild released on November 12, 2024. +CUDA Python 12.6.2 Release notes +================================ -## Highlights -- Resolve [Issue #215](https://github.com/NVIDIA/cuda-python/issues/215): module `cuda.ccudart` has no attribute `__pyx_capi__` -- Resolve [Issue #226](https://github.com/NVIDIA/cuda-python/issues/226): top-level Cython source files not packaged +Released on November 5, 2024. Post 1 rebuild released on November 12, 2024. +Highlights +---------- +- Resolve `Issue #215 `_: module ``cuda.ccudart`` has no attribute ``__pyx_capi__`` +- Resolve `Issue #226 `_: top-level Cython source files not packaged -## Limitations +Limitations +----------- -### CUDA Functions Not Supported in this Release +CUDA Functions Not Supported in this Release +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Symbol APIs - cudaGraphExecMemcpyNodeSetParamsFromSymbol diff --git a/cuda_bindings/docs/source/release/12.8.0-notes.md b/cuda_bindings/docs/source/release/12.8.0-notes.md deleted file mode 100644 index c93f2d9df9..0000000000 --- a/cuda_bindings/docs/source/release/12.8.0-notes.md +++ /dev/null @@ -1,36 +0,0 @@ -# `cuda-bindings` 12.8.0 Release notes - -Released on January 24, 2025. - - -## Highlights - -- Support Python 3.13 -- Add bindings for nvJitLink (requires nvJitLink from CUDA 12.3 or above) -- Add optional dependencies on CUDA NVRTC and nvJitLink wheels -- Enable discovery and loading of shared libraries from CUDA wheels -- `cuda-python` is now a meta package, currently depending only on `cuda-bindings` ([see RFC](https://github.com/NVIDIA/cuda-python/issues/105)) - - -## Wheels support for optional dependencies - -Optional dependencies are added for packages: - -- nvidia-cuda-nvrtc-cu12 -- nvidia-nvjitlink-cu12 - -Installing these dependencies with `cuda-python` can be done using: -```{code-block} shell -pip install cuda-python[all] -``` -Same applies to `cuda-bindings`. - - -## Discovery and loading of shared library dependencies from wheels - -Shared library search paths for wheel builds are now extended to check site-packages. 
This allows `cuda-python`/`cuda-bindings` to seamlessly use the aforementioned CUDA Toolkit wheels installed in the user's Python environment.
-
-
-## Known issues
-
-- Updating from older versions (v12.6.2.post1 and below) via `pip install -U cuda-python` might not work. Please do a clean re-installation by uninstalling `pip uninstall -y cuda-python` followed by installing `pip install cuda-python`.
diff --git a/cuda_bindings/docs/source/release/12.8.0-notes.rst b/cuda_bindings/docs/source/release/12.8.0-notes.rst
new file mode 100644
index 0000000000..6c9c951779
--- /dev/null
+++ b/cuda_bindings/docs/source/release/12.8.0-notes.rst
@@ -0,0 +1,42 @@
+.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE
+
+``cuda-bindings`` 12.8.0 Release notes
+======================================
+
+Released on January 24, 2025.
+
+Highlights
+----------
+
+- Support Python 3.13
+- Add bindings for nvJitLink (requires nvJitLink from CUDA 12.3 or above)
+- Add optional dependencies on CUDA NVRTC and nvJitLink wheels
+- Enable discovery and loading of shared libraries from CUDA wheels
+- ``cuda-python`` is now a meta package, currently depending only on ``cuda-bindings`` (`see RFC <https://github.com/NVIDIA/cuda-python/issues/105>`_)
+
+Wheels support for optional dependencies
+----------------------------------------
+
+Optional dependencies are added for packages:
+
+- nvidia-cuda-nvrtc-cu12
+- nvidia-nvjitlink-cu12
+
+Installing these dependencies with ``cuda-python`` can be done using:
+
+.. code-block:: shell
+
+   pip install cuda-python[all]
+
+Same applies to ``cuda-bindings``.
+
+Discovery and loading of shared library dependencies from wheels
+----------------------------------------------------------------
+
+Shared library search paths for wheel builds are now extended to check site-packages. This allows ``cuda-python``/``cuda-bindings`` to seamlessly use the aforementioned CUDA Toolkit wheels installed in the user's Python environment.
+
+Known issues
+------------
+
+- Updating from older versions (v12.6.2.post1 and below) via ``pip install -U cuda-python`` might not work. Please do a clean re-installation by uninstalling (``pip uninstall -y cuda-python``) and then installing (``pip install cuda-python``).
diff --git a/cuda_bindings/docs/source/tips_and_tricks.rst b/cuda_bindings/docs/source/tips_and_tricks.rst
index 97f585f9b4..cc666ca275 100644
--- a/cuda_bindings/docs/source/tips_and_tricks.rst
+++ b/cuda_bindings/docs/source/tips_and_tricks.rst
@@ -7,16 +7,16 @@ Tips and Tricks
 Getting the address of underlying C objects from the low-level bindings
 =======================================================================
-All CUDA C types are exposed to Python as Python classes. For example, the :class:`~cuda.bindings.driver.CUstream` type is exposed as a class with methods :meth:`~cuda.bindings.driver.CUstream.getPtr()` and :meth:`~cuda.bindings.driver.CUstream.__int__()` implemented.
-
-There is an important distinction between the ``getPtr()`` method and the behaviour of ``__int__()``. Since a ``CUstream`` is itself just a pointer, calling ``instance_of_CUstream.getPtr()`` returns the pointer *to* the pointer, instead of the value of the ``CUstream`` C object that is the pointer to the underlying stream handle. ``int(instance_of_CUstream)`` returns the value of the ``CUstream`` converted to a Python int and is the actual address of the underlying handle.
-
 ..
warning:: Using ``int(cuda_obj)`` to retrieve the underlying address of a CUDA object is deprecated and subject to future removal. Please switch to use :func:`~cuda.bindings.utils.get_cuda_native_handle` instead. +All CUDA C types are exposed to Python as Python classes. For example, the :class:`~cuda.bindings.driver.CUstream` type is exposed as a class with methods :meth:`~cuda.bindings.driver.CUstream.getPtr()` and :meth:`~cuda.bindings.driver.CUstream.__int__()` implemented. + +There is an important distinction between the ``getPtr()`` method and the behaviour of ``__int__()``. Since a ``CUstream`` is itself just a pointer, calling ``instance_of_CUstream.getPtr()`` returns the pointer *to* the pointer, instead of the value of the ``CUstream`` C object that is the pointer to the underlying stream handle. ``int(instance_of_CUstream)`` returns the value of the ``CUstream`` converted to a Python int and is the actual address of the underlying handle. + Lifetime management of the CUDA objects ======================================= diff --git a/cuda_core/docs/source/_static/logo-dark-mode.png b/cuda_core/docs/source/_static/logo-dark-mode.png deleted file mode 100644 index 6b005a283b..0000000000 Binary files a/cuda_core/docs/source/_static/logo-dark-mode.png and /dev/null differ diff --git a/cuda_core/docs/source/_static/logo-light-mode.png b/cuda_core/docs/source/_static/logo-light-mode.png deleted file mode 100644 index c07d6848c9..0000000000 Binary files a/cuda_core/docs/source/_static/logo-light-mode.png and /dev/null differ diff --git a/cuda_bindings/docs/source/conduct.md b/cuda_core/docs/source/conduct.rst similarity index 83% rename from cuda_bindings/docs/source/conduct.md rename to cuda_core/docs/source/conduct.rst index 80f5032e86..1c00f5c343 100644 --- a/cuda_bindings/docs/source/conduct.md +++ b/cuda_core/docs/source/conduct.rst @@ -1,10 +1,16 @@ -# Code of Conduct +.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +.. SPDX-License-Identifier: Apache-2.0 -## Overview +Code of Conduct +=============== -Define the code of conduct followed and enforced for the `cuda.bindings` project. +Overview +-------- -## Our Pledge +Define the code of conduct followed and enforced for the ``cuda.core`` project. + +Our Pledge +---------- In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and @@ -13,7 +19,8 @@ size, disability, ethnicity, sex characteristics, gender identity and expression level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation. -## Our Standards +Our Standards +------------- Examples of behavior that contributes to creating a positive environment include: @@ -35,7 +42,8 @@ Examples of unacceptable behavior by participants include: * Other conduct which could reasonably be considered inappropriate in a professional setting -## Our Responsibilities +Our Responsibilities +-------------------- Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in @@ -47,7 +55,8 @@ that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful. 
-## Scope
+Scope
+-----
 
 This Code of Conduct applies both within project spaces and in public spaces
 when an individual is representing the project or its community. Examples of
@@ -56,11 +65,12 @@ address, posting via an official social media account, or acting as an
 appointed representative at an online or offline event. Representation of a
 project may be further defined and clarified by project maintainers.
 
-## Enforcement
+Enforcement
+-----------
 
 Instances of abusive, harassing, or otherwise unacceptable behavior may be
 reported by contacting the project team at
-[cuda-python-conduct@nvidia.com](mailto:cuda-python-conduct@nvidia.com) All
+`cuda-python-conduct@nvidia.com <mailto:cuda-python-conduct@nvidia.com>`_ All
 complaints will be reviewed and investigated and will result in a response that
 is deemed necessary and appropriate to the circumstances. The project team is
 obligated to maintain confidentiality with regard to the reporter of an
@@ -71,12 +81,11 @@ Project maintainers who do not follow or enforce the Code of Conduct in good
 faith may face temporary or permanent repercussions as determined by other
 members of the project's leadership.
 
-## Attribution
+Attribution
+-----------
 
-This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
+This Code of Conduct is adapted from the `Contributor Covenant <https://www.contributor-covenant.org>`_, version 1.4,
 available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
 
-[homepage]: https://www.contributor-covenant.org
-
 For answers to common questions about this code of conduct, see
 https://www.contributor-covenant.org/faq
diff --git a/cuda_core/docs/source/getting-started.md b/cuda_core/docs/source/getting-started.md
deleted file mode 100644
index 6fffa364e0..0000000000
--- a/cuda_core/docs/source/getting-started.md
+++ /dev/null
@@ -1,114 +0,0 @@
-# Overview
-
-## What is `cuda core`?
-
-`cuda.core` provides a Pythonic interface to the CUDA runtime and other functionality,
-including:
-
-- Compiling and launching CUDA kernels
-- Asynchronous concurrent execution with CUDA graphs, streams and events
-- Coordinating work across multiple CUDA devices
-- Allocating, transferring, and managing device memory
-- Runtime linking of device code with Link-Time Optimization (LTO)
-- and much more!
-
-Rather than providing 1:1 equivalents of the CUDA driver and runtime APIs
-(for that, see [`cuda.bindings`][bindings]), `cuda.core` provides high-level constructs such as:
-
-- {class}`Device <cuda.core.experimental.Device>` class for GPU device operations and context management.
-- {class}`Buffer <cuda.core.experimental.Buffer>` and {class}`MemoryResource <cuda.core.experimental.MemoryResource>` classes for memory allocation and management.
-- {class}`Program <cuda.core.experimental.Program>` for JIT compilation of CUDA kernels.
-- {class}`GraphBuilder <cuda.core.experimental.GraphBuilder>` for building and executing CUDA graphs.
-- {class}`Stream <cuda.core.experimental.Stream>` and {class}`Event <cuda.core.experimental.Event>` for asynchronous execution and timing.
-
-## Example: Compiling and Launching a CUDA kernel
-
-To get a taste for `cuda.core`, let's walk through a simple example that compiles and launches a vector addition kernel.
-You can find the complete example in [`vector_add.py`][vector_add_example].
-
-First, we define a string containing the CUDA C++ kernel. Note that this is a templated kernel:
-
-```python
-# compute c = a + b
-code = """
-template<typename T>
-__global__ void vector_add(const T* A,
-                           const T* B,
-                           T* C,
-                           size_t N) {
-    const unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
-    for (size_t i=tid; i<N; i+=gridDim.x*blockDim.x) {
-        C[i] = A[i] + B[i];
-    }
-}
-"""
-```
-
-Next, we obtain the GPU device using the {class}`Device <cuda.core.experimental.Device>` object
-and a corresponding {class}`Stream <cuda.core.experimental.Stream>`.
-Don't forget to use {meth}`Device.set_current() <cuda.core.experimental.Device.set_current>`!
-
-```python
-import cupy as cp
-from cuda.core.experimental import Device, LaunchConfig, Program, ProgramOptions, launch
-
-dev = Device()
-dev.set_current()
-s = dev.create_stream()
-```
-
-Next, we compile the CUDA C++ kernel from earlier using the {class}`Program <cuda.core.experimental.Program>` class.
-The result of the compilation is saved as a CUBIN.
-Note the use of the `name_expressions` parameter to the {meth}`Program.compile() <cuda.core.experimental.Program.compile>` method to specify which kernel template instantiations to compile:
-
-```python
-arch = "".join(f"{i}" for i in dev.compute_capability)
-program_options = ProgramOptions(std="c++17", arch=f"sm_{arch}")
-prog = Program(code, code_type="c++", options=program_options)
-mod = prog.compile("cubin", name_expressions=("vector_add<float>",))
-```
-
-Next, we retrieve the compiled kernel from the CUBIN and prepare the arguments and kernel configuration.
-We're using [CuPy][cupy] arrays as inputs for this example, but you can use PyTorch tensors too
-(we show how to do this in one of our [examples][examples]).
-
-```python
-ker = mod.get_kernel("vector_add<float>")
-
-# Prepare input/output arrays (using CuPy)
-size = 50000
-rng = cp.random.default_rng()
-a = rng.random(size, dtype=cp.float32)
-b = rng.random(size, dtype=cp.float32)
-c = cp.empty_like(a)
-
-# Configure launch parameters
-block = 256
-grid = (size + block - 1) // block
-config = LaunchConfig(grid=grid, block=block)
-```
-
-Finally, we use the {func}`launch <cuda.core.experimental.launch>` function to execute our kernel on the specified stream with the given configuration and arguments. Note the use of `.data.ptr` to get the pointer to the array data.
-
-```python
-launch(s, config, ker, a.data.ptr, b.data.ptr, c.data.ptr, cp.uint64(size))
-s.sync()
-```
-
-This example demonstrates one of the core workflows enabled by `cuda.core`: compiling and launching CUDA code.
-Note the clean, Pythonic interface, and absence of any direct calls to the CUDA runtime/driver APIs.
-
-## Examples and Recipes
-
-As we mentioned before, `cuda.core` can do much more than just compile and launch kernels.
-
-The best way to explore and learn the different features `cuda.core` is through
-our [`examples`][examples]. Find one that matches your use-case, and modify it to fit your needs!
-
-
-[bindings]: https://nvidia.github.io/cuda-python/cuda-bindings/latest/
-[cai]: https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html
-[cupy]: https://cupy.dev/
-[dlpack]: https://dmlc.github.io/dlpack/latest/
-[examples]: https://github.com/NVIDIA/cuda-python/tree/main/cuda_core/examples
-[vector_add_example]: https://github.com/NVIDIA/cuda-python/tree/main/cuda_core/examples/vector_add.py
diff --git a/cuda_core/docs/source/getting-started.rst b/cuda_core/docs/source/getting-started.rst
new file mode 100644
index 0000000000..502ea66375
--- /dev/null
+++ b/cuda_core/docs/source/getting-started.rst
@@ -0,0 +1,114 @@
+.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+.. SPDX-License-Identifier: Apache-2.0
+
+.. currentmodule:: cuda.core.experimental
+
+Overview
+========
+
+What is ``cuda.core``?
+----------------------
+
+``cuda.core`` provides a Pythonic interface to the CUDA runtime and other functionality,
+including:
+
+- Compiling and launching CUDA kernels
+- Asynchronous concurrent execution with CUDA graphs, streams and events
+- Coordinating work across multiple CUDA devices
+- Allocating, transferring, and managing device memory
+- Runtime linking of device code with Link-Time Optimization (LTO)
+- and much more!
+
+Rather than providing 1:1 equivalents of the CUDA driver and runtime APIs
+(for that, see `cuda.bindings <https://nvidia.github.io/cuda-python/cuda-bindings/latest/>`_), ``cuda.core`` provides high-level constructs such as:
+
+- :class:`Device` class for GPU device operations and context management.
+- :class:`Buffer` and :class:`MemoryResource` classes for memory allocation and management.
+- :class:`Program` for JIT compilation of CUDA kernels.
+- :class:`GraphBuilder` for building and executing CUDA graphs.
+- :class:`Stream` and :class:`Event` for asynchronous execution and timing.
+
+Example: Compiling and Launching a CUDA kernel
+----------------------------------------------
+
+To get a taste for ``cuda.core``, let's walk through a simple example that compiles and launches a vector addition kernel.
+You can find the complete example in `vector_add.py <https://github.com/NVIDIA/cuda-python/tree/main/cuda_core/examples/vector_add.py>`_.
+
+First, we define a string containing the CUDA C++ kernel. Note that this is a templated kernel:
+
+.. code-block:: python
+
+   # compute c = a + b
+   code = """
+   template<typename T>
+   __global__ void vector_add(const T* A,
+                              const T* B,
+                              T* C,
+                              size_t N) {
+       const unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
+       for (size_t i=tid; i<N; i+=gridDim.x*blockDim.x) {
+           C[i] = A[i] + B[i];
+       }
+   }
+   """
+
+Next, we obtain the GPU device using the :class:`Device` object
+and a corresponding :class:`Stream`.
+Don't forget to use :meth:`Device.set_current`!
+
+.. code-block:: python
+
+   import cupy as cp
+   from cuda.core.experimental import Device, LaunchConfig, Program, ProgramOptions, launch
+
+   dev = Device()
+   dev.set_current()
+   s = dev.create_stream()
+
+Next, we compile the CUDA C++ kernel from earlier using the :class:`Program` class.
+The result of the compilation is saved as a CUBIN.
+Note the use of the ``name_expressions`` parameter to the :meth:`Program.compile` method to specify which kernel template instantiations to compile:
+
+.. code-block:: python
+
+   arch = "".join(f"{i}" for i in dev.compute_capability)
+   program_options = ProgramOptions(std="c++17", arch=f"sm_{arch}")
+   prog = Program(code, code_type="c++", options=program_options)
+   mod = prog.compile("cubin", name_expressions=("vector_add<float>",))
+
+Next, we retrieve the compiled kernel from the CUBIN and prepare the arguments and kernel configuration.
+We're using `CuPy <https://cupy.dev/>`_ arrays as inputs for this example, but you can use PyTorch tensors too
+(we show how to do this in one of our `examples <https://github.com/NVIDIA/cuda-python/tree/main/cuda_core/examples>`_).
+
+.. code-block:: python
+
+   ker = mod.get_kernel("vector_add<float>")
+
+   # Prepare input/output arrays (using CuPy)
+   size = 50000
+   rng = cp.random.default_rng()
+   a = rng.random(size, dtype=cp.float32)
+   b = rng.random(size, dtype=cp.float32)
+   c = cp.empty_like(a)
+
+   # Configure launch parameters
+   block = 256
+   grid = (size + block - 1) // block
+   config = LaunchConfig(grid=grid, block=block)
+
+Finally, we use the :func:`launch` function to execute our kernel on the specified stream with the given configuration and arguments. Note the use of ``.data.ptr`` to get the pointer to the array data.
+
+.. code-block:: python
+
+   launch(s, config, ker, a.data.ptr, b.data.ptr, c.data.ptr, cp.uint64(size))
+   s.sync()
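+
+As a quick sanity check (reusing the CuPy arrays defined above), we can verify the kernel's output against CuPy's own elementwise sum:
+
+.. code-block:: python
+
+   # The kernel wrote c = a + b on the GPU; the comparison also runs on-device
+   assert cp.allclose(c, a + b)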
+
+This example demonstrates one of the core workflows enabled by ``cuda.core``: compiling and launching CUDA code.
+Note the clean, Pythonic interface, and the absence of any direct calls to the CUDA runtime/driver APIs.
+
+Examples and Recipes
+--------------------
+
+As we mentioned before, ``cuda.core`` can do much more than just compile and launch kernels.
+
+The best way to explore and learn the different features of ``cuda.core`` is through
+our `examples <https://github.com/NVIDIA/cuda-python/tree/main/cuda_core/examples>`_. Find one that matches your use-case, and modify it to fit your needs!
diff --git a/cuda_core/docs/source/install.md b/cuda_core/docs/source/install.md
deleted file mode 100644
index 4f66eeff13..0000000000
--- a/cuda_core/docs/source/install.md
+++ /dev/null
@@ -1,48 +0,0 @@
-# Installation
-
-## Runtime Requirements
-
-`cuda.core` is supported on all platforms that CUDA is supported. Specific
-dependencies are as follows:
-
-| | CUDA 11 | CUDA 12 |
-|------------------ | ------------ | ----------- |
-| CUDA Toolkit [^1] | 11.2 - 11.8 | 12.x |
-| Driver | 450.80.02+ (Linux), 452.39+ (Windows) | 525.60.13+ (Linux), 527.41+ (Windows) |
-
-[^1]: Including `cuda-python`.
-
-`cuda.core` supports Python 3.9 - 3.13, on Linux (x86-64, arm64) and Windows (x86-64).
-
-
-## Installing from PyPI
-
-`cuda.core` works with `cuda.bindings` (part of `cuda-python`) 11 or 12. Test dependencies now use the ``cuda-toolkit`` metapackage for improved dependency resolution. For example with CUDA 12:
-```console
-$ pip install cuda-core[cu12]
-```
-and likewise use `[cu11]` for CUDA 11, or `[cu13]` for CUDA 13.
-
-Note that using `cuda.core` with NVRTC installed from PyPI via `pip install` requires
-`cuda.bindings` 12.8.0+ or 11.8.6+. Likewise, with nvJitLink it requires 12.8.0+.
-
-
-## Installing from Conda (conda-forge)
-
-Same as above, `cuda.core` can be installed in a CUDA 11 or 12 environment. For example with CUDA 12:
-```console
-$ conda install -c conda-forge cuda-core cuda-version=12
-```
-and likewise use `cuda-version=11` for CUDA 11.
-
-Note that to use `cuda.core` with nvJitLink installed from conda-forge requires `cuda.bindings` 12.8.0+.
-
-
-## Installing from Source
-
-```console
-$ git clone https://github.com/NVIDIA/cuda-python
-$ cd cuda-python/cuda_core
-$ pip install .
-```
-`cuda-bindings` 11.x or 12.x is a required dependency.
diff --git a/cuda_core/docs/source/install.rst b/cuda_core/docs/source/install.rst
new file mode 100644
index 0000000000..8bc1faa0e1
--- /dev/null
+++ b/cuda_core/docs/source/install.rst
@@ -0,0 +1,67 @@
+.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+.. SPDX-License-Identifier: Apache-2.0
+
+Installation
+============
+
+Runtime Requirements
+--------------------
+
+``cuda.core`` is supported on all platforms on which CUDA is supported. Specific
+dependencies are as follows:
+
+.. list-table::
+   :header-rows: 1
+
+   * -
+     - CUDA 11
+     - CUDA 12
+   * - CUDA Toolkit\ [#f1]_
+     - 11.2 - 11.8
+     - 12.x
+   * - Driver
+     - 450.80.02+ (Linux), 452.39+ (Windows)
+     - 525.60.13+ (Linux), 527.41+ (Windows)
+
+.. [#f1] Including ``cuda-python``.
+
+
+``cuda.core`` supports Python 3.9 - 3.13, on Linux (x86-64, arm64) and Windows (x86-64).
+
+Installing from PyPI
+--------------------
+
+``cuda.core`` works with ``cuda.bindings`` (part of ``cuda-python``) 11 or 12. Test dependencies now use the ``cuda-toolkit`` metapackage for improved dependency resolution. For example with CUDA 12:
+
+.. code-block:: console
+
+   $ pip install cuda-core[cu12]
+
+and likewise use ``[cu11]`` for CUDA 11, or ``[cu13]`` for CUDA 13.
+
+Note that using ``cuda.core`` with NVRTC installed from PyPI via ``pip install`` requires
+``cuda.bindings`` 12.8.0+ or 11.8.6+. Likewise, with nvJitLink it requires 12.8.0+.
+
+Installing from Conda (conda-forge)
+-----------------------------------
+
+Same as above, ``cuda.core`` can be installed in a CUDA 11 or 12 environment. For example with CUDA 12:
+
+.. code-block:: console
+
+   $ conda install -c conda-forge cuda-core cuda-version=12
+
+and likewise use ``cuda-version=11`` for CUDA 11.
+
+Note that using ``cuda.core`` with nvJitLink installed from conda-forge requires ``cuda.bindings`` 12.8.0+.
+
+Installing from Source
+----------------------
+
+.. code-block:: console
+
+   $ git clone https://github.com/NVIDIA/cuda-python
+   $ cd cuda-python/cuda_core
+   $ pip install .
+
+``cuda-bindings`` 11.x or 12.x is a required dependency.
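+
+To quickly verify an installation, a minimal check (this assumes the installed ``cuda.core`` exposes ``__version__``):
+
+.. code-block:: console
+
+   $ python -c "import cuda.core; print(cuda.core.__version__)"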
diff --git a/cuda_core/docs/source/release.rst b/cuda_core/docs/source/release.rst
index 954d296e29..dc28b31220 100644
--- a/cuda_core/docs/source/release.rst
+++ b/cuda_core/docs/source/release.rst
@@ -7,10 +7,10 @@ Release Notes
 .. toctree::
    :maxdepth: 3
 
-   release/0.X.Y-notes
-   release/0.3.2-notes
-   release/0.3.1-notes
-   release/0.3.0-notes
-   release/0.2.0-notes
-   release/0.1.1-notes
-   release/0.1.0-notes
+   0.X.Y <release/0.X.Y-notes>
+   0.3.2 <release/0.3.2-notes>
+   0.3.1 <release/0.3.1-notes>
+   0.3.0 <release/0.3.0-notes>
+   0.2.0 <release/0.2.0-notes>
+   0.1.1 <release/0.1.1-notes>
+   0.1.0 <release/0.1.0-notes>
diff --git a/cuda_core/docs/source/release/0.3.1-notes.rst b/cuda_core/docs/source/release/0.3.1-notes.rst
index 33ea3b48e4..82138763db 100644
--- a/cuda_core/docs/source/release/0.3.1-notes.rst
+++ b/cuda_core/docs/source/release/0.3.1-notes.rst
@@ -12,7 +12,7 @@ Released on July 2, 2025
 Highlights
 ----------
 
-- Add a :doc:`Getting Started <getting-started>` page.
+- Add a :doc:`Getting Started <../getting-started>` page.
 
 - :class:`Stream` and :class:`Event` creation and some operations are made faster.
diff --git a/cuda_python/docs/source/_static/logo-dark-mode.png b/cuda_python/docs/source/_static/logo-dark-mode.png
deleted file mode 100644
index 6b005a283b..0000000000
Binary files a/cuda_python/docs/source/_static/logo-dark-mode.png and /dev/null differ
diff --git a/cuda_python/docs/source/_static/logo-light-mode.png b/cuda_python/docs/source/_static/logo-light-mode.png
deleted file mode 100644
index c07d6848c9..0000000000
Binary files a/cuda_python/docs/source/_static/logo-light-mode.png and /dev/null differ
diff --git a/cuda_python/docs/source/index.rst b/cuda_python/docs/source/index.rst
index 49a53b6499..d11cdbd7ed 100644
--- a/cuda_python/docs/source/index.rst
+++ b/cuda_python/docs/source/index.rst
@@ -32,7 +32,7 @@ be available, please refer to the `cuda.bindings`_ documentation for installatio
    :maxdepth: 2
    :caption: Contents:
 
-   release.md
+   release
    cuda.core
    cuda.bindings
    cuda.pathfinder
diff --git a/cuda_python/docs/source/release.md b/cuda_python/docs/source/release.md
deleted file mode 100644
index c73f21ef41..0000000000
--- a/cuda_python/docs/source/release.md
+++ /dev/null
@@ -1,18 +0,0 @@
-# Release Notes
-
-```{toctree}
----
-maxdepth: 3
----
-
- 13.0.1 <release/13.0.1-notes>
- 13.0.0 <release/13.0.0-notes>
- 12.9.2 <release/12.9.2-notes>
- 12.9.1 <release/12.9.1-notes>
- 12.9.0 <release/12.9.0-notes>
- 12.8.0 <release/12.8.0-notes>
- 12.6.2 <release/12.6.2-notes>
- 12.6.1 <release/12.6.1-notes>
- 11.8.7 <release/11.8.7-notes>
- 11.8.6 <release/11.8.6-notes>
-```
diff --git a/cuda_python/docs/source/release.rst b/cuda_python/docs/source/release.rst
new file mode 100644
index 0000000000..c97e508c45
--- /dev/null
+++ b/cuda_python/docs/source/release.rst
@@ -0,0 +1,20 @@
+.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE
+
+Release Notes
+=============
+
+.. toctree::
+   :maxdepth: 3
+
+   13.0.1 <release/13.0.1-notes>
+   13.0.0 <release/13.0.0-notes>
+   12.9.2 <release/12.9.2-notes>
+   12.9.1 <release/12.9.1-notes>
+   12.9.0 <release/12.9.0-notes>
+   12.8.0 <release/12.8.0-notes>
+   12.6.2 <release/12.6.2-notes>
+   12.6.1 <release/12.6.1-notes>
+   11.8.7 <release/11.8.7-notes>
+   11.8.6 <release/11.8.6-notes>
+
diff --git a/cuda_python/docs/source/release/11.8.6-notes.md b/cuda_python/docs/source/release/11.8.6-notes.md
deleted file mode 100644
index c67b71bcc2..0000000000
--- a/cuda_python/docs/source/release/11.8.6-notes.md
+++ /dev/null
@@ -1,15 +0,0 @@
-# CUDA Python 11.8.6 Release notes
-
-Released on January 24, 2025.
-
-## Included components
-
-- [`cuda.bindings` 11.8.6](https://nvidia.github.io/cuda-python/cuda-bindings/12.8.0/release/11.8.6-notes.html)
-
-
-## Highlights
-
-- Support Python 3.13
-- Add optional dependencies on the CUDA NVRTC wheel
-- Enable discovery and loading of shared libraries from CUDA wheels
-- `cuda-python` is now a meta package, currently depending only on `cuda-bindings` ([see RFC](https://github.com/NVIDIA/cuda-python/issues/105))
diff --git a/cuda_python/docs/source/release/11.8.6-notes.rst b/cuda_python/docs/source/release/11.8.6-notes.rst
new file mode 100644
index 0000000000..9d726c5b07
--- /dev/null
+++ b/cuda_python/docs/source/release/11.8.6-notes.rst
@@ -0,0 +1,20 @@
+.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE
+
+CUDA Python 11.8.6 Release notes
+================================
+
+Released on January 24, 2025.
+
+Included components
+-------------------
+
+* `cuda.bindings 11.8.6 <https://nvidia.github.io/cuda-python/cuda-bindings/12.8.0/release/11.8.6-notes.html>`_
+
+Highlights
+----------
+
+- Support Python 3.13
+- Add optional dependencies on the CUDA NVRTC wheel
+- Enable discovery and loading of shared libraries from CUDA wheels
+- ``cuda-python`` is now a meta package, currently depending only on ``cuda-bindings`` (`see RFC <https://github.com/NVIDIA/cuda-python/issues/105>`_)
diff --git a/cuda_python/docs/source/release/12.6.1-notes.md b/cuda_python/docs/source/release/12.6.1-notes.md
deleted file mode 100644
index 9a812afc9f..0000000000
--- a/cuda_python/docs/source/release/12.6.1-notes.md
+++ /dev/null
@@ -1,12 +0,0 @@
-# CUDA Python Release notes
-
-Released on Oct 7, 2024
-
-## Included components
-
-- [`cuda.bindings` 12.6.1](https://nvidia.github.io/cuda-python/cuda-bindings/12.6.1/release/12.6.1-notes.html)
-
-
-## Hightlights
-- Internal layout refactoring to prepare for the `cuda-python` metapackage ([Issue #90](https://github.com/NVIDIA/cuda-python/issues/90),
-  [Issue #75](https://github.com/NVIDIA/cuda-python/issues/75))
diff --git a/cuda_python/docs/source/release/12.6.1-notes.rst b/cuda_python/docs/source/release/12.6.1-notes.rst
new file mode 100644
index 0000000000..a882ffea63
--- /dev/null
+++ b/cuda_python/docs/source/release/12.6.1-notes.rst
@@ -0,0 +1,17 @@
+.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE
+
+CUDA Python Release notes
+=========================
+
+Released on Oct 7, 2024
+
+Included components
+-------------------
+
+* `cuda.bindings 12.6.1 <https://nvidia.github.io/cuda-python/cuda-bindings/12.6.1/release/12.6.1-notes.html>`_
+
+Highlights
+----------
+- Internal layout refactoring to prepare for the ``cuda-python`` metapackage (`Issue #90 <https://github.com/NVIDIA/cuda-python/issues/90>`_,
+  `Issue #75 <https://github.com/NVIDIA/cuda-python/issues/75>`_)
diff --git a/cuda_python/docs/source/release/12.6.2-notes.md b/cuda_python/docs/source/release/12.6.2-notes.md
deleted file mode 100644
index 96c90e2adb..0000000000
--- a/cuda_python/docs/source/release/12.6.2-notes.md
+++ /dev/null
@@ -1,12 +0,0 @@
-# CUDA Python Release notes
-
-Released on November 5, 2024. Post 1 rebuild released on November 12, 2024.
-
-## Included components
-
-- [`cuda.bindings` 12.6.2](https://nvidia.github.io/cuda-python/cuda-bindings/12.6.2/release/12.6.2-notes.html)
-
-
-## Hightlights
-- Resolve [Issue #215](https://github.com/NVIDIA/cuda-python/issues/215): module `cuda.ccudart` has no attribute `__pyx_capi__`
-- Resolve [Issue #226](https://github.com/NVIDIA/cuda-python/issues/226): top-level Cython source files not packaged
diff --git a/cuda_python/docs/source/release/12.6.2-notes.rst b/cuda_python/docs/source/release/12.6.2-notes.rst
new file mode 100644
index 0000000000..b091fe1de3
--- /dev/null
+++ b/cuda_python/docs/source/release/12.6.2-notes.rst
@@ -0,0 +1,17 @@
+.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE
+
+CUDA Python Release notes
+=========================
+
+Released on November 5, 2024. Post 1 rebuild released on November 12, 2024.
+
+Included components
+-------------------
+
+* `cuda.bindings 12.6.2 <https://nvidia.github.io/cuda-python/cuda-bindings/12.6.2/release/12.6.2-notes.html>`_
+
+Highlights
+----------
+- Resolve `Issue #215 <https://github.com/NVIDIA/cuda-python/issues/215>`_: module ``cuda.ccudart`` has no attribute ``__pyx_capi__``
+- Resolve `Issue #226 <https://github.com/NVIDIA/cuda-python/issues/226>`_: top-level Cython source files not packaged
diff --git a/cuda_python/docs/source/release/12.8.0-notes.md b/cuda_python/docs/source/release/12.8.0-notes.md
deleted file mode 100644
index a5df49da2d..0000000000
--- a/cuda_python/docs/source/release/12.8.0-notes.md
+++ /dev/null
@@ -1,21 +0,0 @@
-# CUDA Python 12.8.0 Release notes
-
-Released on January 24, 2025.
-
-## Included components
-
-- [`cuda.bindings` 12.8.0](https://nvidia.github.io/cuda-python/cuda-bindings/12.8.0/release/12.8.0-notes.html)
-
-
-## Highlights
-
-- Support Python 3.13
-- Add bindings for nvJitLink (requires nvJitLink from CUDA 12.3 or above)
-- Add optional dependencies on CUDA NVRTC and nvJitLink wheels
-- Enable discovery and loading of shared libraries from CUDA wheels
-- `cuda-python` is now a meta package, currently depending only on `cuda-bindings` ([see RFC](https://github.com/NVIDIA/cuda-python/issues/105))
-
-
-## Known issues
-
-- Updating from older versions (v12.6.2.post1 and below) via `pip install -U cuda-python` might not work. Please do a clean re-installation by uninstalling `pip uninstall -y cuda-python` followed by installing `pip install cuda-python`.
diff --git a/cuda_python/docs/source/release/12.8.0-notes.rst b/cuda_python/docs/source/release/12.8.0-notes.rst
new file mode 100644
index 0000000000..6634c4ea64
--- /dev/null
+++ b/cuda_python/docs/source/release/12.8.0-notes.rst
@@ -0,0 +1,26 @@
+.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+.. SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE
+
+CUDA Python 12.8.0 Release notes
+================================
+
+Released on January 24, 2025.
+
+Included components
+-------------------
+
+* `cuda.bindings 12.8.0 <https://nvidia.github.io/cuda-python/cuda-bindings/12.8.0/release/12.8.0-notes.html>`_
+
+Highlights
+----------
+
+- Support Python 3.13
+- Add bindings for nvJitLink (requires nvJitLink from CUDA 12.3 or above)
+- Add optional dependencies on CUDA NVRTC and nvJitLink wheels
+- Enable discovery and loading of shared libraries from CUDA wheels
+- ``cuda-python`` is now a meta package, currently depending only on ``cuda-bindings`` (`see RFC <https://github.com/NVIDIA/cuda-python/issues/105>`_)
+
+Known issues
+------------
+
+- Updating from older versions (v12.6.2.post1 and below) via ``pip install -U cuda-python`` might not work. Please do a clean re-installation: first run ``pip uninstall -y cuda-python``, then run ``pip install cuda-python``.
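+
+  For example:
+
+  .. code-block:: console
+
+     $ pip uninstall -y cuda-python
+     $ pip install cuda-python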