106 changes: 104 additions & 2 deletions docs/guides/configuration.md
@@ -320,10 +320,14 @@ The cache directory is automatically created if it doesn't exist. You can clear

SQLMesh creates schemas, physical tables, and views in the data warehouse/engine. Learn more about why and how SQLMesh creates schema in the ["Why does SQLMesh create schemas?" FAQ](../faq/faq.md#schema-question).

The default SQLMesh behavior described in the FAQ is appropriate for most deployments, but you can override where SQLMesh creates physical tables and views with the `physical_schema_mapping`, `environment_suffix_target`, and `environment_catalog_mapping` configuration options. These options are in the [environments](../reference/configuration.md#environments) section of the configuration reference page.
The default SQLMesh behavior described in the FAQ is appropriate for most deployments, but you can override *where* SQLMesh creates physical tables and views with the `physical_schema_mapping`, `environment_suffix_target`, and `environment_catalog_mapping` configuration options.

You can also override *what* the physical tables are called by using the `physical_table_naming_convention` option.

These options are in the [environments](../reference/configuration.md#environments) section of the configuration reference page.

#### Physical table schemas
By default, SQLMesh creates physical tables for a model with a naming convention of `sqlmesh__[model schema]`.
By default, SQLMesh creates physical schemas for a model with a naming convention of `sqlmesh__[model schema]`.

This can be overridden on a per-schema basis using the `physical_schema_mapping` option, which removes the `sqlmesh__` prefix and uses the [regex pattern](https://docs.python.org/3/library/re.html#regular-expression-syntax) you provide to map the schemas defined in your model to their corresponding physical schemas.

@@ -436,6 +440,104 @@ Given the example of a model called `my_schema.users` with a default catalog of
- Using `environment_suffix_target: catalog` only works on engines that support querying across different catalogs. If your engine does not support cross-catalog queries then you will need to use `environment_suffix_target: schema` or `environment_suffix_target: table` instead.
- Automatic catalog creation is not supported on all engines even if they support cross-catalog queries. For engines where it is not supported, the catalogs must be managed externally from SQLMesh and exist prior to invoking SQLMesh.

#### Physical table naming convention

Out of the box, SQLMesh has the following defaults set:

- `environment_suffix_target: schema`
- `physical_table_naming_convention: schema_and_table`
- no `physical_schema_mapping` overrides, so a `sqlmesh__<model schema>` physical schema will be created for each model schema

This means that given a catalog of `warehouse` and a model named `finance_mart.transaction_events_over_threshold`, SQLMesh will create physical tables using the following convention:

```
# <catalog>.sqlmesh__<schema>.<schema>__<table>__<fingerprint>

warehouse.sqlmesh__finance_mart.finance_mart__transaction_events_over_threshold__<fingerprint>
```

This deliberately contains some redundancy: the *model* schema is repeated at the physical layer in both the physical schema name and the physical table name.

This default exists to make the physical table names portable between different configurations. If you define a `physical_schema_mapping` that maps all models to the same physical schema, there are no naming conflicts because the model schema is also included in the table name.
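
To make the portability argument concrete, here is a minimal sketch of the default naming pattern. The helper `schema_and_table_name` and the fingerprint value are hypothetical and only mirror the `<schema>__<table>__<fingerprint>` pattern shown above; this is not SQLMesh's internal code.

```python
def schema_and_table_name(model_schema: str, model_table: str, fingerprint: str) -> str:
    """Hypothetical helper mirroring <schema>__<table>__<fingerprint>."""
    return f"{model_schema}__{model_table}__{fingerprint}"


# Two models that share a table name but live in different model schemas.
# The fingerprint is a placeholder, held constant to isolate the schema's effect.
name_a = schema_and_table_name("foo", "model_a", "abc123")
name_b = schema_and_table_name("bar", "model_a", "abc123")

# Even if both land in one shared physical schema, the table names stay distinct.
print(name_a)
print(name_b)
```

Because the model schema is part of the table name itself, the two names differ no matter which physical schema they are placed in.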

##### Table only

Some engines have object name length limitations which cause them to [silently truncate](https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS) table and view names that exceed this limit. This behaviour breaks SQLMesh, so we raise a runtime error if we detect the engine would silently truncate the name of the table we are trying to create.
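
The kind of guard described above can be pictured roughly as follows. This is a sketch under stated assumptions rather than SQLMesh's actual check: `check_identifier_length` and the constant name are hypothetical, and the 63-byte figure is PostgreSQL's default identifier limit from the page linked above.

```python
# Assumption: PostgreSQL's default limit (NAMEDATALEN - 1), measured in bytes.
POSTGRES_MAX_IDENTIFIER_BYTES = 63


def check_identifier_length(name: str, max_bytes: int = POSTGRES_MAX_IDENTIFIER_BYTES) -> str:
    """Hypothetical guard: fail loudly instead of letting the engine truncate."""
    length = len(name.encode("utf-8"))
    if length > max_bytes:
        raise ValueError(
            f"Identifier '{name}' is {length} bytes; the engine would truncate it to {max_bytes}."
        )
    return name
```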

Having redundancy in the physical table names does reduce the number of characters that can be utilised in model names. To increase the number of characters available to model names, you can use `physical_table_naming_convention` like so:

=== "YAML"

```yaml linenums="1"
physical_table_naming_convention: table_only
```

=== "Python"

```python linenums="1"
from sqlmesh.core.config import Config, ModelDefaultsConfig, TableNamingConvention

config = Config(
    model_defaults=ModelDefaultsConfig(dialect=<dialect>),
    physical_table_naming_convention=TableNamingConvention.TABLE_ONLY,
)
```

This will cause SQLMesh to omit the model schema from the table name and generate physical names that look like (using the above example):
```
# <catalog>.sqlmesh__<schema>.<table>__<fingerprint>

warehouse.sqlmesh__finance_mart.transaction_events_over_threshold__<fingerprint>
```

Notice that the model schema name is no longer part of the physical table name. This allows for slightly longer model names on engines with low identifier length limits, which may be useful for your project.
Contributor

Reading the docs right now doesn't make it obvious why the schema is included in the physical table's name by default. Should we include an example to explain the rationale behind that choice?

Collaborator Author

To be honest, I don't know what the rationale behind the original choice was.

The only thing I can think of is to allow someone to set:

physical_schema_mapping:
  '.*': some_schema

Which would map every model to the same physical schema regardless of the model schema. In this situation, including the model schema within the model's table name would be helpful to disambiguate foo.model_a and bar.model_a if they would both be written as some_schema.model_a at the physical layer

Collaborator Author

So I believe the default exists because it was the lowest common denominator that worked across any rendition of physical_schema_mapping.

I've updated the docs


In this configuration, it is your responsibility to ensure that any schema overrides in `physical_schema_mapping` result in each model schema getting mapped to a unique physical schema.

For example, the following configuration will cause **data corruption**:

```yaml
physical_table_naming_convention: table_only
physical_schema_mapping:
  '.*': sqlmesh
```

This is because every model schema is mapped to the same physical schema but the model schema name is omitted from the physical table name.
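
The collision can be illustrated with a short sketch. Everything here is hypothetical: `physical_schema` and `table_only_location` are illustrative helpers, the first-match-wins use of `re.match` is an assumption about how patterns are applied, and the fingerprint is a placeholder held constant to isolate the effect of dropping the schema from the table name.

```python
import re


def physical_schema(model_schema: str, mapping: dict[str, str]) -> str:
    # Assumption: the first matching pattern wins; otherwise use the default prefix.
    for pattern, target in mapping.items():
        if re.match(pattern, model_schema):
            return target
    return f"sqlmesh__{model_schema}"


def table_only_location(
    model_schema: str, model_table: str, mapping: dict[str, str], fingerprint: str = "abc123"
) -> str:
    # table_only: the model schema is NOT part of the physical table name.
    return f"{physical_schema(model_schema, mapping)}.{model_table}__{fingerprint}"


models = [("foo", "model_a"), ("bar", "model_a")]

# Every model schema mapped to one physical schema: both models share one location.
colliding = {table_only_location(s, t, {".*": "sqlmesh"}) for s, t in models}

# No overrides: each model schema keeps its own physical schema.
distinct = {table_only_location(s, t, {}) for s, t in models}

print(len(colliding), len(distinct))
```

With the catch-all mapping the two models resolve to a single physical location, whereas without overrides they remain separate.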

##### MD5 hash

If you *still* need more characters, you can set `physical_table_naming_convention: hash_md5` like so:

=== "YAML"

```yaml linenums="1"
physical_table_naming_convention: hash_md5
```

=== "Python"

```python linenums="1"
from sqlmesh.core.config import Config, ModelDefaultsConfig, TableNamingConvention

config = Config(
    model_defaults=ModelDefaultsConfig(dialect=<dialect>),
    physical_table_naming_convention=TableNamingConvention.HASH_MD5,
)
```

This will cause SQLMesh to generate physical names that are always 45-50 characters in length and look something like:

```
# sqlmesh_md5__<hash of what we would have generated using 'schema_and_table'>

sqlmesh_md5__d3b07384d113edec49eaa6238ad5ff00

# or, for a dev preview
sqlmesh_md5__d3b07384d113edec49eaa6238ad5ff00__dev
```

The downside is that it becomes much harder to determine which table corresponds to which model just by looking at the database with a SQL client. However, the table names have a predictable length, so there are no longer any surprises with identifiers exceeding the maximum length at the physical layer.
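
The 45-50 character range follows from the shape of the name: a 13-character `sqlmesh_md5__` prefix, a 32-character hexadecimal MD5 digest, and an optional 5-character `__dev` suffix. The sketch below shows that arithmetic with Python's `hashlib`. The helper `hash_md5_name` is hypothetical, and the exact string SQLMesh feeds into the hash is an assumption here, so the digests it produces will not necessarily match real table names.

```python
import hashlib


def hash_md5_name(schema_and_table_name: str, is_dev_preview: bool = False) -> str:
    """Hypothetical helper: hash the name 'schema_and_table' would have produced."""
    digest = hashlib.md5(schema_and_table_name.encode("utf-8")).hexdigest()
    name = f"sqlmesh_md5__{digest}"
    # Dev previews carry an extra suffix, as in the example above.
    return f"{name}__dev" if is_dev_preview else name


# Placeholder input; the fingerprint value is illustrative only.
original = "finance_mart__transaction_events_over_threshold__abc123"

print(len(hash_md5_name(original)))
print(len(hash_md5_name(original, is_dev_preview=True)))
```

The length is the same regardless of how long the model name is, which is what makes this convention predictable.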

#### Environment view catalogs

By default, SQLMesh creates an environment view in the same [catalog](../concepts/glossary.md#catalog) as the physical table the view points to. The physical table's catalog is determined by either the catalog specified in the model name or the default catalog defined in the connection.
41 changes: 27 additions & 14 deletions docs/reference/configuration.md
@@ -16,32 +16,44 @@ This section describes the other root level configuration parameters.

Configuration options for SQLMesh project directories.

| Option | Description | Type | Required |
| ------------------ | ------------------------------------------------------------------------------------------------------------------ | :----------: | :------: |
| `ignore_patterns` | Files that match glob patterns specified in this list are ignored when scanning the project folder (Default: `[]`) | list[string] | N |
| `project` | The project name of this config. Used for [multi-repo setups](../guides/multi_repo.md). | string | N |
| Option | Description | Type | Required |
| ------------------ | --------------------------------------------------------------------------------------------------------------------------- | :----------: | :------: |
| `ignore_patterns` | Files that match glob patterns specified in this list are ignored when scanning the project folder (Default: `[]`) | list[string] | N |
| `project` | The project name of this config. Used for [multi-repo setups](../guides/multi_repo.md). | string | N |
| `cache_dir` | The directory to store the SQLMesh cache. Can be an absolute path or relative to the project directory. (Default: `.cache`) | string | N |
| `log_limit` | The default number of historical log files to keep (Default: `20`) | int | N |

### Environments
### Database (Physical Layer)

Configuration options for SQLMesh environment creation and promotion.
Configuration options for how SQLMesh manages database objects in the [physical layer](../concepts/glossary.md#physical-layer).

| Option | Description | Type | Required |
|-------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------:|:--------:|
| `snapshot_ttl` | The period of time that a model snapshot not a part of any environment should exist before being deleted. This is defined as a string with the default `in 1 week`. Other [relative dates](https://dateparser.readthedocs.io/en/latest/) can be used, such as `in 30 days`. (Default: `in 1 week`) | string | N |
| `physical_schema_override` | (Deprecated) Use `physical_schema_mapping` instead. A mapping from model schema names to names of schemas in which physical tables for the corresponding models will be placed. | dict[string, string] | N |
| `physical_schema_mapping` | A mapping from regular expressions to names of schemas in which physical tables for the corresponding models [will be placed](../guides/configuration.md#physical-table-schemas). (Default physical schema name: `sqlmesh__[model schema]`) | dict[string, string] | N |
| `physical_table_naming_convention`| Sets which parts of the model name are included in the physical table names. Options are `schema_and_table`, `table_only` or `hash_md5` - [additional details](../guides/configuration.md#physical-table-naming-convention). (Default: `schema_and_table`) | string | N |

### Environments (Virtual Layer)

Configuration options for how SQLMesh manages environment creation and promotion in the [virtual layer](../concepts/glossary.md#virtual-layer).

| Option | Description | Type | Required |
|-------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------:|:--------:|
| `environment_ttl` | The period of time that a development environment should exist before being deleted. This is defined as a string with the default `in 1 week`. Other [relative dates](https://dateparser.readthedocs.io/en/latest/) can be used, such as `in 30 days`. (Default: `in 1 week`) | string | N |
| `pinned_environments` | The list of development environments that are exempt from deletion due to expiration | list[string] | N |
| `time_column_format` | The default format to use for all model time columns. This time format uses [python format codes](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes) (Default: `%Y-%m-%d`) | string | N |
| `default_target_environment` | The name of the environment that will be the default target for the `sqlmesh plan` and `sqlmesh run` commands. (Default: `prod`) | string | N |
| `physical_schema_override` | (Deprecated) Use `physical_schema_mapping` instead. A mapping from model schema names to names of schemas in which physical tables for the corresponding models will be placed. | dict[string, string] | N |
| `physical_schema_mapping` | A mapping from regular expressions to names of schemas in which physical tables for the corresponding models [will be placed](../guides/configuration.md#physical-table-schemas). (Default physical schema name: `sqlmesh__[model schema]`) | dict[string, string] | N |
| `environment_suffix_target` | Whether SQLMesh views should append their environment name to the `schema` or `table` - [additional details](../guides/configuration.md#view-schema-override). (Default: `schema`) | string | N |
| `gateway_managed_virtual_layer` | Whether SQLMesh views of the virtual layer will be created by the default gateway or model specified gateways - [additional details](../guides/multi_engine.md#gateway-managed-virtual-layer). (Default: False) | boolean | N |
| `infer_python_dependencies` | Whether SQLMesh will statically analyze Python code to automatically infer Python package requirements. (Default: True) | boolean | N |
| `environment_suffix_target` | Whether SQLMesh views should append their environment name to the `schema`, `table` or `catalog` - [additional details](../guides/configuration.md#view-schema-override). (Default: `schema`) | string | N |
| `gateway_managed_virtual_layer` | Whether SQLMesh views of the virtual layer will be created by the default gateway or model specified gateways - [additional details](../guides/multi_engine.md#gateway-managed-virtual-layer). (Default: False) | boolean | N |
| `environment_catalog_mapping` | A mapping from regular expressions to catalog names. The catalog name is used to determine the target catalog for a given environment. | dict[string, string] | N |
| `log_limit` | The default number of logs to keep (Default: `20`) | int | N |

### Model defaults
### Models

| Option | Description | Type | Required |
|-------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------:|:--------:|
| `time_column_format` | The default format to use for all model time columns. This time format uses [python format codes](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes) (Default: `%Y-%m-%d`) | string | N |
| `infer_python_dependencies` | Whether SQLMesh will statically analyze Python code to automatically infer Python package requirements. (Default: True) | boolean | N |
| `model_defaults` | Default [properties](./model_configuration.md#model-defaults) to set on each model. At a minimum, `dialect` must be set. | dict[string, any] | Y |

The `model_defaults` key is **required** and must contain a value for the `dialect` key.

@@ -82,6 +94,7 @@ Configuration for the `sqlmesh plan` command.
| `no_diff` | Don't show diffs for changed models (Default: False) | boolean | N |
| `no_prompts` | Disables interactive prompts in CLI (Default: True) | boolean | N |
| `always_recreate_environment` | Always recreates the target environment from the environment specified in `create_from` (by default `prod`) (Default: False) | boolean | N |

## Run

Configuration for the `sqlmesh run` command. Please note that this is only applicable when configured with the [builtin](#builtin) scheduler.