@morningman
Contributor

bp #33610

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

…nal catalog (apache#33610)

1. **Master FE** uniformly retrieves table information and generates the corresponding `id -> name` mapping.
2. The `id -> name` mapping is stored in Doris's metadata and persisted.
3. **Master FE** synchronizes this information with other FEs via **EditLog**.
4. To update the table information, a `refresh` command must be executed, or the metadata must be synchronized through an **HMS event**.

- **Advantage**: All FEs can see a consistent list of tables as the information is uniformly obtained from the Master FE, preventing discrepancies in table visibility across different FEs.
- **Disadvantage**: Table changes are not picked up promptly. For example, a new table created on the Hive side is not immediately visible on the Doris side; it becomes visible only after a manual refresh or a periodic metadata refresh.
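
The master-driven flow described above can be pictured with a small sketch. Everything here is hypothetical and is not Doris's actual code: an in-memory list stands in for the EditLog, and a simple counter stands in for the master's ID allocator. It only illustrates why every FE sees the same list, and why a table created in Hive stays invisible until the next refresh.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of the master-driven flow; not Doris's actual classes.
final class MasterSyncSketch {
    private long nextId = 1;                                   // master-side id allocator (assumption)
    private final Map<String, Long> nameToId = new HashMap<>();
    private final List<Map<Long, String>> editLog = new ArrayList<>(); // stand-in for the EditLog

    /** Steps 1-3: the master lists tables, builds id -> name, and journals the mapping. */
    Map<Long, String> refresh(List<String> tablesFromHms) {
        Map<Long, String> idToName = new TreeMap<>();
        for (String name : tablesFromHms) {
            // Reuse the id of a known table, allocate a new one otherwise.
            long id = nameToId.computeIfAbsent(name, n -> nextId++);
            idToName.put(id, name);
        }
        nameToId.keySet().retainAll(tablesFromHms);            // forget tables that no longer exist
        editLog.add(idToName);
        return idToName;
    }

    /** What any follower FE sees: the last journaled mapping, never anything newer. */
    Map<Long, String> followerView() {
        if (editLog.isEmpty()) {
            return Collections.emptyMap();
        }
        return editLog.get(editLog.size() - 1);
    }
}
```

In this model, a table added on the Hive side appears in `followerView()` only after the master calls `refresh(...)` again, which mirrors step 4.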

- **Catalog** adds a new property `use_meta_cache`. Default is `false`. If set to `true`, it will use an independent caching method to synchronize table information.

- Once enabled, table information will no longer be uniformly obtained by Master FE but will instead be independently fetched by each FE.
- Each FE has its own cache of the Database and Table list, implemented using the **Caffeine library**.
- This cache loads table information synchronously when it is accessed. If an entry is not in the cache, it is fetched directly from HMS.
- **Behaviors**:
  - Different FEs may see different table information due to different loading times, but they will eventually be consistent.
  - New tables created on the Hive side can be queried directly in Doris, but may not be visible in `show databases` or `show tables`.
  - Tables deleted on the Hive side will still appear in `show databases/tables` but will be inaccessible.
  - All caches will refresh at most every 10 minutes.
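
As a rough illustration of the per-FE cache, here is a minimal load-on-miss cache with a time-based expiry. It uses only the Java standard library as a stand-in for Caffeine, so the class name, the injectable clock, and the choice to model the 10-minute rule as expiry since load are all assumptions, not the PR's implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Stdlib stand-in for a Caffeine LoadingCache; names are hypothetical.
final class ExpiringLoadingCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long loadedAtMillis;

        Entry(V value, long loadedAtMillis) {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;   // e.g. a call to HMS (assumption)
    private final long ttlMillis;          // e.g. 10 minutes
    private final LongSupplier clock;      // injectable so tests can control time

    ExpiringLoadingCache(Function<K, V> loader, long ttlMillis, LongSupplier clock) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached value, loading it synchronously on a miss or once it has expired. */
    V get(K key) {
        final long now = clock.getAsLong();
        Entry<V> entry = entries.compute(key, (k, old) -> {
            if (old == null || now - old.loadedAtMillis >= ttlMillis) {
                return new Entry<V>(loader.apply(k), now);
            }
            return old;
        });
        return entry.value;
    }
}
```

Because each FE would hold its own instance and load entries at different moments, two FEs can briefly return different answers for the same key, which matches the eventual-consistency behavior listed above.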

- **Compatibility**:
  - For an existing catalog, `use_meta_cache` is `false` after upgrade.
  - For a newly created catalog, `use_meta_cache` defaults to `false` if it is not set.
  - `use_meta_cache` cannot be modified after the catalog is created.
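
The compatibility rules can be sketched as two small checks. Only the property name `use_meta_cache` comes from this PR; the helper class and method names below are hypothetical.

```java
import java.util.Map;

// Hypothetical helper; only the property name is taken from the PR description.
final class UseMetaCacheProperty {
    static final String KEY = "use_meta_cache";

    /** A missing property, on an upgraded or a new catalog, resolves to false. */
    static boolean resolve(Map<String, String> catalogProps) {
        return Boolean.parseBoolean(catalogProps.getOrDefault(KEY, "false"));
    }

    /** Rejects any property change that touches use_meta_cache on an existing catalog. */
    static void checkAlter(Map<String, String> changedProps) {
        if (changedProps.containsKey(KEY)) {
            throw new IllegalArgumentException(
                    KEY + " cannot be modified after the catalog is created");
        }
    }
}
```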

- **MetaCache**:
  - A general Cache class responsible for caching Database/Table information, including two LoadingCaches for storing "name lists" and "name-to-object" caches.
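
A simplified picture of the two-cache layout, assuming plain in-memory structures in place of the two LoadingCaches and leaving expiry out. The class and method names are hypothetical and do not reflect the PR's actual signatures. The sketch shows why a table created on the Hive side can be reachable by name while still missing from the cached name list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.Supplier;

// Illustrative only: simple fields stand in for the two Caffeine LoadingCaches.
final class MetaCacheSketch<T> {
    private volatile List<String> cachedNames;                        // "name list" cache
    private final Map<String, Optional<T>> objects = new ConcurrentHashMap<>(); // "name-to-object" cache
    private final Supplier<List<String>> namesLoader;                 // e.g. list names from HMS (assumption)
    private final Function<String, Optional<T>> objectLoader;         // e.g. fetch one object from HMS (assumption)

    MetaCacheSketch(Supplier<List<String>> namesLoader,
                    Function<String, Optional<T>> objectLoader) {
        this.namesLoader = namesLoader;
        this.objectLoader = objectLoader;
    }

    /** Serves listing requests from the cached name list, loading it once on first use. */
    List<String> listNames() {
        List<String> names = cachedNames;
        if (names == null) {
            names = new ArrayList<>(namesLoader.get());               // snapshot of the source
            cachedNames = names;
        }
        return names;
    }

    /** Serves direct access by name, loading from the source on a miss. */
    Optional<T> getByName(String name) {
        return objects.computeIfAbsent(name, objectLoader);
    }

    /** Drops both caches so the next access reloads from the source. */
    void invalidateAll() {
        cachedNames = null;
        objects.clear();
    }
}
```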

- **ID Generation Rules**:
  - As table information is no longer uniformly fetched by the Master FE, a consistent rule must exist to ensure that each FE generates the same ID for the same table. Here, we use the absolute value of the top 8 bits of the SHA-256 hash of the table name as the object's ID. This way, the same name always generates the same ID. The trade-off is that the ID is no longer globally unique; it is unique only within a Catalog or Database.
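
A minimal sketch of name-based ID generation. The description above says "top 8 bits"; this example assumes the leading 8 bytes (64 bits) of the digest are read as a signed `long`, which is an interpretation rather than something the text confirms. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: derives a deterministic id from a name on any FE.
final class NameBasedId {
    static long genIdByName(String name) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(name.getBytes(StandardCharsets.UTF_8));
            // Assumption: take the first 8 bytes of the digest as a big-endian long.
            long leading = ByteBuffer.wrap(hash).getLong();
            // Math.abs leaves Long.MIN_VALUE negative; that single value is ignored here.
            return Math.abs(leading);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on standard Java platforms.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

Because the ID depends only on the name, two FEs that never communicate still compute the same value for the same table, while two different names collide only if the leading digest bytes agree.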

- Remove some unused methods such as `getIdToTable()` and `getIdToDb()`

I have run the p0 tests for both `use_meta_cache = true` and `use_meta_cache = false`.
@morningman morningman merged commit 3ae3f9d into apache:branch-2.1 May 9, 2024