Skip to content

Chore: Cache results of get_data_objects#5467

Merged
izeigerman merged 3 commits intomainfrom
chore-cache-data-objects
Oct 2, 2025
Merged

Chore: Cache results of get_data_objects#5467
izeigerman merged 3 commits intomainfrom
chore-cache-data-objects

Conversation

@izeigerman
Copy link
Collaborator

This reduces the number of INFORMATION_SCHEMA queries from 28 down to 6 on the fresh sushi project. I anticipate equally dramatic improvements in other scenarios.

@izeigerman izeigerman requested review from a team and Copilot September 30, 2025 22:17
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements caching for the get_data_objects method to reduce INFORMATION_SCHEMA queries from 28 to 6 in the fresh sushi project. The optimization adds an in-memory cache that stores data object lookups and intelligently invalidates entries when DDL operations modify the database schema.

  • Adds _data_object_cache dictionary to track data object existence/non-existence
  • Implements cache invalidation on create/drop operations for tables, views, and other data objects
  • Refactors snapshot evaluator methods to distinguish between physical and virtual data objects

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
sqlmesh/core/engine_adapter/base.py Adds caching logic to get_data_objects, cache invalidation on DDL operations, and helper functions
sqlmesh/core/snapshot/evaluator.py Refactors data object retrieval methods and adds virtual data object pre-fetching for promotion
tests/core/engine_adapter/test_base.py Adds comprehensive test coverage for caching functionality and invalidation scenarios
tests/core/test_snapshot_evaluator.py Updates test mocks to accommodate new caching behavior
tests/core/engine_adapter/test_snowflake.py Fixes test parameter format from identifier objects to string literals
tests/core/engine_adapter/test_athena.py Adds cache clearing call to maintain test isolation

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines +361 to +362
adapter.drop_managed_table(table_name="foo.bar", exists=False)
adapter.drop_managed_table(table_name="foo.bar", exists=True)
Copy link

Copilot AI Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change from exp.parse_identifier("foo") to "foo.bar" string literal appears to be fixing a test issue, but this changes the test from testing a single identifier to a qualified name. Verify this change aligns with the intended test behavior and that the corresponding SQL output changes are expected.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the API was never meant to take identifiers in its argument

@izeigerman izeigerman force-pushed the chore-cache-data-objects branch from fca603b to ea82df0 Compare September 30, 2025 22:21
@eakmanrq
Copy link
Collaborator

eakmanrq commented Sep 30, 2025

@izeigerman how did you determine where to add the clears? For example I think replace query needs it since it checks for table existence. Unless I'm not reading the implementation correctly, you record the lack of existence in cache and therefore the cache would need to be updated after replace query runs if the table did not exist.

Edit: I'm actually not sure why it doesn't use def table_exists but the point still stands generally.

@izeigerman
Copy link
Collaborator Author

izeigerman commented Oct 1, 2025

how did you determine where to add the clears? For example I think replace query needs it since it checks for table existence. Unless I'm not reading the implementation correctly, you record the lack of existence in cache and therefore the cache would need to be updated after replace query runs if the table did not exist.

Wdym? We clear the cache every time we create a table, including during replace query.

Edit: I'm actually not sure why it doesn't use def table_exists but the point still stands generally.

yeah, using the cache for table_exists is a good point.

@izeigerman izeigerman force-pushed the chore-cache-data-objects branch from b91a30a to c7f5771 Compare October 1, 2025 00:38
@eakmanrq
Copy link
Collaborator

eakmanrq commented Oct 1, 2025

Wdym? We clear the cache every time we create a table, including during replace query.

Yes I see now. The internal create call clears the cache which is why it is handled.

Did you verify that no engine overrides a create or drop call that also needs to update the cache?

Also I noticed that public engine adapter methods that accept arbitrary SQL (fetch*, execute) are assumed to not contain create/drop calls. Should we make them pessimistic and clear cache? Could potentially clear the cache is they contain create or drop expressions?

@izeigerman
Copy link
Collaborator Author

Did you verify that no engine overrides a create or drop call that also needs to update the cache?

I did. Do you happen to see places I missed?

Also I noticed that public engine adapter methods that accept arbitrary SQL (fetch*, execute) are assumed to not contain create/drop calls. Should we make them pessimistic and clear cache? Could potentially clear the cache is they contain create or drop expressions?

Let me think about this

@eakmanrq
Copy link
Collaborator

eakmanrq commented Oct 2, 2025

Unless I am missing something, I think theses need reviewed:

https://github.com/TobikoData/sqlmesh/blob/e00e86064995841f097b62c9d32b6e99ed2a0aba/sqlmesh/core/engine_adapter/clickhouse.py#L458-L464
https://github.com/TobikoData/sqlmesh/blob/e00e86064995841f097b62c9d32b6e99ed2a0aba/sqlmesh/core/engine_adapter/clickhouse.py#L584-L592
https://github.com/TobikoData/sqlmesh/blob/e00e86064995841f097b62c9d32b6e99ed2a0aba/sqlmesh/core/engine_adapter/clickhouse.py#L617-L645
https://github.com/TobikoData/sqlmesh/blob/df7d61be3a8d5e17c75d1691ca1cff1a438054ec/sqlmesh/core/engine_adapter/databricks.py#L254-L295
https://github.com/TobikoData/sqlmesh/blob/ff9305cc2f2fef2ffa59888689c193f97b554370/sqlmesh/core/engine_adapter/duckdb.py#L99-L146
https://github.com/TobikoData/sqlmesh/blob/c7f5771cf8b9e2323bc4e67ebfd848cd26674864/sqlmesh/core/engine_adapter/mssql.py#L369-L401
https://github.com/TobikoData/sqlmesh/blob/c7f5771cf8b9e2323bc4e67ebfd848cd26674864/sqlmesh/core/engine_adapter/mssql.py#L407-L414
https://github.com/TobikoData/sqlmesh/blob/ff9305cc2f2fef2ffa59888689c193f97b554370/sqlmesh/core/engine_adapter/mysql.py#L81-L116
https://github.com/TobikoData/sqlmesh/blob/ff9305cc2f2fef2ffa59888689c193f97b554370/sqlmesh/core/engine_adapter/mysql.py#L167-L187
https://github.com/TobikoData/sqlmesh/blob/c7f5771cf8b9e2323bc4e67ebfd848cd26674864/sqlmesh/core/engine_adapter/postgres.py#L82-L103
https://github.com/TobikoData/sqlmesh/blob/454e942fedd7fa768e10024515b1f1b134acb3d9/sqlmesh/core/engine_adapter/redshift.py#L248-L306
https://github.com/TobikoData/sqlmesh/blob/454e942fedd7fa768e10024515b1f1b134acb3d9/sqlmesh/core/engine_adapter/redshift.py#L308-L359
https://github.com/TobikoData/sqlmesh/blob/b29e71bc1a2c82cebd362d497af8ba0236f5e3b1/sqlmesh/core/engine_adapter/snowflake.py#L472-L534
https://github.com/TobikoData/sqlmesh/blob/d41c3e0e3a6c010babf75a9b46795834a0c56730/sqlmesh/core/engine_adapter/spark.py#L354-L390
https://github.com/TobikoData/sqlmesh/blob/e949176a6bd907bcf6f0e054968e6c5c822f9a98/sqlmesh/core/engine_adapter/trino.py#L187-L250

Copy link
Contributor

@georgesittas georgesittas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting PR, nice work.

@izeigerman izeigerman force-pushed the chore-cache-data-objects branch from d0adc72 to c0d9022 Compare October 2, 2025 17:44
@izeigerman izeigerman force-pushed the chore-cache-data-objects branch from c0d9022 to 032cd97 Compare October 2, 2025 18:29
@izeigerman izeigerman merged commit 42fbc64 into main Oct 2, 2025
45 of 46 checks passed
@izeigerman izeigerman deleted the chore-cache-data-objects branch October 2, 2025 19:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants