USE 306 - Incremental table read error handling#182
Merged
Conversation
Why these changes are being introduced: There are a couple of primary ways in which a read method can fail based on the table requested: 1. The table name is invalid and will never work. 2. The table name is valid, but does not yet exist in the DuckDB context. For the first, we want to raise an error immediately. For the second, there is a bit more nuance depending on the table requested.

How this addresses that need: For TIMDEXDataset.read_* methods, we operate from the assumption that `records` and `current_records` should always be available, and so we raise an error indicating a metadata rebuild is required. But for TIMDEXEmbeddings, and potentially other data sources as added, we may legitimately not have data yet. As such, we'll want to log warnings and suggest a refresh, but just return an empty set.

Side effects of this change: * Applications like TIM, which now attempt embeddings reading for its `reindex-source` CLI command, will no longer encounter an error if embeddings don't yet exist. * In the rare edge case of a brand new dataset, we have better error raising and logging.

Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/USE-306
ghukill
commented
Dec 22, 2025
Comment on lines -305 to -307
```python
# NOTE: at time of test creation, this manual reload is required
td = TIMDEXDataset(td.location)
```
Contributor
Author
This and other similar lines: based on a previous refactor (missed at the time), we can lean on the improved `.refresh()` method!
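To illustrate the two patterns, here is a minimal sketch using a hypothetical stand-in class (the real `TIMDEXDataset` lives in the TDA project and carries much more state than shown):

```python
# Hypothetical stand-in for TIMDEXDataset, only to contrast the two patterns;
# the real class is in the TDA project.
class TIMDEXDataset:
    def __init__(self, location: str):
        self.location = location
        self.loads = 1  # count how often metadata has been (re)loaded

    def refresh(self) -> None:
        """Reload metadata in place, instead of re-instantiating."""
        self.loads += 1


td = TIMDEXDataset("/path/to/dataset")

# Old pattern from the test: manual reload via re-instantiation.
td = TIMDEXDataset(td.location)

# New pattern: lean on the improved .refresh() method.
td.refresh()
```

The in-place refresh keeps any other references to the dataset object valid, which re-instantiation does not.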
Purpose and background context
Primary git commit:
Why these changes are being introduced:
There are a couple of primary ways in which a read method can fail based on the table requested:

1. The table name is invalid and will never work.
2. The table name is valid, but does not yet exist in the DuckDB context.

For the first, we want to raise an error immediately. For the second, there is a bit more nuance depending on the table requested.
How this addresses that need:
For `TIMDEXDataset.read_*` methods, we operate from the assumption that `records` and `current_records` should always be available, and so we raise an error indicating a metadata rebuild is required.

But for `TIMDEXEmbeddings.read_*` methods, and potentially other data sources as added, we may legitimately not have data yet. As such, we'll want to log warnings and suggest a refresh, but just return an empty set.

Additional commentary:
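The two policies can be sketched as follows. This is illustrative only, not the actual TDA implementation: the registry sets, function name, and return values are all hypothetical.

```python
import warnings

# Hypothetical table registries -- the real project derives table availability
# from the DuckDB context; the names here are illustrative only.
REQUIRED_TABLES = {"records", "current_records"}  # must exist after metadata build
OPTIONAL_TABLES = {"embeddings"}                  # may legitimately not exist yet


def read_table(existing_tables: set, table: str) -> list:
    """Return rows for `table`, applying the two failure policies."""
    if table not in REQUIRED_TABLES | OPTIONAL_TABLES:
        # Case 1: invalid table name -- will never work, fail fast.
        raise ValueError(f"Unknown table: {table!r}")
    if table not in existing_tables:
        if table in REQUIRED_TABLES:
            # Case 2a: required table missing -- a metadata rebuild is needed.
            raise RuntimeError(
                f"Table {table!r} not found; a metadata rebuild is required."
            )
        # Case 2b: optional table missing -- warn, suggest refresh, return empty.
        warnings.warn(f"Table {table!r} not found; try a refresh.")
        return []
    return [f"row-from-{table}"]  # stand-in for an actual query result
```

The key distinction is in case 2: a missing required table is an error state, while a missing optional table is an expected, empty state.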
This is an incremental improvement to a problem that runs a little deeper.
The improvement is more explicit errors and logging right now. For example, TIM can now attempt to read embeddings for a source during the CLI command `reindex-source` without error, getting just an empty set if they don't yet exist. We also raise meaningful errors for a brand new dataset, where, after the first records write, we need to build the metadata a first (and only) time.

Both of these edge cases point to a slightly deeper need: the creation of DuckDB schemas and tables explicitly, from clearly visible code in TDA, that does not require actual data to exist yet. If we do this, then read methods will naturally just return nothing if nothing exists, because the schemas/tables will always exist. This also serves the function of self-documenting what schemas/tables make up the dataset.
Currently, we rely on creating schemas/tables from pre-existing parquet data. This is often a very fine and good approach, and has served us well, but the dataset structure and patterns have matured enough, we could probably define these tables more explicitly in code solving these edge cases and better documenting the dataset structure.
How can a reviewer manually see the effects of these changes?
1. Start an ipython shell
2. Prepare a brand new dataset:
3. Records error handling:
4. Embeddings error handling:
Includes new or updated dependencies?
NO
Changes expectations for external applications?
YES:

* Applications like TIM, which now attempt embeddings reading for the `reindex-source` CLI command, will no longer encounter an error if embeddings don't yet exist.

What are the relevant tickets?

* https://mitlibraries.atlassian.net/browse/USE-306
Code review