USE 306 - Incremental table read error handling#182
Merged
Conversation
Why these changes are being introduced: There are a couple of primary ways in which a read method can fail based on the table requested: 1. The table name is invalid and will never work. 2. The table name is valid, but does not yet exist in the DuckDB context. For the first, we want to raise an error immediately. For the second, there is a bit more nuance depending on the table requested.

How this addresses that need: For TIMDEXDataset.read_* methods, we operate from the assumption that `records` and `current_records` should always be available, and so we raise an error indicating a metadata rebuild is required. But for TIMDEXEmbeddings, and potentially other data sources as added, we may legitimately not have data yet. As such, we'll want to log warnings and suggest a refresh, but just return an empty set.

Side effects of this change: * Applications like TIM, which now attempt embeddings reading for its `reindex-source` CLI command, will no longer encounter an error if embeddings don't yet exist. * In the rare edge case of a brand new dataset, we have better error raising and logging.

Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/USE-306
ghukill
commented
Dec 22, 2025
Comment on lines -305 to -307
```python
# NOTE: at time of test creation, this manual reload is required
td = TIMDEXDataset(td.location)
```
Contributor
Author
This and other similar lines: based on a previous refactor (missed at the time), we can lean on the improved `.refresh()` method!
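To illustrate the two patterns, here is a minimal sketch using a hypothetical stand-in class (the real `TIMDEXDataset` lives in the TDA project and carries much more state than shown):

```python
# Hypothetical stand-in for TIMDEXDataset, only to contrast the two patterns;
# the real class is in the TDA project.
class TIMDEXDataset:
    def __init__(self, location: str):
        self.location = location
        self.loads = 1  # count how often metadata has been (re)loaded

    def refresh(self) -> None:
        """Reload metadata in place, instead of re-instantiating."""
        self.loads += 1


td = TIMDEXDataset("/path/to/dataset")

# Old pattern from the test: manual reload via re-instantiation.
td = TIMDEXDataset(td.location)

# New pattern: lean on the improved .refresh() method.
td.refresh()
```

The in-place refresh keeps any other references to the dataset object valid, which re-instantiation does not.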
Purpose and background context
Primary git commit:
Why these changes are being introduced:
There are a couple of primary ways in which a read method can fail based on the table requested:

1. The table name is invalid and will never work.
2. The table name is valid, but does not yet exist in the DuckDB context.

For the first, we want to raise an error immediately. For the second, there is a bit more nuance depending on the table requested.
How this addresses that need:
For `TIMDEXDataset.read_*` methods, we operate from the assumption that `records` and `current_records` should always be available, and so we raise an error indicating a metadata rebuild is required.

But for `TIMDEXEmbeddings.read_*` methods, and potentially other data sources as added, we may legitimately not have data yet. As such, we'll want to log warnings and suggest a refresh, but just return an empty set.

Additional commentary:
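The two policies can be sketched as follows. This is illustrative only, not the actual TDA implementation: the registry sets, function name, and return values are all hypothetical.

```python
import warnings

# Hypothetical table registries -- the real project derives table availability
# from the DuckDB context; the names here are illustrative only.
REQUIRED_TABLES = {"records", "current_records"}  # must exist after metadata build
OPTIONAL_TABLES = {"embeddings"}                  # may legitimately not exist yet


def read_table(existing_tables: set, table: str) -> list:
    """Return rows for `table`, applying the two failure policies."""
    if table not in REQUIRED_TABLES | OPTIONAL_TABLES:
        # Case 1: invalid table name -- will never work, fail fast.
        raise ValueError(f"Unknown table: {table!r}")
    if table not in existing_tables:
        if table in REQUIRED_TABLES:
            # Case 2a: required table missing -- a metadata rebuild is needed.
            raise RuntimeError(
                f"Table {table!r} not found; a metadata rebuild is required."
            )
        # Case 2b: optional table missing -- warn, suggest refresh, return empty.
        warnings.warn(f"Table {table!r} not found; try a refresh.")
        return []
    return [f"row-from-{table}"]  # stand-in for an actual query result
```

The key distinction is in case 2: a missing required table is an error state, while a missing optional table is an expected, empty state.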
This is an incremental improvement to a problem that runs a little deeper.
The improvement is more explicit errors and logging right now. For example, TIM can now attempt to read embeddings for a source during the CLI command `reindex-source` without error, getting just an empty set if they don't yet exist. We also raise meaningful errors for a brand new dataset, where, after the first records write, we need to build the metadata a first (and only) time.

Both of these edge cases point to a slightly deeper need: the creation of DuckDB schemas and tables explicitly, from clearly visible code in TDA, that does not require actual data to exist yet. If we do this, then read methods will naturally just return nothing if nothing exists, because the schemas/tables will always exist. This also serves the function of self-documenting what schemas/tables make up the dataset.
Currently, we rely on creating schemas/tables from pre-existing parquet data. This is often a very fine and good approach, and has served us well, but the dataset structure and patterns have matured enough, we could probably define these tables more explicitly in code solving these edge cases and better documenting the dataset structure.
How can a reviewer manually see the effects of these changes?
1. Start an ipython shell
2. Prepare a brand new dataset:
3. Records error handling:
4. Embeddings error handling:
Includes new or updated dependencies?
NO
Changes expectations for external applications?
YES:

* Applications like TIM, which now attempt embeddings reading for the `reindex-source` CLI command, will no longer encounter an error if embeddings don't yet exist.

What are the relevant tickets?

* https://mitlibraries.atlassian.net/browse/USE-306
Code review