[FIX] Import dataset mapping#140

Merged
JonnyTran merged 10 commits into develop from fix/import-dataset-mapping
Aug 29, 2025

@JonnyTran
Member

This pull request introduces a new, structured dataset mapping model to the Extralit codebase, replacing the previous ad-hoc dictionary-based mapping approach. The changes span both the backend and client libraries, updating API endpoints, models, and serialization logic to support the new DatasetMappingModel and DatasetMapping abstractions. Additionally, several database index names are corrected for consistency, and the API endpoints for dataset import operations are clarified and renamed.

Key changes:

Dataset Mapping Model Introduction and Integration

  • Added DatasetMappingModel and related classes to provide a structured, validated way to represent dataset mappings (extralit/src/extralit/_models/_settings/_mapping.py).
  • Introduced DatasetMapping abstraction in the settings layer, with methods for conversion between models and dictionaries, and updated all relevant code to use this new abstraction instead of raw dictionaries (extralit/src/extralit/settings/_mapping.py, extralit/src/extralit/settings/_resource.py).
  • Updated DatasetModel to include a mapping field and ensured proper serialization and deserialization throughout the stack (extralit/src/extralit/_models/_dataset.py).
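
To illustrate the shift from raw dictionaries to a validated mapping object, here is a minimal sketch. It is not the code from this PR: the class name `MappingSketch`, the `fields` and `metadata` sections, and the validation rules are all hypothetical, and stdlib dataclasses stand in for the Pydantic models so the example has no dependencies.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MappingSketch:
    """Hypothetical stand-in for a structured dataset mapping."""

    # source column name -> target name in the dataset (assumed layout)
    fields: Dict[str, str] = field(default_factory=dict)
    metadata: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject empty or non-string names at construction time, which a
        # raw dictionary would silently accept.
        for section in (self.fields, self.metadata):
            for source, target in section.items():
                valid = isinstance(source, str) and isinstance(target, str)
                if not valid or not source or not target:
                    raise ValueError(f"invalid mapping entry: {source!r} -> {target!r}")

    @classmethod
    def from_dict(cls, raw: Dict[str, Dict[str, str]]) -> "MappingSketch":
        # Unknown top-level sections are treated as errors, not ignored.
        unknown = set(raw) - {"fields", "metadata"}
        if unknown:
            raise ValueError(f"unknown mapping sections: {sorted(unknown)}")
        return cls(
            fields=dict(raw.get("fields", {})),
            metadata=dict(raw.get("metadata", {})),
        )

    def to_dict(self) -> Dict[str, Dict[str, str]]:
        # Return copies so callers cannot mutate the mapping through the dict.
        return {"fields": dict(self.fields), "metadata": dict(self.metadata)}
```

A round trip such as `MappingSketch.from_dict(raw).to_dict() == raw` holds for well-formed input, while a misspelled section name fails immediately instead of surfacing later during import.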

API and Schema Adjustments

  • Updated Pydantic schemas and getter logic to expose the new mapping field and validate it using the new model (extralit-server/src/extralit_server/api/schemas/v1/datasets.py).
  • Changed API endpoints for dataset import operations to be more explicit: /import-hub for HuggingFace imports and /import for import-history-based imports (extralit-server/src/extralit_server/api/handlers/v1/datasets/datasets.py, extralit-frontend/v1/infrastructure/repositories/DatasetRepository.ts).
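
The endpoint split can be pictured with a small path builder. Only the two suffixes, /import-hub and /import, come from this PR; the `/api/v1/datasets` prefix, the `ImportSource` enum, and the `import_path` helper are assumptions made for the sketch.

```python
from enum import Enum
from uuid import UUID


class ImportSource(Enum):
    """Hypothetical enum naming the two import flows."""

    HUB = "hub"          # HuggingFace import
    HISTORY = "history"  # import-history-based import


# Suffixes as named in the PR description.
_SUFFIX = {ImportSource.HUB: "import-hub", ImportSource.HISTORY: "import"}


def import_path(
    dataset_id: UUID,
    source: ImportSource,
    prefix: str = "/api/v1/datasets",  # assumed prefix, not confirmed by the PR
) -> str:
    """Build the request path for the given import flow."""
    return f"{prefix}/{dataset_id}/{_SUFFIX[source]}"
```

Keeping the two flows on distinct suffixes means a client cannot trigger a hub import by accident when it intends an import-history-based one.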

Database Index Naming Consistency

  • Renamed several database indexes for the imports table to use a consistent naming convention, both in Alembic migrations and in the SQLAlchemy model (extralit-server/src/extralit_server/alembic/versions/7d6b33203390_create_import_history_table.py, .kiro/specs/papers-library-importer/design.md).
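
This description does not spell out the convention itself. As a sketch of what a single source of truth for index names could look like, the helper below assumes an `ix_<table>_<column>` pattern, which is common in SQLAlchemy projects; the `index_name` function and the column names used with it are hypothetical.

```python
def index_name(table: str, *columns: str) -> str:
    """Derive an index name from a table and its indexed columns.

    Assumed pattern: ix_<table>_<column>[_<column>...]. Using one helper in
    both the migration and the model keeps the two from drifting apart.
    """
    if not table or not columns or not all(columns):
        raise ValueError("a table name and at least one column name are required")
    return "ix_" + "_".join((table, *columns))
```

For example, `index_name("imports", "workspace_id")` yields `ix_imports_workspace_id`, and the same call can be reused wherever the index is declared or dropped.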

Miscellaneous Model Improvements

  • Added a serializer for workspace_id in DocumentModel to ensure proper string conversion (extralit/src/extralit/_models/_document.py).
  • Minor import and code hygiene improvements to support the new mapping model (extralit/src/extralit/_models/_document.py, extralit/src/extralit/settings/_resource.py).
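
The workspace_id serializer addresses a common problem: a UUID value is not JSON-serializable until it is converted to a string. The sketch below shows the idea with a stdlib dataclass; `DocumentSketch`, its `file_name` field, and `to_payload` are hypothetical names, and the actual DocumentModel presumably uses Pydantic's own serializer hooks.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional
from uuid import UUID


@dataclass
class DocumentSketch:
    """Hypothetical stand-in for a document model carrying a workspace UUID."""

    file_name: str
    workspace_id: Optional[UUID] = None

    def to_payload(self) -> Dict[str, Any]:
        # Convert the UUID to its canonical string form so the payload can be
        # passed to json.dumps; None is preserved as-is.
        workspace = None if self.workspace_id is None else str(self.workspace_id)
        return {"file_name": self.file_name, "workspace_id": workspace}
```

Without the conversion, passing the raw UUID to `json.dumps` raises a `TypeError`.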

These changes collectively modernize and standardize how dataset mappings are handled across Extralit, improving validation, maintainability, and clarity throughout the codebase.

@JonnyTran JonnyTran self-assigned this Aug 26, 2025
* added GET "/datasets/compatible"

* Add GetImportCompatibleDatasets use case and integrate into dataset configuration

* Enhance dataset creation workflow with update functionality and new dialog.

- Added DatasetUpdateDialog component for updating datasets, integrated data source selection, and improved dataset configuration forms.
- Updated translations for button labels and added validation for compatible datasets.

* latest

* Implement dataset update functionality and improve error handling.

- Introduced UpdateDatasetUseCase for handling dataset updates.
- Enhanced DatasetConfigurationForm and DatasetUpdateDialog to support source and target dataset selection.
- Added error handling and validation for dataset updates in the relevant components.
- Updated useDatasetConfigurationForm to include the new update method.

* refactor

* Refactor dataset creation components and introduce DatasetCreateDialog.

- Renamed DatasetConfigurationDialog to DatasetCreateDialog for clarity.
- Added new DatasetCreateDialog component to handle dataset creation with improved UI and validation.
- Updated useDatasetConfigurationNameAndWorkspace to remove unused imports.

* refactoring

* Enhance error handling in AxiosErrorHandler and DocumentRepository.

- Prioritize specific error messages in AxiosErrorHandler based on business logic, detailed messages, and generic HTTP status messages.
- Update DocumentRepository to include a new error constant for listing documents and adjust error handling accordingly.
- Modify error detail in documents.py to provide more specific feedback when no documents are found.

* Refactor dataset configuration components to support TypeScript.

- Updated DatasetConfigurationForm, DatasetConfigurationMetadataSelector, and DatasetCreateDialog to use TypeScript for improved type safety.
- Enhanced validator functions in DatasetConfigurationForm and DatasetCreateDialog to specify parameter types.

* fix extralit/unit tests

* fix tests

* fix tests

* latest

* fix tests

* fix DatasetMapping

* Refactor DatasetCreation and ImportHistoryDatasetBuilder to replace external_id with source_id and target_id.

- Updated DatasetCreation to use source_id and target_id for improved clarity in dataset mappings.
- Modified ImportHistoryDatasetBuilder to align with the new DatasetCreation structure, ensuring proper mapping of source_id and target_id.

* test fixes

* fix tests

* fix tests

* Revert "fix tests"

This reverts commit 3ecb423.

* fix tests

* fix tests

* Refactor document handling in WorkspacesAPI and update related tests

- Removed deprecated document methods from WorkspacesAPI.
- Updated document creation logic to directly use the Document class.
- Simplified the add_document method in the Workspace class.
- Cleaned up test cases related to document operations in WorkspacesAPI.
@JonnyTran JonnyTran marked this pull request as ready for review August 29, 2025 07:03
@JonnyTran JonnyTran requested review from a team as code owners August 29, 2025 07:03
- Added structured dataset mapping support with `DatasetMappingModel` and `DatasetMapping` abstractions.
- Introduced `mapping` field in `DatasetModel` for enhanced serialization/deserialization.
- Implemented incremental dataset import functionality in the frontend with `DatasetUpdateDialog`.
- Refactored backend endpoints for improved document fetching and dataset import processes.
- Fixed various issues related to document handling and import analysis display.
- Updated version numbers across all components to reflect the new release.
@JonnyTran JonnyTran merged commit c17aaef into develop Aug 29, 2025
8 checks passed