Summary
The Papers Library Importer allows researchers to bulk import documents into an Extralit workspace by uploading a .bib file and a folder of PDFs. It provides:
- Analysis Phase: Parse bibliographic entries and match PDFs to references; classify as add/update/skip/failed.
- Preview Phase: Display a reviewable table of import candidates with metadata and status, and allow action overrides.
- Execution Phase: Perform asynchronous bulk upsert (add or update documents), upload PDFs to S3, and track progress via real-time job updates.
- Integration: Persist structured bibliographic metadata, tag imports by collection and source, and surface imported documents alongside existing ones.
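The add/update/skip/failed classification from the Analysis Phase could be sketched as follows. This is only an illustration under stated assumptions: the `BibEntry` fields, the `classify_entry` helper, and the decision rules (missing key or title means failed, unknown key means add, identical metadata means skip) are hypothetical and not taken from the Extralit codebase.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class BibEntry:
    """Hypothetical, minimal view of a parsed .bib entry."""
    reference_key: str
    title: str
    year: Optional[int] = None
    doi: Optional[str] = None


def classify_entry(entry: BibEntry, existing: Dict[str, BibEntry]) -> str:
    """Return 'add', 'update', 'skip', or 'failed' for one entry.

    `existing` maps reference keys to entries already in the workspace.
    The rules below are assumptions chosen for illustration.
    """
    # Assumption: an entry without a reference key or title cannot be imported.
    if not entry.reference_key or not entry.title:
        return "failed"
    current = existing.get(entry.reference_key)
    if current is None:
        return "add"      # not in the workspace yet
    if current == entry:
        return "skip"     # identical metadata, nothing to change
    return "update"       # same key, different metadata


workspace = {"smith2020": BibEntry("smith2020", "A Study", 2020)}
statuses = {
    "new": classify_entry(BibEntry("jones2019", "Another Study", 2019), workspace),
    "same": classify_entry(BibEntry("smith2020", "A Study", 2020), workspace),
    "changed": classify_entry(BibEntry("smith2020", "A Study", 2020, doi="10.1/x"), workspace),
    "broken": classify_entry(BibEntry("", "Untitled"), workspace),
}
```

A user override in the Preview Phase would then simply replace the computed status for a given reference key before execution.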
Key Requirements
- Metadata Extraction & Matching
  - Parse .bib entries (title, authors, year, DOI/PMID, reference key).
  - Match PDF filenames to entries (exact, partial, or fuzzy match).
  - Tag unmatched or failed PDFs.
- Preview & Confirmation
  - Show import preview table with reference key, title, authors, year, files, and status.
  - Allow users to override add/update/skip on a per-document basis.
- Bulk Import Execution
  - Asynchronously upsert documents and upload files in batches (20–50 per request).
  - Track progress and report summary counts (added, updated, skipped, failed).
  - Continue processing on individual errors; report specific failures.
- Workspace Integration
  - Store bibliographic metadata in Document records (reference key, DOI, PMID).
  - Add collections and source: "bib_import" tags to imported documents.
  - Ensure imported documents appear with existing documents for extraction workflows.
- Error Handling & Security
  - Provide clear errors for malformed BibTeX, unreadable PDFs, upload failures, duplicates, and quota issues.
  - Validate file types/sizes, sanitize BibTeX input, and integrate virus scanning.
  - Support retry and resume for interrupted imports.
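The filename matching requirement above (exact, partial, or fuzzy) might look roughly like the sketch below. It uses only the standard library (`difflib.SequenceMatcher`); the `match_pdf` function, the matching order, and the 0.8 similarity threshold are assumptions for illustration, not the importer's actual algorithm.

```python
from difflib import SequenceMatcher
from pathlib import PurePath
from typing import Iterable, Optional, Tuple


def match_pdf(
    filename: str,
    reference_keys: Iterable[str],
    fuzzy_threshold: float = 0.8,  # assumed cut-off, tune against real data
) -> Tuple[Optional[str], str]:
    """Match a PDF filename to a reference key.

    Returns (key, kind) where kind is 'exact', 'partial', 'fuzzy',
    or (None, 'unmatched') when nothing is close enough.
    """
    stem = PurePath(filename).stem.lower()
    keys = [k for k in reference_keys if k]  # ignore empty keys

    # 1. Exact: the filename without extension equals the key.
    for key in keys:
        if key.lower() == stem:
            return key, "exact"

    # 2. Partial: the key appears somewhere inside the filename.
    for key in keys:
        if key.lower() in stem:
            return key, "partial"

    # 3. Fuzzy: take the most similar key if it clears the threshold.
    best_key, best_ratio = None, 0.0
    for key in keys:
        ratio = SequenceMatcher(None, key.lower(), stem).ratio()
        if ratio > best_ratio:
            best_key, best_ratio = key, ratio
    if best_key is not None and best_ratio >= fuzzy_threshold:
        return best_key, "fuzzy"

    return None, "unmatched"


keys = ["smith2020", "jones2019"]
exact = match_pdf("smith2020.pdf", keys)
partial = match_pdf("smith2020_supplement.pdf", keys)
unmatched = match_pdf("unrelated_notes.pdf", keys)
```

PDFs that come back as `unmatched` are the ones the requirements say should be tagged so the user can resolve them in the preview.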
Acceptance Criteria
- POST /imports/analyze returns status for each entry.
- Preview UI displays editable import actions.
- POST /documents/bulk enqueues upload jobs and returns job IDs.
- Progress UI tracks jobs in real time and handles errors.
- Imported documents carry correct metadata and tags.
- Comprehensive unit/integration tests for all phases.
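To make the execution criteria concrete, here is a minimal synchronous sketch of batching with per-document error isolation and summary counts. The names `run_bulk_import`, `upsert_one`, and `on_batch_done` are hypothetical; a real implementation would run as an asynchronous job and upload files to S3, which this sketch deliberately leaves out.

```python
from typing import Callable, Dict, Iterator, List, Optional, Sequence


def chunked(items: Sequence[dict], size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_bulk_import(
    documents: Sequence[dict],
    upsert_one: Callable[[dict], str],  # assumed to return 'added', 'updated', or 'skipped'
    batch_size: int = 20,               # the spec suggests 20-50 per request
    on_batch_done: Optional[Callable[[int, int], None]] = None,
) -> Dict[str, object]:
    """Process documents in batches, continuing past individual failures."""
    counts = {"added": 0, "updated": 0, "skipped": 0, "failed": 0}
    failures: List[dict] = []
    processed = 0

    for batch in chunked(documents, batch_size):
        for doc in batch:
            try:
                outcome = upsert_one(doc)
                counts[outcome] = counts.get(outcome, 0) + 1
            except Exception as exc:
                # One bad document must not abort the whole import.
                counts["failed"] += 1
                failures.append(
                    {"reference_key": doc.get("reference_key"), "error": str(exc)}
                )
        processed += len(batch)
        if on_batch_done is not None:
            on_batch_done(processed, len(documents))  # progress hook per batch

    return {"counts": counts, "failures": failures}


def fake_upsert(doc: dict) -> str:
    """Stand-in for the real upsert, used only to exercise the sketch."""
    index = int(doc["reference_key"][1:])
    if index == 3:
        raise ValueError("unreadable PDF")
    return "updated" if index % 2 == 0 else "added"


progress: List[tuple] = []
docs = [{"reference_key": f"k{i}"} for i in range(45)]
result = run_bulk_import(docs, fake_upsert, on_batch_done=lambda d, t: progress.append((d, t)))
```

Reporting progress once per batch rather than per document keeps the number of real-time updates bounded, at the cost of coarser granularity.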
Files & Components
- Backend: handlers (imports.py, documents.py), contexts (imports.py), jobs (document_jobs.py), ImportHistory model.
- Frontend: Vue components ImportUpload.vue, ImportPreview.vue, ImportProgress.vue, ImportResults.vue, and the page at pages/workspace/_id/import.vue.
- Schemas: Pydantic models for ImportAnalysisRequest/Response, BulkUploadMetadata, and ImportHistory DB schema.