feat: implement deposition upload workflow with semantics domain#62
Conversation
Add a new semantics domain (ontologies + schemas) and enhance the deposition domain with conventions, spreadsheet-based metadata entry, file management, and the validation return-to-draft pipeline.

New domains and aggregates:
- Ontology: immutable versioned term collections with hierarchy support
- Schema: typed field definitions referencing ontologies
- Convention: submission templates bundling schema + file requirements

Deposition enhancements:
- Full lifecycle: create → metadata → files → submit → return-to-draft
- Spreadsheet template generation with ontology-backed dropdowns
- Spreadsheet upload with structural validation
- File upload with SHA-256 checksums and convention enforcement
- Domain events for all deposition mutations

Infrastructure:
- OBO Graphs parser for ontology import from external sources
- Postgres repositories for ontology, schema, and convention
- Cross-domain reader adapters (SchemaReader, OntologyReader)
- LocalFileStorageAdapter with atomic writes
- OpenpyxlSpreadsheetAdapter for .xlsx generation and parsing
- REST routes: /ontologies, /schemas, /conventions, /depositions
- Database migration for new tables

Closes #30. 396 unit tests passing.
- Deposit wizard UI with 4-step flow (convention, template, metadata, files)
- Deposition detail page at /deposition/[srn] (client component for auth)
- SDK layer rewrite: separate http, auth, search, deposition modules
- Mock deposition SDK with tests for demo/development
- Auth service: fix session commit, configurable base_role
- Worker: remove explicit commits (UOW scope handles it)
- Vector index: skip records with no indexable content gracefully
- SRN deserialization fix for versioned resources
- Seed script for sample convention and schema
- Header navigation with Deposit link
@greptile
Greptile Summary

Large feature PR adding two new bounded contexts (Semantics and Deposition) with a full deposition upload workflow: ontology/schema/convention CRUD, deposition lifecycle (DRAFT → IN_VALIDATION → IN_REVIEW → ACCEPTED/REJECTED), spreadsheet template generation/parsing, file storage with checksums, and a frontend deposit wizard. Also includes an SDK rewrite splitting the monolithic OSAClient into separate http, auth, search, and deposition modules.
Confidence Score: 2/5
| Filename | Overview |
|---|---|
| server/osa/infrastructure/persistence/adapter/storage.py | Path traversal vulnerability: user-supplied filenames used directly in filesystem paths without sanitization in save_file, get_file, and delete_file. |
| server/migrations/versions/add_deposition_tables.py | Migration adds convention_srn as nullable=True but table definition and domain model require it as non-nullable. Creates ontologies, schemas, conventions tables correctly. |
| server/osa/domain/deposition/service/deposition.py | Core deposition service with clean lifecycle management (create, upload, submit), but lacks ownership verification — any DEPOSITOR can modify any deposition. |
| server/osa/application/api/v1/routes/depositions.py | Full CRUD routes for depositions with file upload/download. Content-Disposition header injection risk and full file buffering in memory for uploads. |
| server/osa/domain/deposition/model/aggregate.py | Well-structured aggregate with proper state guards (_require_draft), clear state transitions, and appropriate domain validation logic. |
| server/osa/infrastructure/persistence/adapter/spreadsheet.py | Openpyxl adapter for template generation and parsing. Only parses row 3 (single data row), which may be intentional for v1 but limits multi-row spreadsheets. |
| server/osa/infrastructure/persistence/di.py | Added auto-commit to session teardown, replacing manual commits throughout worker. Significant behavioral change that commits even for read-only operations. |
| server/osa/domain/shared/model/srn.py | Extended SRN system with OntologySRN, ConventionSRN types and added model_validator for string parsing and model_serializer for string output. Clean Pydantic integration. |
| server/osa/domain/auth/service/auth.py | Added base_role auto-assignment for new users. Clean implementation with proper logging, though verbose debug logging could be cleaned up before production. |
| web/src/lib/sdk/index.ts | SDK refactored from single OSAClient to namespace pattern (auth/search/deposition) with mock layer support. Clean separation of concerns. |
| web/src/components/deposit/DepositWizard.tsx | 4-step deposit wizard with convention selection, template download, metadata upload, and data file upload. No error handling shown to user on submission failure. |
| server/osa/infrastructure/persistence/tables.py | Added ontologies, ontology_terms, schemas, and conventions tables with proper indexes and constraints. convention_srn on depositions is marked non-nullable, which mismatches the migration (nullable=True). |
Sequence Diagram

```mermaid
sequenceDiagram
  participant U as User/Frontend
  participant API as FastAPI Routes
  participant CMD as Command Handlers
  participant SVC as DepositionService
  participant CONV as ConventionRepo
  participant REPO as DepositionRepo
  participant FS as FileStorage
  participant OB as Outbox
  participant W as Worker
  participant VH as ValidateDeposition
  U->>API: POST /depositions {convention_srn}
  API->>CMD: CreateDepositionHandler.run()
  CMD->>SVC: create(convention_srn, owner_id)
  SVC->>CONV: get(convention_srn)
  CONV-->>SVC: Convention
  SVC->>REPO: save(deposition)
  SVC->>OB: append(DepositionCreatedEvent)
  SVC-->>U: DepositionCreated {srn}
  U->>API: POST /depositions/{srn}/spreadsheet
  API->>CMD: UploadSpreadsheetHandler.run()
  CMD->>SVC: update_metadata(srn, parsed_data)
  SVC->>REPO: save(deposition)
  SVC-->>U: SpreadsheetUploaded
  U->>API: POST /depositions/{srn}/files
  API->>CMD: UploadFileHandler.run()
  CMD->>SVC: upload_file(srn, filename, content, size)
  SVC->>CONV: get(convention_srn) [validate file reqs]
  SVC->>FS: save_file(srn, filename, content, size)
  FS-->>SVC: DepositionFile
  SVC->>REPO: save(deposition)
  SVC->>OB: append(FileUploadedEvent)
  SVC-->>U: FileUploaded
  U->>API: POST /depositions/{srn}/submit
  API->>CMD: SubmitDepositionHandler.run()
  CMD->>SVC: submit(srn)
  SVC->>REPO: save(deposition [IN_VALIDATION])
  SVC->>OB: append(DepositionSubmittedEvent)
  SVC-->>U: DepositionSubmitted
  W->>OB: claim(DepositionSubmittedEvent)
  W->>VH: handle(event)
  VH->>OB: append(ValidationCompleted or ValidationFailed)
  Note over W,VH: On ValidationFailed:
  W->>OB: claim(ValidationFailed)
  W->>REPO: get(deposition)
  W->>REPO: save(deposition [DRAFT])
```
Last reviewed commit: d22a102
```python
) -> DepositionFile:
    dep_dir = self._dep_dir(deposition_id)
    dep_dir.mkdir(parents=True, exist_ok=True)
    target = dep_dir / filename
```
Path traversal vulnerability via unsanitized filename
The filename parameter comes directly from user input (file.filename in the upload route) and is used to construct a filesystem path without any sanitization. A malicious user could upload a file with a name like ../../etc/passwd or ../../../important_file to write outside the deposition directory.
This affects save_file, get_file, and delete_file — all of which construct paths as dep_dir / filename.
You should sanitize the filename to strip path separators and ensure the resolved path stays within the deposition directory. For example:
```diff
-    target = dep_dir / filename
+    target = dep_dir / Path(filename).name
+    if not target.resolve().is_relative_to(dep_dir.resolve()):
+        raise ValueError(f"Invalid filename: (unknown)")
```
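A minimal, self-contained sketch of the check the comment proposes (the helper name `safe_target` is illustrative, not part of the PR): take only the basename of the client-supplied filename, then verify the resolved path is still inside the deposition directory.

```python
from pathlib import Path

def safe_target(dep_dir: Path, filename: str) -> Path:
    # Drop any directory components the client supplied; keep the basename only.
    candidate = dep_dir / Path(filename).name
    # Defense in depth: the resolved path must stay inside dep_dir.
    if not candidate.resolve().is_relative_to(dep_dir.resolve()):
        raise ValueError("Invalid filename")
    return candidate

# "../../etc/passwd" is reduced to "passwd" and confined to dep_dir.
target = safe_target(Path("/tmp/depdir"), "../../etc/passwd")
```

`Path.is_relative_to` requires Python 3.9+; on older versions the same check can be done by comparing `os.path.commonpath` results.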
```python
op.add_column(
    "depositions",
    sa.Column("convention_srn", sa.String(), nullable=True),
)
```
Migration/table definition nullability mismatch
The migration adds convention_srn as nullable=True, but the SQLAlchemy table definition in tables.py:29 declares it as nullable=False, and the domain model requires convention_srn as a mandatory field on Deposition. This mismatch means:
- The migration will create the column as nullable in the database
- SQLAlchemy's table metadata believes it's non-nullable
- Any existing deposition rows would need a convention_srn value to satisfy the constraint
If this runs against a database with existing depositions, the ALTER TABLE ADD COLUMN with nullable=True will succeed, but those rows will have NULL for convention_srn, which will cause errors when the ORM tries to load them. The migration should either set a server_default or include a data migration step for existing rows, then alter to NOT NULL.
```diff
- op.add_column(
-     "depositions",
-     sa.Column("convention_srn", sa.String(), nullable=True),
- )
+ op.add_column(
+     "depositions",
+     sa.Column("convention_srn", sa.String(), nullable=False, server_default=""),
+ )
```
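For databases that may already contain deposition rows, the safer shape is the three-step migration the comment describes: add nullable, backfill, then tighten. A hedged sketch (the backfill value is a placeholder; a real migration would derive the correct convention per row):

```python
# Alembic sketch, not a drop-in migration for this PR.
import sqlalchemy as sa
from alembic import op

def upgrade() -> None:
    # 1. Add the column as nullable so existing rows are not rejected.
    op.add_column("depositions", sa.Column("convention_srn", sa.String(), nullable=True))
    # 2. Backfill existing rows (placeholder value shown).
    op.execute("UPDATE depositions SET convention_srn = '' WHERE convention_srn IS NULL")
    # 3. Now the NOT NULL constraint can be applied safely.
    op.alter_column("depositions", "convention_srn", nullable=False)
```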
```python
async def upload_file(
    self,
    srn: DepositionSRN,
    filename: str,
    content: bytes,
    size: int,
) -> Deposition:
    dep = await self.get(srn)
    convention = await self.convention_repo.get(dep.convention_srn)
    if convention is None:
        raise NotFoundError(f"Convention not found: {dep.convention_srn}")

    reqs = convention.file_requirements

    # Validate file type
    ext = _get_extension(filename)
    if reqs.accepted_types and ext not in reqs.accepted_types:
        raise ValidationError(f"File type '{ext}' not accepted. Allowed: {reqs.accepted_types}")

    # Validate file size
    if size > reqs.max_file_size:
        raise ValidationError(f"File size {size} exceeds maximum {reqs.max_file_size}")

    # Validate max count
    if len(dep.files) >= reqs.max_count:
        raise ValidationError(
            f"Maximum {reqs.max_count} files allowed, already have {len(dep.files)}"
        )

    # Store the file and get back the DepositionFile VO
    saved_file = await self.file_storage.save_file(srn, filename, content, size)
    dep.add_file(saved_file)
    await self.deposition_repo.save(dep)

    event = FileUploadedEvent(
        id=EventId(uuid4()),
        deposition_id=srn,
        filename=saved_file.name,
        size=saved_file.size,
        checksum=saved_file.checksum,
    )
    await self.outbox.append(event)
    return dep
```
No ownership verification for file operations
The upload_file, delete_file, update_metadata, and submit methods on DepositionService do not verify that the requesting user owns the deposition. The command handlers receive a principal but only check the role (e.g., at_least(Role.DEPOSITOR)), and the service methods accept the SRN without validating ownership.
This means any authenticated user with the DEPOSITOR role can upload files to, delete files from, modify metadata of, or submit another user's deposition by knowing (or guessing) its SRN.
Consider adding an ownership check in the service layer, e.g.:
```python
async def upload_file(self, srn, filename, content, size, *, caller_id: UserId):
    dep = await self.get(srn)
    if dep.owner_id != caller_id:
        raise AuthorizationError("Not the owner of this deposition")
    ...
```

Or leverage the existing @writes(owner(…)) decorator pattern used in PostgresDepositionRepository.save.
```python
@router.get("/{srn:path}/files/(unknown)")
async def download_file(
    srn: str,
    filename: str,
    handler: FromDishka[DownloadFileHandler],
) -> StreamingResponse:
    result = await handler.run(DownloadFile(srn=DepositionSRN.parse(srn), filename=filename))
    return StreamingResponse(
        result.stream,
        media_type=result.content_type or "application/octet-stream",
        headers={"Content-Disposition": f'attachment; filename="{result.filename}"'},
    )
```
Content-Disposition header injection via unsanitized filename
The filename path parameter and the result.filename value are interpolated directly into the Content-Disposition header without sanitization. A filename containing a " character or newlines could break out of the header value, enabling HTTP response header injection.
Consider sanitizing the filename in the response header, for example by stripping or escaping special characters:
```python
safe_name = result.filename.replace('"', '_').replace('\n', '_').replace('\r', '_')
headers={"Content-Disposition": f'attachment; filename="{safe_name}"'},
```

```python
) -> FileUploaded:
    content = await file.read()
    return await handler.run(
        UploadFileCommand(
            srn=DepositionSRN.parse(srn),
            filename=file.filename or "unknown",
            content=content,
            size=len(content),
        )
    )
```
Entire file content loaded into memory
await file.read() loads the full uploaded file into memory as a bytes object before passing it to the command. For large data files (the convention max_file_size can be up to ~10 GB based on the mock data), this will cause memory exhaustion and crash the server.
Consider streaming the content to disk in chunks rather than buffering the whole file in memory. The UploadFile object from FastAPI wraps a SpooledTemporaryFile that supports chunked reading.
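The chunked approach can be sketched with a plain file-like source; this also shows how the SHA-256 checksum the PR records can be computed incrementally without holding the body in memory. (`stream_to` and the 1 MiB chunk size are illustrative; with FastAPI's async UploadFile the read would be `await file.read(CHUNK_SIZE)`.)

```python
import hashlib
import io

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read

def stream_to(dst, src) -> tuple[int, str]:
    """Copy src to dst in chunks, returning (total_bytes, sha256_hexdigest)."""
    digest = hashlib.sha256()
    total = 0
    while chunk := src.read(CHUNK_SIZE):
        digest.update(chunk)   # checksum computed incrementally
        dst.write(chunk)       # only one chunk resident in memory at a time
        total += len(chunk)
    return total, digest.hexdigest()

size, checksum = stream_to(io.BytesIO(), io.BytesIO(b"x" * 3_000_000))
```

Writing to a temporary file and renaming on success would also preserve the atomic-write behavior LocalFileStorageAdapter already advertises.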
```python
    srn: str,
    handler: FromDishka[DownloadTemplateHandler],
) -> StreamingResponse:
    result = await handler.run(DownloadTemplate(convention_srn=ConventionSRN.parse(srn)))
    return StreamingResponse(
        iter([result.content]),
        media_type="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        headers={"Content-Disposition": f'attachment; filename="{result.filename}"'},
    )
```
Route order: /{srn:path}/template must be declared before /{srn:path}
FastAPI evaluates routes in declaration order. The GET /{srn:path} route at line 54 uses a path parameter that will match everything including /some-srn/template. Since /{srn:path} is declared after /{srn:path}/template here, this specific router is fine.
However, in depositions.py, the same pattern is used and the GET /{srn:path} route (line 159) is declared after the more specific sub-path routes — which is correct. Just flagging this as something to be aware of if routes are reordered in the future. The current order is correct for this file.
```diff
@@ -48,6 +69,7 @@ async def get_session(
 ) -> AsyncIterable[AsyncSession]:
     async with session_factory() as session:
         yield session
+        await session.commit()
```
Auto-commit added to session teardown
This await session.commit() in the session generator is a significant behavioral change — it means every UOW scope now auto-commits when the session is yielded back. This is intentional (it replaces the manual session.commit() calls that were removed from the worker), but it also means that any query-only operations (reads, listing) will trigger a commit, which is unnecessary overhead and could commit unintended writes if a read handler accidentally mutates state.
The worker cleanup removing manual commits makes sense given this change, but consider whether auto-committing for read-only operations is desirable. A common pattern is to only commit when writes are detected, or to use separate read/write session factories.
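One shape the "commit only when writes are detected" idea could take, sketched against SQLAlchemy's session state collections (`new`, `dirty`, `deleted`); the `StubSession` below stands in for a real AsyncSession purely so the sketch is self-contained:

```python
# Sketch: commit on teardown only if the session holds pending writes.
class StubSession:
    """Minimal stand-in exposing SQLAlchemy-style state collections."""
    def __init__(self):
        self.new, self.dirty, self.deleted = set(), set(), set()
        self.committed = False

    def commit(self):
        self.committed = True

def commit_if_dirty(session) -> bool:
    # Read-only scopes (listing, search) skip the commit entirely.
    if session.new or session.dirty or session.deleted:
        session.commit()
        return True
    return False

s = StubSession()
commit_if_dirty(s)          # no pending writes: nothing committed
s.dirty.add(object())
commit_if_dirty(s)          # write detected: committed
```

With the real AsyncSession the teardown would be `if session.new or session.dirty or session.deleted: await session.commit()`; separate read/write session factories achieve the same end with clearer intent.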
- Make convention_srn non-nullable in depositions table migration
- Add filename sanitization for Content-Disposition headers to prevent CRLF injection attacks
- Implement path traversal protection in LocalFileStorageAdapter by validating filenames and ensuring resolved paths stay within bounds
- Add comprehensive tests for security fixes
Summary
Test plan
- just test
- just lint
- pnpm lint && pnpm build

Closes #30