feat(dwc): export backend — 25 API endpoints, cache engine, EML, RSS by foozleface · Pull Request #7938 · specify/specify7

foozleface · 2026-04-09T17:14:55Z

Implements the full server-side DwC export pipeline: 21 new API endpoints (25 total with the 4 existing), a cache engine for pre-built query results, two archive generation paths, EML/RSS support, and a 217-term DwC vocabulary. Builds on the schema models from the DwC schema PR.

Implementation

25 API endpoints (specifyweb/backend/export/urls.py) covering mapping CRUD (create_mapping, update_mapping, delete_mapping, save_mapping_fields, clone_mapping, create_mapping_from_query), dataset CRUD (create_dataset, update_dataset, delete_dataset, clone_dataset), archive generation (generate_dwca, build_cache, cache_status), queries (list_queries, list_mappings, list_export_datasets), schema terms (schema_terms), OccurrenceID validation (validate_occurrence_ids), EML preview (preview_eml), RSS feed (rss_feed, download_feed), and feed updates (force_update, force_update_packages).
Direct query execution path (dwca_from_mapping.py) — executes the backing SpQuery live, writes CSV into a DwCA ZIP. Supports core + extension mappings with coreid linking.
Cache-first path (dwca_from_cache.py) — reads from pre-built cache tables for faster archive generation when data has not changed.
Cache engine (cache.py) — creates/drops/rebuilds MySQL cache tables using stream-and-batch-insert. Tracks build status (idle/building/error) in CacheTableMeta. Uses build_query() from the stored queries engine via SQLAlchemy.
Shared utilities (dwca_utils.py) — build_meta_xml() and build_eml_xml() generate standards-compliant meta.xml and eml.xml for the archive. Term name sanitization for safe column/file names.
Field adapter (field_adapter.py) — bridges Django Spqueryfield (lowercase attrs) to the EphemeralField interface (camelCase) expected by QueryField.from_spqueryfield() in the stored queries engine.
Schema terms vocabulary (schema_terms.json) — 217 DwC terms across 14 groups (Occurrence, Event, Location, Taxon, Identification, GeologicalContext, Record-level, Organism, MeasurementOrFact, ResourceRelationship, MaterialEntity, MaterialSample, Media, Record), each with suggested Specify mapping paths.
Default mappings (default_mappings.py) — pre-configured field sets for common DwC profiles.
OccurrenceID uniqueness validation — endpoint checks for duplicate GUIDs before archive generation.
Path traversal protection on download_feed — validates filenames against directory escape.
RSS feed rewrite — now driven by ExportDataSet.isrss flag instead of XML config files, with IPT-compatible <ipt:eml> and <ipt:dwca> elements.
Test suite (tests/) — unit tests for models, cache operations, archive generation, attachment URLs, and feed output.
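
For reviewers unfamiliar with the archive layout, the core + extension structure and coreid linking described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in dwca_utils.py or dwca_from_mapping.py: the function names (sketch_meta_xml, sketch_build_archive) and the table dict keys (row_type, filename, terms, rows) are hypothetical, and eml.xml is omitted for brevity. It assumes column 0 of every CSV holds the record identifier, referenced by <id> in the core file and by <coreid> in each extension file.

```python
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

DWC_TEXT_NS = "http://rs.tdwg.org/dwc/text/"
DWC_TERMS = "http://rs.tdwg.org/dwc/terms/"


def sketch_meta_xml(core, extensions=()):
    """Return a minimal meta.xml string for one core table plus extensions.

    Each table is a dict with hypothetical keys "row_type", "filename",
    and "terms" (the DwC terms for the columns after the id column).
    """
    archive = ET.Element("archive", {"xmlns": DWC_TEXT_NS})
    tables = [("core", "id", core)]
    tables += [("extension", "coreid", ext) for ext in extensions]
    for tag, id_tag, table in tables:
        node = ET.SubElement(archive, tag, {
            "encoding": "UTF-8",
            "fieldsTerminatedBy": ",",
            "linesTerminatedBy": "\\n",  # literal backslash-n, as meta.xml expects
            "fieldsEnclosedBy": '"',
            "ignoreHeaderLines": "1",
            "rowType": DWC_TERMS + table["row_type"],
        })
        files = ET.SubElement(node, "files")
        ET.SubElement(files, "location").text = table["filename"]
        # Column 0 links extension rows back to the core record.
        ET.SubElement(node, id_tag, {"index": "0"})
        for index, term in enumerate(table["terms"], start=1):
            ET.SubElement(node, "field", {"index": str(index), "term": DWC_TERMS + term})
    return ET.tostring(archive, encoding="unicode")


def sketch_build_archive(core, extensions=()):
    """Write each table's rows as CSV, add meta.xml, and return the ZIP bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        for table in [core, *extensions]:
            text = io.StringIO()
            writer = csv.writer(text, lineterminator="\n")
            writer.writerow(["id", *table["terms"]])  # header row, skipped by readers
            writer.writerows(table["rows"])
            zf.writestr(table["filename"], text.getvalue())
        zf.writestr("meta.xml", sketch_meta_xml(core, extensions))
    return buffer.getvalue()
```

The sketch builds the archive in memory so it can be returned directly from a view or written to disk; whether the PR streams to a temporary file instead is not something this example tries to mirror.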

Note: Depends on the DwC schema PR being merged first.

This is part of the DwC export pipeline addressing issues #7709-#7748 (40 GitHub issues for Darwin Core Archive support).

Testing instructions

Apply the DwC schema migrations first (from the schema PR)
Run the test suite: python manage.py test specifyweb.backend.export
Against a real Specify database: create a mapping via POST /export/create_mapping/, assign DwC terms via POST /export/save_mapping_fields/<id>/, then generate an archive via POST /export/generate_dwca/<id>/
Verify the generated ZIP contains valid meta.xml, eml.xml, and occurrence.csv
Check that GET /export/schema_terms/ returns the full vocabulary
Check that GET /export/validate_occurrence_ids/<id>/ catches duplicate GUIDs
Check that GET /export/rss_feed/ returns valid RSS XML for datasets whose ExportDataSet.isrss flag is set
Verify GET /export/download_feed/../etc/passwd returns 404 (path traversal protection)
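
For context on the last check, one common way to guard a download endpoint against directory escape is to resolve the requested name against the feed directory and reject anything that lands outside it. The sketch below illustrates that general technique; it is not the code in this PR, and resolve_feed_file and its parameters are hypothetical names. A view built this way would respond with 404 whenever the helper returns None.

```python
from pathlib import Path
from typing import Optional


def resolve_feed_file(feed_dir: str, requested_name: str) -> Optional[Path]:
    """Return the resolved path if it stays inside feed_dir, else None."""
    base = Path(feed_dir).resolve()
    # resolve() collapses ".." segments and follows symlinks; an absolute
    # requested_name replaces base entirely, which the check below rejects.
    candidate = (base / requested_name).resolve()
    if candidate == base or base not in candidate.parents:
        return None
    return candidate
```

Comparing resolved paths rather than scanning the string for ".." also covers absolute paths and symlinks that point outside the feed directory.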

feat: DwC export backend — 25 API endpoints, cache engine, EML, RSS

5a528b7

github-project-automation bot added this to General Tester Board Apr 9, 2026

github-project-automation bot moved this to 📋Back Log in General Tester Board Apr 9, 2026

foozleface wants to merge 1 commit into specify:main from calacademy-research:cas/dwc-backend