feat(dwc): export backend — 25 API endpoints, cache engine, EML, RSS#7938

Open
foozleface wants to merge 1 commit into specify:main from calacademy-research:cas/dwc-backend

Contributed by @foozleface

Implements the full server-side DwC export pipeline: 21 new API endpoints (25 total with the 4 existing), a cache engine for pre-built query results, two archive generation paths, EML/RSS support, and a 217-term DwC vocabulary. Builds on the schema models from the DwC schema PR.

Implementation

  • 25 API endpoints (specifyweb/backend/export/urls.py) covering mapping CRUD (create_mapping, update_mapping, delete_mapping, save_mapping_fields, clone_mapping, create_mapping_from_query), dataset CRUD (create_dataset, update_dataset, delete_dataset, clone_dataset), archive generation (generate_dwca, build_cache, cache_status), queries (list_queries, list_mappings, list_export_datasets), schema terms (schema_terms), OccurrenceID validation (validate_occurrence_ids), EML preview (preview_eml), RSS feed (rss_feed, download_feed), and feed updates (force_update, force_update_packages).
  • Direct query execution path (dwca_from_mapping.py) — executes the backing SpQuery live, writes CSV into a DwCA ZIP. Supports core + extension mappings with coreid linking.
  • Cache-first path (dwca_from_cache.py) — reads from pre-built cache tables for faster archive generation when data has not changed.
  • Cache engine (cache.py) — creates/drops/rebuilds MySQL cache tables using stream-and-batch-insert. Tracks build status (idle/building/error) in CacheTableMeta. Uses build_query() from the stored queries engine via SQLAlchemy.
  • Shared utilities (dwca_utils.py) — build_meta_xml() and build_eml_xml() generate standards-compliant meta.xml and eml.xml for the archive. Term name sanitization for safe column/file names.
  • Field adapter (field_adapter.py) — bridges Django Spqueryfield (lowercase attrs) to the EphemeralField interface (camelCase) expected by QueryField.from_spqueryfield() in the stored queries engine.
  • Schema terms vocabulary (schema_terms.json) — 217 DwC terms across 14 groups (Occurrence, Event, Location, Taxon, Identification, GeologicalContext, Record-level, Organism, MeasurementOrFact, ResourceRelationship, MaterialEntity, MaterialSample, Media, Record), each with suggested Specify mapping paths.
  • Default mappings (default_mappings.py) — pre-configured field sets for common DwC profiles.
  • OccurrenceID uniqueness validation — endpoint checks for duplicate GUIDs before archive generation.
  • Path traversal protection on download_feed — validates filenames against directory escape.
  • RSS feed rewrite — now driven by ExportDataSet.isrss flag instead of XML config files, with IPT-compatible <ipt:eml> and <ipt:dwca> elements.
  • Test suite (tests/) — unit tests for models, cache operations, archive generation, attachment URLs, and feed output.
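
For reviewers unfamiliar with coreid linking, here is a minimal sketch of what the direct path produces: a core CSV and an extension CSV in one ZIP, where each extension row points back to a core row through `coreid`. This is an illustration, not the code in dwca_from_mapping.py. The helper names, the column sets, and the `multimedia.csv` file name are assumptions, and meta.xml and eml.xml are left out for brevity.

```python
import csv
import io
import zipfile


def rows_to_csv(header, rows):
    """Serialize a header and rows to CSV text (hypothetical helper)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def build_archive(core_rows, extension_rows):
    """Return ZIP bytes holding a core file and one extension file.

    Each extension row carries the core row's id in its first column
    (coreid), which is how a DwCA reader joins the two files.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(
            "occurrence.csv",
            rows_to_csv(["id", "occurrenceID", "scientificName"], core_rows),
        )
        # "multimedia.csv" is an assumed extension file name.
        zf.writestr(
            "multimedia.csv",
            rows_to_csv(["coreid", "identifier"], extension_rows),
        )
    return out.getvalue()
```

A real archive also needs meta.xml to declare which column of each file holds the id or coreid; in this PR that is the job of build_meta_xml().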

Note: Depends on the DwC schema PR being merged first.

This is part of the DwC export pipeline addressing issues #7709-#7748 (40 GitHub issues for Darwin Core Archive support).

Testing instructions

  • Apply the DwC schema migrations first (from the schema PR)
  • Run the test suite: python manage.py test specifyweb.backend.export
  • Against a real Specify database: create a mapping via POST /export/create_mapping/, assign DwC terms via POST /export/save_mapping_fields/<id>/, then generate an archive via POST /export/generate_dwca/<id>/
  • Verify the generated ZIP contains valid meta.xml, eml.xml, and occurrence.csv
  • Test GET /export/schema_terms/ returns the full vocabulary
  • Test GET /export/validate_occurrence_ids/<id>/ catches duplicate GUIDs
  • Test GET /export/rss_feed/ returns valid RSS XML for datasets with isRss=true
  • Verify GET /export/download_feed/../etc/passwd returns 404 (path traversal protection)
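
The kind of check the last step exercises could look like the sketch below. It is an illustration under stated assumptions, not the validation shipped in this PR: `is_safe_feed_filename` and the feed directory argument are hypothetical names.

```python
import os


def is_safe_feed_filename(feed_dir: str, filename: str) -> bool:
    """Return True only if `filename` resolves strictly inside `feed_dir`."""
    base = os.path.realpath(feed_dir)
    # join() discards `base` when `filename` is absolute; realpath() then
    # collapses ".." segments and resolves symlinks.
    target = os.path.realpath(os.path.join(base, filename))
    try:
        return target != base and os.path.commonpath([base, target]) == base
    except ValueError:
        # Raised on Windows when the two paths are on different drives.
        return False
```

A view using a check like this would answer with 404 whenever it returns False, which matches the expected result for `../etc/passwd` above.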


Projects

Status: 📋Back Log
