- Long-term stable archive of model inputs, model, database, and metadata
- Open-access distribution point for the public and collaborators
- Reusable automated creation of EML metadata
- Reusable automated upload of database and metadata to KNB
ARTIS uses the Knowledge Network for Biocomplexity (KNB) data repository to archive and distribute stable releases of the model codebase and resulting database. Archiving, documenting, and openly distributing ARTIS is a critical contribution to the larger open-science and reproducible-research community. KNB serves as the access point for anyone to download the ARTIS model codebase and the ARTIS database.
- Assume the ARTIS database is validated and in a cleaned architecture for distribution (done in )
- Update the ARTIS data dictionaries and long-lived documentation (if needed)
- Generate EML metadata documentation specific to the ARTIS dataset release version
- Point to the KNB staging member node
- Create a data package with the EML and the full ARTIS database
- Push to staging
KNB is guided by the FAIR (findable, accessible, interoperable, reusable) principles of data sharing and preservation. It issues a unique DOI (digital object identifier) to each data package, and to every version of the package, for long-term access, transparency, and informative citations. KNB is a member of DataONE (Data Observation Network for Earth), a network of data repositories. KNB uses EML (Ecological Metadata Language) to document the objects within a data package. EML can be authored via the website GUI (graphical user interface) or through a series of R packages; the ARTIS pipeline uses EMLassemblyline.
- KNB and ADC Data Team Training - for creating a data package to submit to KNB and for editing existing EML documentation.
- Instructions for the EML assembly line - practical instructions for running the EMLassemblyline (EAL) workflow to author EML.
- DataONE R package documentation - check out the vignettes, particularly:
  - DataONE Federation for KNB Authentication Tokens, and
  - Uploading Datasets to DataONE for an outline of the data package upload workflow.
This guide walks through generating an EML (Ecological Metadata Language) metadata document for a new ARTIS dataset release on the KNB data repository. The workflow uses the EMLassemblyline R package (EAL) combined with custom post-processing scripts to produce valid EML for the ARTIS parquet file collection.
```
knb-submit/
├── run_EMLassemblyline_for_metadata-files.R # Main workflow script — run this
├── functions/
│ └── ARTIS_EAL_helper_functions.R # Helper functions sourced by run script
├── tests/
│ └── testthat/
│ └── test-artis-eml-validation.R # Validation tests for dictionaries & templates
└── metadata-files/
├── artis_dictionary_tbl_attributes.txt # ⭐ Attribute definitions (long-lived)
├── artis_dictionary_tbl_attributes_catvars.txt # ⭐ Categorical variable definitions (long-lived)
├── artis_dictionary_hs_version.txt # ⭐ HS version definitions (long-lived)
├── artis_dictionary_tbl.txt # ⭐ Table-level descriptions (long-lived)
├── data_objects/ # Representative .csv files (auto-generated)
├── eml/ # Output EML .xml files land here
└── metadata_templates/
├── abstract.md # ⭐ Dataset abstract (update each release)
├── methods.md # ⭐ Dataset methods (update if needed)
├── additional_info.md # ⭐ Additional dataset info
├── keywords.txt # ⭐ Dataset keywords
├── personnel.txt # ⭐ Creator/contact/PI info (update each release)
├── intellectual_rights.txt # License (rarely needs editing)
├── attributes_*.txt # Auto-generated — do NOT manually edit
└── catvars_*.txt # Auto-generated — do NOT manually edit
```
⭐ = files you may need to update for a new release. All other files are either auto-generated or rarely change.
The workflow reads the ARTIS parquet dataset from a local path set in your `.Renviron` file. Set this once per machine:

```r
usethis::edit_r_environ(scope = "project")
```

Add this line to the `.Renviron` file that opens, replacing the path with your local ARTIS dataset location:

```
ARTIS_DB_PATH=/path/to/your/local/ARTIS/KNB/outputs
```

Save and restart R. Verify it worked:

```r
Sys.getenv("ARTIS_DB_PATH")
```

Note
You will need an ORCiD to log into KNB. Create one if you don't have one — it also serves as your author identifier in the metadata.
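Before running the workflow, it can help to sanity-check the environment variable. This is a quick manual check, not part of the ARTIS scripts; only the `ARTIS_DB_PATH` name comes from this guide.

```r
# Quick sanity check (not part of the ARTIS scripts): confirm the path
# variable is set and points at an existing directory
artis_path <- Sys.getenv("ARTIS_DB_PATH")
if (!nzchar(artis_path)) {
  stop("ARTIS_DB_PATH is not set; edit .Renviron and restart R")
}
if (!dir.exists(artis_path)) {
  warning("ARTIS_DB_PATH is set but does not exist on disk: ", artis_path)
}
```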
The EML generation workflow has three stages:

- Update metadata inputs — update dictionaries, templates, and config values for the new release
- Run the main script — `EMLassemblyline` generates EML for representative `.csv` files; post-processing replaces the `.csv` references with the full parquet file collection
- Validate and publish — run tests, validate the EML, upload to KNB
Open `run_EMLassemblyline_for_metadata-files.R` and update the top config section:

```r
# Personal script config
clean_up_templates <- "yes"  # always "yes" when rerunning to avoid stale templates
convert_parquets <- "yes"    # set to "yes" if the data schema has changed; "no" otherwise
final_eml_name <- "ARTIS_v1.2_FAO_parquet.xml"  # update version number here
```

Update the `make_eml()` call further down to reflect the new release:
```r
EMLassemblyline::make_eml(
  dataset.title = "Aquatic Resource Trade in Species (ARTIS) v1.3 FAO", # update version
  temporal.coverage = c("1996", "2021"), # update end year if coverage changed
  ...
)
```

Open `metadata-files/metadata_templates/abstract.md` in Positron or RStudio (not Excel) and update the temporal coverage, species counts, or any other release-specific language.
Warning
Do not use special characters, symbols, or formatting. EML accepts only plain Unicode text (UTF-8); URLs are acceptable. The `run_EMLassemblyline_for_metadata-files.R` script tries to read in `.txt` files safely because Excel can introduce encoding artifacts without the user knowing. LLM-generated text may also contain non-UTF-8 characters that will cause EML validation problems.
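The actual cleanup lives in `sanitize_encoding()` inside `functions/ARTIS_EAL_helper_functions.R`; its implementation is not shown here, but a hypothetical sketch of the kind of normalization involved looks like this:

```r
# Hypothetical sketch of the kind of cleanup sanitize_encoding() performs;
# the real function lives in functions/ARTIS_EAL_helper_functions.R.
sanitize_encoding_sketch <- function(x) {
  # Force to valid UTF-8, dropping bytes that cannot be converted
  x <- iconv(x, from = "", to = "UTF-8", sub = "")
  # Replace common "smart" punctuation that Excel and LLMs introduce
  x <- gsub("\u2018|\u2019", "'", x)  # curly single quotes -> '
  x <- gsub("\u201C|\u201D", '"', x)  # curly double quotes -> "
  x <- gsub("\u2013|\u2014", "-", x)  # en/em dashes -> hyphen
  x
}

sanitize_encoding_sketch("\u201Ctrade\u201D \u2014 ARTIS")
```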
Open metadata-files/metadata_templates/personnel.txt in a spreadsheet editor and verify or update author/contact information.
Key rules for this file:
- At least one `creator` and one `contact` must be listed — these are required by EML
- `userId` must be the 16-digit ORCiD number only, formatted as `XXXX-XXXX-XXXX-XXXX` — not the full URL
- Valid `role` values from the EAL documentation: `creator`, `contact`, `PI`, `metadataProvider`. Any other string is also accepted and will appear as an associated party. Note these are not valid `EML` values; EAL has its own set that gets translated in `make_eml()`.
- If a person has more than one role, duplicate their row with the second role. One row per role.
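The `userId` rule can be checked mechanically. Below is a hypothetical helper mirroring that rule (the real checks live in the testthat file); note that a valid ORCiD's final character may be an `X` checksum digit.

```r
# Hypothetical helper: does each userId match the bare ORCiD format
# XXXX-XXXX-XXXX-XXXX (final character may be the X checksum)?
check_orcid <- function(ids) {
  grepl("^\\d{4}-\\d{4}-\\d{4}-\\d{3}[0-9X]$", ids)
}

check_orcid(c("0000-0002-1825-0097",                    # bare ID: valid
              "https://orcid.org/0000-0002-1825-0097")) # full URL: invalid
```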
The four data dictionary files in metadata-files/ are designed to persist across releases. Only update them if:
- New columns were added to a table → add rows to `artis_dictionary_tbl_attributes.txt`
- Categorical values changed → update `artis_dictionary_tbl_attributes_catvars.txt`
- A new HS version is included → update `artis_dictionary_hs_version.txt`
- A new table was added → add rows to both the attribute and table-level dictionaries. (This might be more complicated and require changing the general table listings within `run_EMLassemblyline_for_metadata-files.R`.)
Tip
Open dictionary files in a spreadsheet editor if edits are needed. The sanitize_encoding() function in the run script automatically cleans up any encoding artifacts introduced by Excel on save — this is expected and handled.
Valid values for key columns in artis_dictionary_tbl_attributes.txt:
| Column | Valid values |
|---|---|
| `class` | `numeric`, `categorical`, `character`, `Date` |
| `unit` | Required when `class == "numeric"`. Use `dimensionless` if no units apply. Must be blank for non-numeric. Run `EMLassemblyline::view_unit_dictionary()` to find valid unit names. |
| `dateTimeFormatString` | Required when `class == "Date"`. Use format codes: `YYYY`, `MM`, `DD`, `hh`, `mm`, `ss`. Must be blank for non-Date. |
| `missingValueCode` | One value per attribute (e.g. `NA`). |
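The rules in the table above can be expressed as a small validation sketch. The function and column names below follow the dictionary columns, but this is illustrative only; the real checks live in the testthat file.

```r
# Sketch: enforce the class/unit/dateTimeFormatString rules from the table
# on an attributes data frame (illustrative, not the real test code)
check_attribute_rules <- function(attrs) {
  valid_class <- attrs$class %in% c("numeric", "categorical", "character", "Date")
  unit_ok <- ifelse(attrs$class == "numeric",
                    nzchar(attrs$unit),   # numeric: unit required
                    !nzchar(attrs$unit))  # non-numeric: must be blank
  date_ok <- ifelse(attrs$class == "Date",
                    nzchar(attrs$dateTimeFormatString),
                    !nzchar(attrs$dateTimeFormatString))
  all(valid_class) && all(unit_ok) && all(date_ok)
}

attrs <- data.frame(
  class = c("numeric", "character", "Date"),
  unit = c("dimensionless", "", ""),
  dateTimeFormatString = c("", "", "YYYY"),
  stringsAsFactors = FALSE
)
check_attribute_rules(attrs)  # -> TRUE: every row satisfies the rules
```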
In `run_EMLassemblyline_for_metadata-files.R`, set `convert_parquets <- "yes"` to convert the new dataset version. Set it to `"no"` if you are only re-running for the same dataset.
Run the full script:

```r
source("run_EMLassemblyline_for_metadata-files.R")
```

The script will:
- Delete stale `attributes_*.txt` and `catvars_*.txt` template files
- Read in the ARTIS data dictionaries with encoding sanitization
- Convert representative parquet files to `.csv` (if `convert_parquets == "yes"`)
- Run `EMLassemblyline` template functions to generate attribute and categorical variable templates
- Join your dictionary definitions into the templates
- Call `EMLassemblyline::make_eml()` to produce an initial EML `.xml` file describing the 8 representative ARTIS `.csv` files of the general table types
- Post-process the EML: the 8 representative `.csv` `<dataTable>` elements are replaced with `<dataTable>` elements describing the full collection of n parquet files. Each parquet file is matched to its ARTIS table type (e.g. `consumption`, `trade`, `reference_hs6`) and cloned from the corresponding representative `<dataTable>` template, preserving the full `<attributeList>` column definitions. Only the `<physical>` (file name, size, format), `<entityName>`, and `<entityDescription>` fields are updated to reflect each individual parquet file. This means the single representative `consumption` and `trade` templates are each stamped across all of their respective partitioned parquet files (split by HS version and year), while each reference table template is cloned once.
- Write the final EML to `metadata-files/eml/ARTIS_v1.2_FAO_parquet.xml`
- Validate the EML — you should see `[1] TRUE` with no errors
If validation fails, the error message will point to the invalid section.
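The clone-and-retarget post-processing step can be pictured as a short sketch using the `xml2` package. This is conceptual only: the XPath expressions ignore EML namespaces, the function name is hypothetical, and the script's actual code may differ.

```r
# Conceptual sketch of the post-processing step, assuming the xml2 package.
# Node paths and names are illustrative, not the script's actual code.
library(xml2)

clone_data_table <- function(template_node, parquet_name, parquet_size) {
  # Copy the representative <dataTable> (attributeList and all) next to itself
  clone <- xml_add_sibling(template_node, template_node, .copy = TRUE)
  # Retarget only the entity-level fields to the individual parquet file
  xml_set_text(xml_find_first(clone, ".//entityName"), parquet_name)
  xml_set_text(xml_find_first(clone, ".//physical/objectName"), parquet_name)
  xml_set_text(xml_find_first(clone, ".//physical/size"), parquet_size)
  clone
}
```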
Before publishing, run the dictionary and template checks. These are not a substitute for formal EML validation (which runs automatically at the end of the main script) — they are a supplementary check designed specifically for the ARTIS workflow. Because EMLassemblyline uses its own set of valid values that differ from raw EML schema values, these tests verify that the long-lived ARTIS data dictionaries stay aligned with what EMLassemblyline expects when it reads the generated attributes_*.txt and catvars_*.txt template files.
This is particularly useful to run after editing any of the dictionary files, before re-running the main script:

```r
testthat::test_file("tests/testthat/test-artis-eml-validation.R")
```

The checks confirm:
- All `class` values in the ARTIS dictionaries and generated attribute templates are valid EAL values (`numeric`, `categorical`, `character`, `Date`)
- Numeric attributes have units; non-numeric attributes do not
- `Date` attributes have a `dateTimeFormatString`; non-Date attributes do not
- All `attributeDefinition` and categorical `definition` fields are non-empty
- No non-UTF-8 encoding artifacts remain in dictionary character columns
- `personnel.txt` contains at least one `creator`, `contact`, `PI`, and `metadataProvider`
- All ORCiD `userId` values are formatted as `XXXX-XXXX-XXXX-XXXX`
Fix any failures in the source dictionary files and re-run the main script before proceeding to publish.
Note: KNB will assign a new package identifier in the production environment — you cannot reuse the staging identifier. Re-run the script one final time with the production identifier before the final upload.
- A new dataset version will be created on KNB for each release. We will NOT use versioned releases to update an existing dataset.
- Instructions to get an access token:
  - Log into KNB with your ORCiD
  - Go to My Profile → Settings → Authentication
  - Copy the "Token for DataONE R" snippet
  - Run it in the R console
- See the CRAN `dataone` package vignettes for details
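The token steps above can be sketched as follows, based on the `dataone` package vignettes. The node identifiers are assumptions (check the current DataONE node list), and constructing a client requires network access.

```r
# Sketch based on the dataone package vignettes; node identifiers are
# assumptions -- verify against the current DataONE node list.
library(dataone)

# Paste the snippet copied from KNB's "Token for DataONE R" tab, e.g.:
options(dataone_token = "<paste-your-token-here>")
# (test/staging environments read from the "dataone_test_token" option instead)

# Client for the production KNB member node (network call):
d1c <- D1Client("PROD", "urn:node:KNB")
```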