Seafood-Globalization-Lab/knb-submit

Submit ARTIS to KNB Data Repository

Purpose

  • Long-term stable archive of model inputs, model, database, and metadata
  • Open-access distribution point for the public and collaborators
  • Reusable automated creation of EML metadata
  • Reusable automated upload of database and metadata to KNB

ARTIS uses the Knowledge Network for Biocomplexity (KNB) data repository to archive and distribute stable releases of the model codebase and the resulting database. Archiving, documenting, and openly distributing ARTIS is a critical contribution to the larger open-science and reproducible-research community. ARTIS uses KNB as an access point for anyone to download the ARTIS model codebase and the ARTIS database.

Workflow Outline for Publishing ARTIS dataset on KNB

  • Assume the ARTIS database is validated and in a cleaned architecture for distribution (done in )
  • Update the ARTIS data dictionaries and long-lived documentation (if needed)
  • Generate EML metadata documentation specific to the ARTIS dataset release version
  • Point to the KNB staging member node
  • Create a data package with the EML and the full ARTIS database
  • Push to staging

Additional KNB Info

KNB is guided by the FAIR (findable, accessible, interoperable, reusable) principles of data sharing and preservation, and it issues a unique DOI (digital object identifier) to each data package and to every version of that package for long-term access, transparency, and informative citations. KNB is a member of DataONE (Data Observation Network for Earth), a network of data repositories. KNB uses EML (Ecological Metadata Language) to document the objects within a data package; EML can be authored via the website GUI (graphical user interface) or through a series of R packages. The ARTIS pipeline uses EMLassemblyline.

Creating EML Metadata for ARTIS Dataset Releases

This guide walks through generating an EML (Ecological Metadata Language) metadata document for a new ARTIS dataset release on the KNB data repository. The workflow uses the EMLassemblyline R package (EAL) combined with custom post-processing scripts to produce valid EML for the ARTIS parquet file collection.

Relevant File Architecture

knb-submit/
├── run_EMLassemblyline_for_metadata-files.R   # Main workflow script — run this
├── functions/
│   └── ARTIS_EAL_helper_functions.R           # Helper functions sourced by run script
├── tests/
│   └── testthat/
│       └── test-artis-eml-validation.R        # Validation tests for dictionaries & templates
└── metadata-files/
    ├── artis_dictionary_tbl_attributes.txt        # ⭐ Attribute definitions (long-lived)
    ├── artis_dictionary_tbl_attributes_catvars.txt # ⭐ Categorical variable definitions (long-lived)
    ├── artis_dictionary_hs_version.txt            # ⭐ HS version definitions (long-lived)
    ├── artis_dictionary_tbl.txt                   # ⭐ Table-level descriptions (long-lived)
    ├── data_objects/                              # Representative .csv files (auto-generated)
    ├── eml/                                       # Output EML .xml files land here
    └── metadata_templates/
        ├── abstract.md                            # ⭐ Dataset abstract (update each release)
        ├── methods.md                             # ⭐ Dataset methods (update if needed)
        ├── additional_info.md                     # ⭐ Additional dataset info
        ├── keywords.txt                           # ⭐ Dataset keywords
        ├── personnel.txt                          # ⭐ Creator/contact/PI info (update each release)
        ├── intellectual_rights.txt                # License (rarely needs editing)
        ├── attributes_*.txt                       # Auto-generated — do NOT manually edit
        └── catvars_*.txt                          # Auto-generated — do NOT manually edit

⭐ = files you may need to update for a new release. All other files are either auto-generated or rarely change.

Prerequisites

Environment setup

The workflow reads the ARTIS parquet dataset from a local path set in your .Renviron file. Set this once per machine:

usethis::edit_r_environ(scope = "project")

Add this line to the .Renviron file that opens, replacing the path with your local ARTIS dataset location:

ARTIS_DB_PATH=/path/to/your/local/ARTIS/KNB/outputs

Save and restart R. Verify it worked:

Sys.getenv("ARTIS_DB_PATH")

Note

You will need an ORCiD to log into KNB. Create one if you don't have one — it also serves as your author identifier in the metadata.

EMLassemblyline Workflow

Overview

The EML generation workflow has three stages:

  1. Update metadata inputs — update dictionaries, templates, and config values for the new release
  2. Run the main script — EMLassemblyline generates EML for representative .csv files; post-processing replaces the .csv references with the full parquet file collection
  3. Validate and publish — run tests, validate EML, upload to KNB

Update the script config

Open run_EMLassemblyline_for_metadata-files.R and update the top config section:

# Personal script config
clean_up_templates <- "yes"   # always "yes" when rerunning to avoid stale templates
convert_parquets   <- "yes"   # set to "yes" if the data schema has changed; "no" otherwise
final_eml_name     <- "ARTIS_v1.2_FAO_parquet.xml"  # update version number here

Update the make_eml() call further down to reflect the new release:

EMLassemblyline::make_eml(
  dataset.title    = "Aquatic Resource Trade in Species (ARTIS) v1.2 FAO",  # update version
  temporal.coverage = c("1996", "2021"),  # update end year if coverage changed
  ...
)

Update abstract.md

Open metadata-files/metadata_templates/abstract.md in Positron or RStudio (not Excel) and update the temporal coverage, species counts, or any other release-specific language.

Warning

Do not use special characters, symbols, or rich formatting. EML only accepts plain UTF-8 text; URLs are acceptable. The run_EMLassemblyline_for_metadata-files.R script tries to read in .txt files safely because Excel can silently introduce encoding artifacts. LLM-generated text may also contain non-UTF-8 characters that will cause EML validation problems.
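To spot-check a template file for invalid characters before running the pipeline, you can use base R's validUTF8() (a quick sketch; the file path is the abstract template shown above):

```r
# Flag any lines in abstract.md that are not valid UTF-8
lines <- readLines("metadata-files/metadata_templates/abstract.md", warn = FALSE)
bad   <- which(!validUTF8(lines))
if (length(bad) > 0) {
  cat("Non-UTF-8 characters on line(s):", paste(bad, collapse = ", "), "\n")
} else {
  cat("abstract.md is clean UTF-8\n")
}
```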

Update personnel.txt

Open metadata-files/metadata_templates/personnel.txt in a spreadsheet editor and verify or update author/contact information.

Key rules for this file:

  • At least one creator and one contact must be listed — these are required by EML
  • userId must be the 16-digit ORCiD number only, formatted as XXXX-XXXX-XXXX-XXXX — not the full URL
  • Valid role values from the EAL documentation: creator, contact, PI, metadataProvider. Any other string is also accepted and will appear as an associated party. Note that these are EAL values, not valid EML values; EAL translates them inside make_eml().
  • If a person has more than one role, duplicate their row with the second role. One row per role.
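The rules above can be pictured with a pair of hypothetical rows (the person, organization, and email are made up, and only a subset of the tab-delimited columns is shown; the duplicated row gives one person two roles):

```
givenName  surName  organizationName  electronicMailAddress  userId               role
Jane       Doe      Example Univ.     jdoe@example.edu       0000-0001-2345-6789  creator
Jane       Doe      Example Univ.     jdoe@example.edu       0000-0001-2345-6789  contact
```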

Update data dictionaries (only if columns or tables changed)

The four data dictionary files in metadata-files/ are designed to persist across releases. Only update them if:

  • New columns were added to a table → add rows to artis_dictionary_tbl_attributes.txt
  • Categorical values changed → update artis_dictionary_tbl_attributes_catvars.txt
  • A new HS version is included → update artis_dictionary_hs_version.txt
  • A new table was added → add rows to both the attribute and table-level dictionaries. (This may be more complicated and require changing the general table listings within run_EMLassemblyline_for_metadata-files.R.)

Tip

Open dictionary files in a spreadsheet editor if edits are needed. The sanitize_encoding() function in the run script automatically cleans up any encoding artifacts introduced by Excel on save — this is expected and handled.
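The helper's behavior can be approximated with base R's iconv() (a rough sketch only, not the actual implementation in functions/ARTIS_EAL_helper_functions.R):

```r
# Rough sketch of encoding cleanup: coerce text to UTF-8, dropping any
# bytes that cannot be converted (e.g. Windows-1252 smart quotes from Excel)
sanitize_encoding_sketch <- function(x) {
  out <- iconv(x, from = "", to = "UTF-8", sub = "")
  trimws(out)
}
```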

Valid values for key columns in artis_dictionary_tbl_attributes.txt:

| Column | Valid values |
| --- | --- |
| class | numeric, categorical, character, Date |
| unit | Required when class == "numeric". Use dimensionless if no units apply. Must be blank for non-numeric. Run EMLassemblyline::view_unit_dictionary() to find valid unit names. |
| dateTimeFormatString | Required when class == "Date". Use format codes: YYYY, MM, DD, hh, mm, ss. Must be blank for non-Date. |
| missingValueCode | One value per attribute (e.g. NA). |
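For example, dictionary rows obeying these rules might look like the following (attribute names and values are illustrative, and the attributeDefinition and other columns are omitted for brevity):

```
attributeName   class        unit           dateTimeFormatString  missingValueCode
live_weight_t   numeric      dimensionless                        NA
exporter_iso3c  categorical                                       NA
year            Date                        YYYY                  NA
```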

Run the Main Script

In run_EMLassemblyline_for_metadata-files.R, set convert_parquets <- "yes" to convert the new dataset version; set it to "no" if you are only re-running for the same dataset.
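The conversion step can be pictured roughly as follows (a sketch assuming the arrow package; the file names are illustrative and the actual file-selection logic lives in the run script):

```r
library(arrow)

# Convert one representative parquet file per table type to .csv so that
# EMLassemblyline can template its attributes (paths are illustrative)
parquet_file <- file.path(Sys.getenv("ARTIS_DB_PATH"), "consumption_HS12_2020.parquet")
df <- arrow::read_parquet(parquet_file)
write.csv(df, "metadata-files/data_objects/consumption.csv", row.names = FALSE)
```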

Run the full script:

source("run_EMLassemblyline_for_metadata-files.R")

The script will:

  1. Delete stale attributes_*.txt and catvars_*.txt template files
  2. Read in the ARTIS data dictionaries with encoding sanitization
  3. Convert representative parquet files to .csv (if convert_parquets == "yes")
  4. Run EMLassemblyline template functions to generate attribute and categorical variable templates
  5. Join your dictionary definitions into the templates
  6. Call EMLassemblyline::make_eml() to produce an initial EML .xml file describing the 8 ARTIS representative .csv files of the general table types.
  7. Post-process the EML: the 8 representative .csv <dataTable> elements are replaced with <dataTable> elements describing the full collection of n parquet files. Each parquet file is matched to its ARTIS table type (e.g. consumption, trade, reference_hs6) and cloned from the corresponding representative <dataTable> template — preserving the full <attributeList> column definitions. Only the <physical> (file name, size, format), <entityName>, and <entityDescription> fields are updated to reflect each individual parquet file. This means the single representative consumption and trade templates are each stamped across all of their respective partitioned parquet files (split by HS version and year), while each reference table template is cloned once.
  8. Write the final EML to metadata-files/eml/ARTIS_v1.2_FAO_parquet.xml
  9. Validate the EML — you should see [1] TRUE with no errors

If validation fails, the error message will point to the invalid section.
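The cloning logic in step 7 can be sketched with the xml2 package (an illustrative sketch only, not the real implementation, which lives in the run script and helper functions; element paths follow the EML dataTable structure):

```r
library(xml2)

# Clone a representative <dataTable> for one parquet file, keeping the full
# <attributeList> and updating only the file-specific fields (sketch)
clone_data_table <- function(template_node, parquet_name, parquet_size) {
  clone <- xml_add_sibling(template_node, template_node)  # deep copy of the template
  xml_text(xml_find_first(clone, ".//entityName")) <- parquet_name
  xml_text(xml_find_first(clone, ".//physical/objectName")) <- parquet_name
  xml_text(xml_find_first(clone, ".//physical/size")) <- as.character(parquet_size)
  clone
}
```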

Check Dictionaries and Templates

Before publishing, run the dictionary and template checks. These are not a substitute for formal EML validation (which runs automatically at the end of the main script) — they are a supplementary check designed specifically for the ARTIS workflow. Because EMLassemblyline uses its own set of valid values that differ from raw EML schema values, these tests verify that the long-lived ARTIS data dictionaries stay aligned with what EMLassemblyline expects when it reads the generated attributes_*.txt and catvars_*.txt template files.

This is particularly useful to run after editing any of the dictionary files before re-running the main script:

testthat::test_file("tests/testthat/test-artis-eml-validation.R")

The checks confirm:

  • All class values in the ARTIS dictionaries and generated attribute templates are valid EAL values (numeric, categorical, character, Date)
  • Numeric attributes have units; non-numeric attributes do not
  • Date attributes have a dateTimeFormatString; non-Date attributes do not
  • All attributeDefinition and categorical definition fields are non-empty
  • No non-UTF-8 encoding artifacts remain in dictionary character columns
  • personnel.txt contains at least one creator, contact, PI, and metadataProvider
  • All ORCiD userId values are formatted as XXXX-XXXX-XXXX-XXXX

Fix any failures in the source dictionary files and re-run the main script before proceeding to publish.
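One of these checks can be reproduced in isolation; for example, the ORCiD format check might look like this (a sketch of the kind of assertion in test-artis-eml-validation.R, not its exact code; note that a real ORCiD may end in X):

```r
library(testthat)

personnel <- read.delim("metadata-files/metadata_templates/personnel.txt")
test_that("ORCiD userId values are formatted XXXX-XXXX-XXXX-XXXX", {
  ids <- personnel$userId[personnel$userId != ""]
  expect_true(all(grepl("^\\d{4}-\\d{4}-\\d{4}-\\d{3}[0-9X]$", ids)))
})
```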

Note: KNB will assign a new package identifier in the production environment — you cannot reuse the staging identifier. Re-run the script one final time with the production identifier before the final upload.
