129 changes: 129 additions & 0 deletions docs/source/data_management/fair_principles.md
Member

Since metadata is mentioned in several places in this document, it might be good to cross-reference your metadata.md page somewhere in here.

@@ -0,0 +1,129 @@
# The FAIR Principles

The FAIR Principles are a *"set of standards that connects researchers, publishers,
and data repositories in Earth, space, and environmental sciences to ... accelerate
scientific discovery and enhance the integrity, transparency, and reproducibility of
scientific data on a large scale"* ([COPDESS](https://copdess.org/enabling-fair-data-project/)).
Essentially, scientific data should be **Findable, Accessible, Interoperable, and Reusable.**

## Why FAIR Principles Matter

More important than knowing *what* the FAIR goals are is understanding *why* they matter.
The nature of performing science is changing, including shifts in scientific publication
and peer review. Key changes include:

- Scientific analyses are being encoded in repeatable, shareable workflows.
- Publications are moving away from static print documentation to interactive demonstrations online.

These capabilities rely completely on the availability of the data underpinning the research.

The FAIR principles were originally articulated by [FORCE11](https://www.force11.org/about),
an organization founded on the belief that *"semantically enhanced, media-rich digital
publishing will be more powerful than traditional print media or electronic copies of printed works."*

FAIR principles have been adopted by researchers, publishers, and data repositories affiliated
with [COPDESS](https://copdess.org/enabling-fair-data-project/), the Coalition on Publishing Data
in the Earth and Space Sciences. Partners include:

- Publications such as *Nature* and *Science*
- Funding agencies like NASA, USGS, NOAA, and NIH
- Professional groups like AGU

For a full list of FAIR partners, see the [COPDESS FAIR Data Project](https://copdess.org/enabling-fair-data-project/).
To view the list of signatories committed to FAIR data, visit the [Statement of Commitment](https://copdess.org/statement-of-commitment/).

---

## FAIR Principles

The following are summarized descriptions of the FAIR principles,
adapted from [GO FAIR](https://www.go-fair.org/fair-principles/).

### **Findable**

The first step in (re)using data is to find them. [Metadata](metadata.md) and data should be easy
to find for both humans and computers. Machine-readable metadata are essential for
automatic discovery of datasets and services.

**Findable characteristics include:**

- Data and metadata are assigned globally unique and persistent identifiers.
- Data are described by rich metadata that clearly includes the identifier of the data they describe.
- Data and metadata are registered or indexed in a searchable resource.
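
As a minimal sketch of these characteristics, the record below pairs a dataset with a persistent identifier and checks that the metadata names the data it describes. The DOI, the field names, and the `is_findable` helper are all hypothetical and chosen only for illustration; they do not come from any particular repository's schema.

```python
import json

# Hypothetical metadata record; the DOI and field names are illustrative only.
record = {
    "identifier": "doi:10.0000/example-dataset",
    "title": "Example irradiance dataset",
    "keywords": ["solar", "irradiance"],
    "describes": "doi:10.0000/example-dataset",
}


def is_findable(metadata):
    """Return True if the record has an identifier and names the data it describes."""
    identifier = metadata.get("identifier")
    return bool(identifier) and metadata.get("describes") == identifier


# Serializing to JSON keeps the record machine-readable for an indexing service.
serialized = json.dumps(record, sort_keys=True)
```

A searchable resource could then index `serialized` under the identifier, which covers the third characteristic in the list.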

### **Accessible**

Once the user finds the required data, they need to know how to access them, including details about authentication and authorization.

**Characteristics of being accessible include:**

- Data and metadata are retrievable using common protocols.
- The protocol is open and free.
- Authentication and authorization procedures are applied where necessary.
- Metadata remain accessible even when data are no longer available.

### **Interoperable**

Data usually need to be integrated with other data. Additionally, data must
interoperate with applications or workflows for analysis, storage, and processing.

**Interoperable data and metadata:**

- Use a formal, accessible, broadly applicable language for knowledge representation.
- Use vocabularies that follow FAIR principles.
- Include qualified references to other data and metadata.

### **Reusable**

The ultimate goal of FAIR is to optimize data reuse. To achieve this, metadata and data
must be well-described to enable replication and/or combination in different settings.

**Reusable data and metadata:**

- Have clear, accessible data usage licenses.
- Are associated with detailed provenance.
- Meet domain-relevant community standards.

---

## How to Apply FAIR Principles

1. **Adopt FAIR-Compliant Practices**:
- Assign persistent identifiers to datasets and metadata.
- Use rich metadata that describe datasets thoroughly.

2. **Register Metadata**:
- Index metadata in searchable repositories to enhance discoverability.

3. **Implement Standards for Access and Interoperability**:
- Ensure retrieval protocols are open and free.
- Use FAIR-aligned vocabularies and knowledge representation languages.

4. **Provide Reuse Guidance**:
- Include detailed provenance information.
- Apply clear licenses for data usage.

5. **Collaborate with FAIR Partners**:
- Follow practices adopted by FORCE11, COPDESS, and similar organizations.
Comment on lines +91 to +107
Member

It is strange that the bulleted lists in the previous sections all are indented in the final product on the RtD page (which looks nice!) but these are not. I'm not sure why 🤔

Collaborator Author

@vmartinez-cu vmartinez-cu Jan 2, 2025

I think it might be an issue with mixing ordered lists with unordered lists. When I indent the bullets more, I get the following markdown lint error MD007 Unordered list indentation


---

## Useful Links

- [Metadata Overview](metadata.md)
- [COPDESS FAIR Data Project](https://copdess.org/enabling-fair-data-project/)
- [Statement of Commitment to FAIR Data](https://copdess.org/statement-of-commitment/)
- [GO FAIR Principles](https://www.go-fair.org/fair-principles/)
- [FORCE11](https://www.force11.org/about)

## Acronyms

- **FAIR** = Findable, Accessible, Interoperable, Reusable
- **FORCE11** = The Future of Research Communication and e-Scholarship
- **COPDESS** = Coalition on Publishing Data in the Earth and Space Sciences
- **AGU** = American Geophysical Union
- **NASA** = National Aeronautics and Space Administration
- **NIH** = National Institutes of Health
- **NOAA** = National Oceanic and Atmospheric Administration
- **USGS** = United States Geological Survey

Credit: Content taken from a Confluence guide written by Anne Wilson and modified by Shawn Polson in 2019.
4 changes: 3 additions & 1 deletion docs/source/data_management/index.rst
@@ -5,4 +5,6 @@ Data Management
.. toctree::
:maxdepth: 1

file_formats/index
file_formats/index
metadata.md
fair_principles.md
169 changes: 169 additions & 0 deletions docs/source/data_management/metadata.md
@@ -0,0 +1,169 @@
# Metadata

## Purpose

Metadata supports data science workflows by:

- Ensuring datasets are discoverable and usable by both humans and machines.
- Meeting internal and external policies for data accessibility and preservation.
- Enhancing collaboration by providing clear and standardized metadata practices.
- Contributing to the overall success of projects by enabling proper data usage and interoperability.

## What is Metadata

A dataset generally consists of sets of measured or modeled values. However, the values alone are
insufficient to understand and use that dataset. Consider this example of a very small dataset:

**Temperature: 31.5**

The data point “Temperature: 31.5” raises many questions:

- Temperature of what?
- According to whom or what?
- Collected when/where?
- Measured or calculated?
- If calculated, how?
- What units?
- To what precision?

To make this dataset FAIR (Findable, Accessible, Interoperable, and Reusable), additional information is needed.

Metadata is information (data) about a dataset. It includes:

- Time and spatial coverages and cadences
- Units
- Processing level
- Data quality
- Instrument details
- Principal Investigator
- Provenance
- Special alerts, etc.

Ideally, metadata provides all the information necessary to find, understand,
and use the dataset correctly. Good quality metadata is critical for data to be FAIR.
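
To make the "Temperature: 31.5" example concrete, here is a minimal sketch of the same value bundled with metadata that answers the questions listed earlier. The `Measurement` class, its field names, and every value in `reading` are invented for illustration.

```python
from dataclasses import dataclass, asdict


@dataclass
class Measurement:
    """A single value bundled with the metadata needed to interpret it."""
    quantity: str    # temperature of what?
    value: float
    units: str       # what units?
    precision: float # to what precision?
    source: str      # according to whom or what?
    time_utc: str    # collected when?
    location: str    # collected where?
    method: str      # measured or calculated?


# All values below are invented for illustration.
reading = Measurement(
    quantity="air temperature",
    value=31.5,
    units="degC",
    precision=0.1,
    source="hypothetical weather station A",
    time_utc="2019-07-01T18:00:00Z",
    location="40.0N, 105.3W",
    method="measured",
)


def describe(m):
    """Render the measurement and its key metadata as one readable line."""
    return f"{m.quantity}: {m.value} {m.units} (+/-{m.precision}), {m.method} by {m.source}"
```

Calling `asdict(reading)` yields a plain dictionary, which is easy to serialize alongside the data.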

## Benefits of Good Quality Metadata

Good quality, searchable metadata enables people to find data that fits their needs:

- **Good quality**: Sufficient information is provided.
- **Searchable**: Users can find data by various facets like spatial or temporal coverage.

## Metadata Storage, Formats, and Access

### Storage Options

Best practices for metadata storage include:

1. **Machine-readable metadata** consumable by common tools.
2. **Publicly accessible metadata** readable by humans.
3. **Avoid private, inaccessible formats** like personal notebooks or sticky notes.

#### Examples of Metadata Storage

- **Prose embedded in HTML**: Readable by humans but not easily consumable by tools.
- **Public spreadsheets**: Readable by tools that understand the structure but not widely accessible otherwise.
- **Self-describing formats**: Examples include:
  - **NetCDF, HDF, FITS**: Include specific metadata properties like variables, geospatial coverage, and time coverage.
- **Header information** in CSV or ASCII tables:
  - Simple but less machine-readable.

Machine readability often depends on established metadata conventions, such as
**Climate and Forecast (CF) conventions** used widely in atmospheric science ([More details here](https://www.unidata.ucar.edu/software/netcdf/workshops/most-recent/cf/index.html)).
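
The dependence on conventions can be sketched with header metadata in a CSV file: the parser below works only because it assumes a `# key: value` header layout. That layout, the sample contents in `raw`, and the `split_header` helper are hypothetical, not an established standard.

```python
import csv
import io

# Hypothetical file contents: "# key: value" header lines followed by a CSV table.
raw = """# title: Example temperature record
# units: degC
# time_coverage_start: 2019-07-01T00:00:00Z
time,temperature
2019-07-01T00:00:00Z,31.5
2019-07-01T01:00:00Z,30.9
"""


def split_header(text, prefix="#"):
    """Separate 'key: value' header lines from the data rows."""
    metadata, data_lines = {}, []
    for line in text.splitlines():
        if line.startswith(prefix):
            # Split on the first colon only, so values may contain colons.
            key, _, value = line[len(prefix):].partition(":")
            metadata[key.strip()] = value.strip()
        else:
            data_lines.append(line)
    rows = list(csv.DictReader(io.StringIO("\n".join(data_lines))))
    return metadata, rows


metadata, rows = split_header(raw)
```

A file that used a different comment prefix or separator would need a different parser, which is the cost of ad hoc headers compared with a shared convention.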

### LASP Metadata Repository

LASP is developing the **LASP Extended Metadata Repository (LEMR)** to store and access dataset metadata:
Member

Cool! I didn't know about this.

Collaborator Author

I didn't either! so hopefully this information isn't out-of-date


- Automates and dynamically accesses essential properties for data services.
- Plans to extend metadata management capabilities for LASP scientists.

## Metadata Formats

Metadata formats refer to schemas describing the metadata structure. Examples include:

- **[ISO 19115](https://www.fgdc.gov/metadata/iso-standards)**: Geographic information and services.
- **[SPASE](https://spdf.gsfc.nasa.gov/spdf-documents/SPASE_and_SPDF.html)**: Used in Heliophysics.

At LASP, the **laspds schema** is used for applications serving data, with plans to
integrate with standard schemas like SPASE and ISO 19115.

## What Metadata to Save

### Key Considerations

At project inception:

- Identify essential metadata for understanding and using the dataset.
- Create a plan to preserve this information.

### Balancing Minimal and Comprehensive Metadata

Repositories often strike a balance between requiring minimal metadata (to lower barriers
to participation) and requiring enough metadata for full dataset understanding, recognizing
that providing quality metadata takes resources.

- Example: **CU Scholar** requires:
  - Landing page URL
  - Names of dataset creators
  - Title
  - Publishing organization
  - Resource type

This information alone would not be sufficient to use a dataset, but it is sufficient
to allow CU Scholar to serve the dataset. CU Scholar expects additional details
(e.g., coverages, units, quality indicators) to be available on the landing page or
via self-describing formats.
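
A minimal-metadata requirement like this can be checked before submission. The sketch below is only an illustration: the field names in `REQUIRED_FIELDS` are hypothetical stand-ins for the five items listed above, not CU Scholar's actual schema or API.

```python
# Field names are hypothetical stand-ins for the five required items.
REQUIRED_FIELDS = ("landing_page_url", "creators", "title", "publisher", "resource_type")


def missing_fields(record):
    """Return the required fields that are absent or empty, in a stable order."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Example submission that omits the publishing organization.
submission = {
    "landing_page_url": "https://example.org/dataset",
    "creators": ["A. Researcher"],
    "title": "Example dataset",
    "resource_type": "Dataset",
}
```

Here `missing_fields(submission)` reports `publisher`, so the record would need that field before it could be served.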

## Provenance

The **provenance** of a dataset describes its history and is critical for using datasets correctly:

- Origin of the data
- Processing methods
- Calibration and validation details
- Software versions used

Data producers should record:

- Dataset inputs
- Processing steps
- Configuration, calibration, and validation details

Because provenance is often provided as descriptive prose, storing it as machine-readable text is a reasonable option.
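
As one possible sketch, provenance can be kept as structured text that is still easy to turn into prose. The file names, software name, versions, and the `summarize` helper below are hypothetical.

```python
import json

# Hypothetical provenance record; every name and version here is illustrative.
provenance = {
    "inputs": ["level1_file_a.nc", "level1_file_b.nc"],
    "processing_steps": [
        {"step": "calibrate", "software": "example_pipeline", "version": "1.2.0"},
        {"step": "average_daily", "software": "example_pipeline", "version": "1.2.0"},
    ],
    "calibration": "calibration table v3",
    "validation": "compared against reference instrument",
}


def summarize(record):
    """Render the provenance as one line of prose for human readers."""
    steps = ", ".join(s["step"] for s in record["processing_steps"])
    return f"Derived from {len(record['inputs'])} input(s) via: {steps}"


# JSON text is both human-readable and parseable by tools.
provenance_json = json.dumps(provenance, indent=2)
```

The same record serves both audiences: `provenance_json` for tools and `summarize(provenance)` for a landing page.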

**Learn More**: [The Importance of Data Set Provenance for Science](https://eos.org/opinions/the-importance-of-data-set-provenance-for-science).

## Summary of Metadata Workflow

1. **Identify Necessary Metadata**:
- At project inception, determine what metadata is essential for understanding and using the dataset.
2. **Choose the Appropriate Storage Option**:
- Use machine-readable formats like NetCDF or HDF where possible.
- For simpler use cases, include metadata in file headers or spreadsheets, ensuring structure is clear.
3. **Follow Metadata Conventions**:
- Adhere to standards for machine-readability.
- Consult metadata experts when encoding complex datasets.
4. **Leverage LASP’s Tools**:
- Use the **LASP Extended Metadata Repository (LEMR)** for automated and dynamic metadata management if applicable.
- Work with LASP administrators to input metadata into LEMR.
5. **Maintain Provenance**:
- Record dataset inputs, processing, calibration, and validation details.
- Provide descriptive prose or structured metadata to ensure provenance is clear and traceable.

## Useful Links

- [CF Conventions for NetCDF](https://www.unidata.ucar.edu/software/netcdf/workshops/most-recent/cf/index.html)
- [The Importance of Data Set Provenance for Science](https://eos.org/opinions/the-importance-of-data-set-provenance-for-science)
- [NASA DOI Landing Page Requirements](https://wiki.earthdata.nasa.gov/display/DOIsforEOSDIS/DOI+Landing+Page)
- [CU Scholar Metadata Requirements](https://scholar.colorado.edu/faq)

## Acronyms

- **CF** = Climate and Forecast
- **FAIR** = Findable, Accessible, Interoperable, and Reusable
- **ISO** = International Organization for Standardization
- **LEMR** = LASP Extended Metadata Repository
- **SPASE** = Space Physics Archive Search and Extract

Credit: Content taken from a Confluence guide written by Anne Wilson and Shawn Polson.