diff --git a/docs/source/data_management/fair_principles.md b/docs/source/data_management/fair_principles.md
new file mode 100644
index 0000000..61b1b27
--- /dev/null
+++ b/docs/source/data_management/fair_principles.md
@@ -0,0 +1,129 @@
+# The FAIR Principles
+
+The FAIR Principles are a *"set of standards that connects researchers, publishers,
+and data repositories in Earth, space, and environmental sciences to ... accelerate
+scientific discovery and enhance the integrity, transparency, and reproducibility of
+scientific data on a large scale"* ([COPDESS](https://copdess.org/enabling-fair-data-project/)).
+Essentially, scientific data should be **Findable, Accessible, Interoperable, and Reusable.**
+
+## Why FAIR Principles Matter
+
+Understanding *why* the FAIR goals matter is more important than knowing *what* they are.
+The practice of science is changing, including shifts in scientific publication
+and peer review. Key changes include:
+
+- Scientific analyses are being encoded in repeatable, shareable workflows.
+- Publications are moving away from static print documentation to interactive demonstrations online.
+
+These capabilities depend entirely on the availability of the data underpinning the research.
+
+The FAIR principles were originally articulated by [FORCE11](https://www.force11.org/about),
+an organization founded on the belief that *"semantically enhanced, media-rich digital
+publishing will be more powerful than traditional print media or electronic copies of printed works."*
+
+FAIR principles have been adopted by researchers, publishers, and data repositories affiliated
+with [COPDESS](https://copdess.org/enabling-fair-data-project/), the Coalition on Publishing Data
+in the Earth and Space Sciences.
+Partners include:
+
+- Publications such as *Nature* and *Science*
+- Funding agencies like NASA, USGS, NOAA, and NIH
+- Professional groups like AGU
+
+For a full list of FAIR partners, see the [COPDESS FAIR Data Project](https://copdess.org/enabling-fair-data-project/).
+To view the list of signatories committed to FAIR data, visit the [Statement of Commitment](https://copdess.org/statement-of-commitment/).
+
+---
+
+## FAIR Principles
+
+The following are brief descriptions of the FAIR principles, adapted from [GO FAIR](https://www.go-fair.org/fair-principles/).
+
+### **Findable**
+
+The first step in (re)using data is to find them. [Metadata](metadata.md) and data should be easy
+to find for both humans and computers. Machine-readable metadata are essential for
+automatic discovery of datasets and services.
+
+**Findable characteristics include:**
+
+- Data and metadata are assigned globally unique and persistent identifiers.
+- Data are described by rich metadata that clearly includes the identifier of the data they describe.
+- Data and metadata are registered or indexed in a searchable resource.
+
+### **Accessible**
+
+Once the user finds the required data, they need to know how to access them, including details about authentication and authorization.
+
+**Characteristics of being accessible include:**
+
+- Data and metadata are retrievable using common protocols.
+- The protocol is open and free.
+- Authentication and authorization procedures are applied where necessary.
+- Metadata remain accessible even when data are no longer available.
+
+### **Interoperable**
+
+Data usually need to be integrated with other data. Additionally, data must
+interoperate with applications or workflows for analysis, storage, and processing.
+
+**Interoperable data and metadata:**
+
+- Use a formal, accessible, broadly applicable language for knowledge representation.
+- Use vocabularies that follow FAIR principles.
+- Include qualified references to other data and metadata.
+
+### **Reusable**
+
+The ultimate goal of FAIR is to optimize data reuse. To achieve this, metadata and data
+must be well-described to enable replication and/or combination in different settings.
+
+**Reusable data and metadata:**
+
+- Have clear, accessible data usage licenses.
+- Are associated with detailed provenance.
+- Meet domain-relevant community standards.
+
+---
+
+## How to Apply FAIR Principles
+
+1. **Adopt FAIR-Compliant Practices**:
+   - Assign persistent identifiers to datasets and metadata.
+   - Use rich metadata that describe datasets thoroughly.
+
+2. **Register Metadata**:
+   - Index metadata in searchable repositories to enhance discoverability.
+
+3. **Implement Standards for Access and Interoperability**:
+   - Ensure retrieval protocols are open and free.
+   - Use FAIR-aligned vocabularies and knowledge representation languages.
+
+4. **Provide Reuse Guidance**:
+   - Include detailed provenance information.
+   - Apply clear licenses for data usage.
+
+5. **Collaborate with FAIR Partners**:
+   - Follow practices adopted by FORCE11, COPDESS, and similar organizations.
+
+---
+
+## Useful Links
+
+- [Metadata Overview](metadata.md)
+- [COPDESS FAIR Data Project](https://copdess.org/enabling-fair-data-project/)
+- [Statement of Commitment to FAIR Data](https://copdess.org/statement-of-commitment/)
+- [GO FAIR Principles](https://www.go-fair.org/fair-principles/)
+- [FORCE11](https://www.force11.org/about)
+
+## Acronyms
+
+- **FAIR** = Findable, Accessible, Interoperable, Reusable
+- **FORCE11** = The Future of Research Communication and e-Scholarship
+- **COPDESS** = Coalition on Publishing Data in the Earth and Space Sciences
+- **AGU** = American Geophysical Union
+- **NASA** = National Aeronautics and Space Administration
+- **NIH** = National Institutes of Health
+- **NOAA** = National Oceanic and Atmospheric Administration
+
+Credit: Content taken from a Confluence guide written by Anne Wilson and modified by Shawn Polson in 2019.
\ No newline at end of file
diff --git a/docs/source/data_management/index.rst b/docs/source/data_management/index.rst
index aa55ee4..e0a7f04 100644
--- a/docs/source/data_management/index.rst
+++ b/docs/source/data_management/index.rst
@@ -5,4 +5,6 @@ Data Management
 .. toctree::
    :maxdepth: 1
 
-   file_formats/index
\ No newline at end of file
+   file_formats/index
+   metadata.md
+   fair_principles.md
\ No newline at end of file
diff --git a/docs/source/data_management/metadata.md b/docs/source/data_management/metadata.md
new file mode 100644
index 0000000..be82b40
--- /dev/null
+++ b/docs/source/data_management/metadata.md
@@ -0,0 +1,169 @@
+# Metadata
+
+## Purpose
+
+Metadata supports data science workflows by:
+
+- Ensuring datasets are discoverable and usable by both humans and machines.
+- Meeting internal and external policies for data accessibility and preservation.
+- Enhancing collaboration by providing clear and standardized metadata practices.
+- Contributing to the overall success of projects by enabling proper data usage and interoperability.
+
+## What is Metadata
+
+A dataset generally consists of sets of measured or modeled values. However, the values alone are
+insufficient to understand and use that dataset. Consider this example of a very small dataset:
+
+**Temperature: 31.5**
+
+The data point “Temperature: 31.5” raises many questions:
+
+- Temperature of what?
+- According to whom or what?
+- Collected when/where?
+- Measured or calculated?
+- If calculated, how?
+- What units?
+- To what precision?
+
+To make this dataset FAIR (Findable, Accessible, Interoperable, and Reusable), additional information is needed.
+
+Metadata is information (data) about a dataset. It includes:
+
+- Time and spatial coverages and cadences
+- Units
+- Processing level
+- Data quality
+- Instrument details
+- Principal Investigator
+- Provenance
+- Special alerts, etc.
+
+Ideally, metadata provides all the information necessary to find, understand,
+and use the dataset correctly. Good-quality metadata is critical for data to be FAIR.
+
+## Benefits of Good Quality Metadata
+
+Good-quality, searchable metadata enables people to find data that fits their needs:
+
+- **Good quality**: Sufficient information is provided.
+- **Searchable**: Users can find data by various facets like spatial or temporal coverage.
+
+## Metadata Storage, Formats, and Access
+
+### Storage Options
+
+The best practices for metadata storage include:
+
+1. **Machine-readable metadata** consumable by common tools.
+2. **Publicly accessible metadata** readable by humans.
+3. **Avoid private, inaccessible formats** like personal notebooks or sticky notes.
+
+#### Examples of Metadata Storage
+
+- **Prose embedded in HTML**: Readable by humans but not easily consumable by tools.
+- **Public spreadsheets**: Readable by tools that understand the structure but not widely accessible otherwise.
+- **Self-describing formats**: Examples include:
+  - **NetCDF, HDF, FITS**: Include specific metadata properties like variables, geospatial coverage, and time coverage.
+  - **Header information** in CSV or ASCII tables:
+    - Simple but less machine-readable.
+
+Machine readability often depends on established metadata conventions, such as
+**Climate and Forecast (CF) conventions** used widely in atmospheric science ([More details here](https://www.unidata.ucar.edu/software/netcdf/workshops/most-recent/cf/index.html)).
+
+### LASP Metadata Repository
+
+LASP is developing the **LASP Extended Metadata Repository (LEMR)** to store and access dataset metadata:
+
+- Automates and dynamically accesses essential properties for data services.
+- Plans to extend metadata management capabilities for LASP scientists.
+
+## Metadata Formats
+
+Metadata formats refer to schemas describing the metadata structure. Examples include:
+
+- **[ISO 19115](https://www.fgdc.gov/metadata/iso-standards)**: Geographic information and services.
+- **[SPASE](https://spdf.gsfc.nasa.gov/spdf-documents/SPASE_and_SPDF.html)**: Used in heliophysics.
+
+At LASP, the **laspds schema** is used for applications serving data, with plans to
+integrate with standard schemas like SPASE and ISO 19115.
+
+## What Metadata to Save
+
+### Key Considerations
+
+At project inception:
+
+- Identify essential metadata for understanding and using the dataset.
+- Create a plan to preserve this information.
+
+### Balancing Minimal and Comprehensive Metadata
+
+Repositories often balance between minimal metadata (to lower barriers for participation)
+and sufficient metadata for full dataset understanding. Repositories recognize that providing
+quality metadata takes resources.
+
+- Example: **CU Scholar** requires:
+  - Landing page URL
+  - Names of dataset creators
+  - Title
+  - Publishing organization
+  - Resource type
+
+This information alone would not be sufficient to use a dataset, but it is sufficient
+to allow CU Scholar to serve the dataset. CU Scholar expects additional details
+(e.g., coverages, units, quality indicators) to be available on the landing page or
+via self-describing formats.
+
+## Provenance
+
+The **provenance** of a dataset describes its history and is critical for using datasets correctly:
+
+- Origin of the data
+- Processing methods
+- Calibration and validation details
+- Software versions used
+
+Data producers should record:
+
+- Dataset inputs
+- Processing steps
+- Configuration, calibration, and validation details
+
+Provenance is often provided as descriptive prose, so storing it as machine-readable text is a reasonable option.
+
+**Learn More**: [The Importance of Data Set Provenance for Science](https://eos.org/opinions/the-importance-of-data-set-provenance-for-science).
+
+## Summary of Metadata Workflow
+
+1. **Identify Necessary Metadata**:
+   - At project inception, determine what metadata is essential for understanding and using the dataset.
+2. **Choose the Appropriate Storage Option**:
+   - Use machine-readable formats like NetCDF or HDF where possible.
+   - For simpler use cases, include metadata in file headers or spreadsheets, ensuring structure is clear.
+3. **Follow Metadata Conventions**:
+   - Adhere to standards for machine-readability.
+   - Consult metadata experts when encoding complex datasets.
+4. **Leverage LASP’s Tools**:
+   - Use the **LASP Extended Metadata Repository (LEMR)** for automated and dynamic metadata management if applicable.
+   - Work with LASP administrators to input metadata into LEMR.
+5. **Maintain Provenance**:
+   - Record dataset inputs, processing, calibration, and validation details.
+   - Provide descriptive prose or structured metadata to ensure provenance is clear and traceable.
+
+## Useful Links
+
+- [CF Conventions for NetCDF](https://www.unidata.ucar.edu/software/netcdf/workshops/most-recent/cf/index.html)
+- [The Importance of Dataset Provenance for Science](https://eos.org/opinions/the-importance-of-data-set-provenance-for-science)
+- [NASA DOI Landing Page Requirements](https://wiki.earthdata.nasa.gov/display/DOIsforEOSDIS/DOI+Landing+Page)
+- [CU Scholar Metadata Requirements](https://scholar.colorado.edu/faq)
+
+## Acronyms
+
+- **CF** = Climate and Forecast
+- **FAIR** = Findable, Accessible, Interoperable, and Reusable
+- **ISO** = International Organization for Standardization
+- **LEMR** = LASP Extended Metadata Repository
+- **SPASE** = Space Physics Archive Search and Extract
+
+Credit: Content taken from a Confluence guide written by Anne Wilson and Shawn Polson.
\ No newline at end of file