Add guidelines on metadata and FAIR principles #42
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,129 @@ | ||
| # The FAIR Principles | ||
|
|
||
| The FAIR Principles are a *"set of standards that connects researchers, publishers, | ||
| and data repositories in Earth, space, and environmental sciences to ... accelerate | ||
| scientific discovery and enhance the integrity, transparency, and reproducibility of | ||
| scientific data on a large scale"* ([COPDESS](https://copdess.org/enabling-fair-data-project/)). | ||
| Essentially, scientific data should be **Findable, Accessible, Interoperable, and Reusable.** | ||
|
|
||
| ## Why FAIR Principles Matter | ||
|
|
||
| More important than knowing *what* the FAIR goals are is understanding *why* they matter. | ||
| The nature of performing science is changing, including shifts in scientific publication | ||
| and peer review. Key changes include: | ||
|
|
||
| - Scientific analyses are being encoded in repeatable, shareable workflows. | ||
| - Publications are moving away from static print documentation to interactive demonstrations online. | ||
|
|
||
| These capabilities rely completely on the availability of the data underpinning the research. | ||
|
|
||
| The FAIR principles were originally articulated by [FORCE11](https://www.force11.org/about), | ||
| an organization founded on the belief that *"semantically enhanced, media-rich digital | ||
| publishing will be more powerful than traditional print media or electronic copies of printed works."* | ||
|
|
||
| FAIR principles have been adopted by researchers, publishers, and data repositories affiliated | ||
| with [COPDESS](https://copdess.org/enabling-fair-data-project/), the Coalition on Publishing Data | ||
| in the Earth and Space Sciences. Partners include: | ||
|
|
||
| - Publications such as *Nature* and *Science* | ||
| - Funding agencies like NASA, USGS, NOAA, and NIH | ||
| - Professional groups like AGU | ||
|
|
||
| For a full list of FAIR partners, see the [COPDESS FAIR Data Project](https://copdess.org/enabling-fair-data-project/). | ||
| To view the list of signatories committed to FAIR data, visit the [Statement of Commitment](https://copdess.org/statement-of-commitment/). | ||
|
|
||
| --- | ||
|
|
||
| ## FAIR Principles | ||
|
|
||
| The following are summarized descriptions of the FAIR principles, | ||
| adapted from [GO FAIR](https://www.go-fair.org/fair-principles/). | ||
|
|
||
| ### **Findable** | ||
|
|
||
| The first step in (re)using data is to find them. [Metadata](metadata.md) and data should be easy | ||
| to find for both humans and computers. Machine-readable metadata are essential for | ||
| automatic discovery of datasets and services. | ||
|
|
||
| **Findable characteristics include:** | ||
|
|
||
| - Data and metadata are assigned globally unique and persistent identifiers. | ||
| - Data are described by rich metadata that clearly includes the identifier of the data they describe. | ||
| - Data and metadata are registered or indexed in a searchable resource. | ||
|
|
||
| ### **Accessible** | ||
|
|
||
| Once the user finds the required data, they need to know how to access them, including details about authentication and authorization. | ||
|
|
||
| **Characteristics of being accessible include:** | ||
|
|
||
| - Data and metadata are retrievable using common protocols. | ||
| - The protocol is open and free. | ||
| - Authentication and authorization procedures are applied where necessary. | ||
| - Metadata remain accessible even when data are no longer available. | ||
|
|
||
| ### **Interoperable** | ||
|
|
||
| Data usually need to be integrated with other data. Additionally, data must | ||
| interoperate with applications or workflows for analysis, storage, and processing. | ||
|
|
||
| **Interoperable data and metadata:** | ||
|
|
||
| - Use a formal, accessible, broadly applicable language for knowledge representation. | ||
| - Use vocabularies that follow FAIR principles. | ||
| - Include qualified references to other data and metadata. | ||
|
|
||
| ### **Reusable** | ||
|
|
||
| The ultimate goal of FAIR is to optimize data reuse. To achieve this, metadata and data | ||
| must be well-described to enable replication and/or combination in different settings. | ||
|
|
||
| **Reusable data and metadata:** | ||
|
|
||
| - Have clear, accessible data usage licenses. | ||
| - Are associated with detailed provenance. | ||
| - Meet domain-relevant community standards. | ||
|
|
||
| --- | ||
|
|
||
| ## How to Apply FAIR Principles | ||
|
|
||
| 1. **Adopt FAIR-Compliant Practices**: | ||
| - Assign persistent identifiers to datasets and metadata. | ||
| - Use rich metadata that describe datasets thoroughly. | ||
|
|
||
| 2. **Register Metadata**: | ||
| - Index metadata in searchable repositories to enhance discoverability. | ||
|
|
||
| 3. **Implement Standards for Access and Interoperability**: | ||
| - Ensure retrieval protocols are open and free. | ||
| - Use FAIR-aligned vocabularies and knowledge representation languages. | ||
|
|
||
| 4. **Provide Reuse Guidance**: | ||
| - Include detailed provenance information. | ||
| - Apply clear licenses for data usage. | ||
|
|
||
| 5. **Collaborate with FAIR Partners**: | ||
| - Follow practices adopted by FORCE11, COPDESS, and similar organizations. | ||
|
Comment on lines +91 to +107
Member
It is strange that the bulleted lists in the previous sections are all indented in the final product on the RtD page (which looks nice!) but these are not. I'm not sure why 🤔
Collaborator, Author
I think it might be an issue with mixing ordered lists with unordered lists. When I indent the bullets more, I get the following markdown lint error
||
|
|
||
| --- | ||
|
|
||
| ## Useful Links | ||
|
|
||
| - [Metadata Overview](metadata.md) | ||
| - [COPDESS FAIR Data Project](https://copdess.org/enabling-fair-data-project/) | ||
| - [Statement of Commitment to FAIR Data](https://copdess.org/statement-of-commitment/) | ||
| - [GO FAIR Principles](https://www.go-fair.org/fair-principles/) | ||
| - [FORCE11](https://www.force11.org/about) | ||
|
|
||
| ## Acronyms | ||
|
|
||
| - **FAIR** = Findable, Accessible, Interoperable, Reusable | ||
| - **FORCE11** = The Future of Research Communication and e-Scholarship | ||
| - **COPDESS** = Coalition on Publishing Data in the Earth and Space Sciences | ||
| - **AGU** = American Geophysical Union | ||
| - **NASA** = National Aeronautics and Space Administration | ||
| - **NIH** = National Institutes of Health | ||
| - **NOAA** = National Oceanic and Atmospheric Administration | ||
| - **USGS** = United States Geological Survey | ||
|
|
||
| Credit: Content taken from a Confluence guide written by Anne Wilson, and modified by Shawn Polson in 2019 | ||
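The characteristics listed under each FAIR principle in `fair_principles.md` can be read as a checklist. The following is a minimal sketch of that idea, not part of the PR: the `DatasetRecord` fields, the `fair_gaps` helper, the set of protocols treated as open, and the placeholder DOI are all assumptions made for illustration, and a real assessment would need far richer checks.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; the field names are illustrative only."""
    identifier: str = ""        # persistent identifier such as a DOI (Findable)
    indexed_in: list = field(default_factory=list)  # searchable resources (Findable)
    access_protocol: str = ""   # retrieval protocol (Accessible)
    vocabulary: str = ""        # shared vocabulary or convention (Interoperable)
    license: str = ""           # data usage license (Reusable)
    provenance: str = ""        # history of the dataset (Reusable)


# Assumption: protocols treated as "open and free" for this sketch.
OPEN_PROTOCOLS = {"http", "https", "ftp"}


def fair_gaps(record):
    """Return a dict mapping each FAIR principle to the checks the record fails."""
    gaps = {"Findable": [], "Accessible": [], "Interoperable": [], "Reusable": []}
    if not record.identifier:
        gaps["Findable"].append("no persistent identifier")
    if not record.indexed_in:
        gaps["Findable"].append("not registered in a searchable resource")
    if record.access_protocol.lower() not in OPEN_PROTOCOLS:
        gaps["Accessible"].append("no open, free retrieval protocol")
    if not record.vocabulary:
        gaps["Interoperable"].append("no shared vocabulary or convention named")
    if not record.license:
        gaps["Reusable"].append("no data usage license")
    if not record.provenance:
        gaps["Reusable"].append("no provenance")
    # Keep only the principles that have at least one failed check.
    return {principle: missing for principle, missing in gaps.items() if missing}


# Placeholder values; "doi:10.0000/example" is not a real identifier.
partial = DatasetRecord(identifier="doi:10.0000/example", license="CC-BY-4.0")
for principle, missing in fair_gaps(partial).items():
    print(principle, "->", "; ".join(missing))
```

Returning the failed checks grouped by principle, rather than a single pass/fail flag, keeps the result aligned with the four headings of the guideline.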
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -5,4 +5,6 @@ Data Management | |
| .. toctree:: | ||
| :maxdepth: 1 | ||
|
|
||
| file_formats/index | ||
| file_formats/index | ||
| metadata.md | ||
| fair_principles.md | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,169 @@ | ||
| # Metadata | ||
|
|
||
| ## Purpose | ||
|
|
||
| Metadata supports data science workflows by: | ||
|
|
||
| - Ensuring datasets are discoverable and usable by both humans and machines. | ||
| - Meeting internal and external policies for data accessibility and preservation. | ||
| - Enhancing collaboration by providing clear and standardized metadata practices. | ||
| - Contributing to the overall success of projects by enabling proper data usage and interoperability. | ||
|
|
||
| ## What is Metadata | ||
|
|
||
| A dataset generally consists of sets of measured or modeled values. However, the values alone are | ||
| insufficient to understand and use that dataset. Consider this example of a very small dataset: | ||
|
|
||
| **Temperature: 31.5** | ||
|
|
||
| The data point “Temperature: 31.5” raises many questions: | ||
|
|
||
| - Temperature of what? | ||
| - According to whom or what? | ||
| - Collected when/where? | ||
| - Measured or calculated? | ||
| - If calculated, how? | ||
| - What units? | ||
| - To what precision? | ||
|
|
||
| To make this dataset FAIR (Findable, Accessible, Interoperable, and Reusable), additional information is needed. | ||
|
|
||
| Metadata is information (data) about a dataset. It includes: | ||
|
|
||
| - Time and spatial coverages and cadences | ||
| - Units | ||
| - Processing level | ||
| - Data quality | ||
| - Instrument details | ||
| - Principal Investigator | ||
| - Provenance | ||
| - Special alerts, etc. | ||
|
|
||
| Ideally, metadata provides all the information necessary to find, understand, | ||
| and use the dataset correctly. Good quality metadata is critical for data to be FAIR. | ||
|
|
||
| ## Benefits of Good Quality Metadata | ||
|
|
||
| Good quality, searchable metadata enables people to find data that fits their needs: | ||
|
|
||
| - **Good quality**: Sufficient information is provided. | ||
| - **Searchable**: Users can find data by various facets like spatial or temporal coverage. | ||
|
|
||
| ## Metadata Storage, Formats, and Access | ||
|
|
||
| ### Storage Options | ||
|
|
||
| The best practices for metadata storage include: | ||
|
|
||
| 1. **Machine-readable metadata** that common tools can consume. | ||
| 2. **Publicly accessible metadata** that humans can read. | ||
| 3. **No private, inaccessible formats** such as personal notebooks or sticky notes. | ||
|
|
||
| #### Examples of Metadata Storage | ||
|
|
||
| - **Prose embedded in HTML**: Readable by humans but not easily consumable by tools. | ||
| - **Public spreadsheets**: Readable by tools that understand the structure but not widely accessible otherwise. | ||
| - **Self-describing formats**: Examples include: | ||
| - **NetCDF, HDF, FITS**: Include specific metadata properties like variables, geospatial coverage, and time coverage. | ||
| - **Header information** in CSV or ASCII tables: | ||
| - Simple but less machine-readable. | ||
|
|
||
| Machine readability often depends on established metadata conventions, such as | ||
| **Climate and Forecast (CF) conventions** used widely in atmospheric science ([More details here](https://www.unidata.ucar.edu/software/netcdf/workshops/most-recent/cf/index.html)). | ||
|
|
||
| ### LASP Metadata Repository | ||
|
|
||
| LASP is developing the **LASP Extended Metadata Repository (LEMR)** to store and access dataset metadata: | ||
|
Member
Cool! I didn't know about this.
Collaborator, Author
I didn't either! So hopefully this information isn't out of date.
||
|
|
||
| - Automates and dynamically accesses essential properties for data services. | ||
| - Plans to extend metadata management capabilities for LASP scientists. | ||
|
|
||
| ## Metadata Formats | ||
|
|
||
| Metadata formats refer to schemas describing the metadata structure. Examples include: | ||
|
|
||
| - **[ISO 19115](https://www.fgdc.gov/metadata/iso-standards)**: Geographic information and services. | ||
| - **[SPASE](https://spdf.gsfc.nasa.gov/spdf-documents/SPASE_and_SPDF.html)**: Used in Heliophysics. | ||
|
|
||
| At LASP, the **laspds schema** is used for applications serving data, with plans to | ||
| integrate with standard schemas like SPASE and ISO 19115. | ||
|
|
||
| ## What Metadata to Save | ||
|
|
||
| ### Key Considerations | ||
|
|
||
| At project inception: | ||
|
|
||
| - Identify essential metadata for understanding and using the dataset. | ||
| - Create a plan to preserve this information. | ||
|
|
||
| ### Balancing Minimal and Comprehensive Metadata | ||
|
|
||
| Repositories often balance between minimal metadata (to lower barriers for participation) | ||
| and sufficient metadata for full dataset understanding. Repositories recognize that providing | ||
| quality metadata takes resources. | ||
|
|
||
| - Example: **CU Scholar** requires: | ||
| - Landing page URL | ||
| - Names of dataset creators | ||
| - Title | ||
| - Publishing organization | ||
| - Resource type | ||
|
|
||
| This information alone would not be sufficient to use a dataset, but it is sufficient | ||
| to allow CU Scholar to serve the dataset. CU Scholar expects additional details | ||
| (e.g., coverages, units, quality indicators) to be available on the landing page or | ||
| via self-describing formats. | ||
|
|
||
| ## Provenance | ||
|
|
||
| The **provenance** of a dataset describes its history and is critical for using datasets correctly: | ||
|
|
||
| - Origin of the data | ||
| - Processing methods | ||
| - Calibration and validation details | ||
| - Software versions used | ||
|
|
||
| Data producers should record: | ||
|
|
||
| - Dataset inputs | ||
| - Processing steps | ||
| - Configuration, calibration, and validation details | ||
|
|
||
| Provenance is often provided as descriptive prose, so storing it as plain text that tools can read and search is a reasonable option. | ||
|
|
||
| **Learn More**: [The Importance of Data Set Provenance for Science](https://eos.org/opinions/the-importance-of-data-set-provenance-for-science). | ||
|
|
||
| ## Summary of Metadata Workflow | ||
|
|
||
| 1. **Identify Necessary Metadata**: | ||
| - At project inception, determine what metadata is essential for understanding and using the dataset. | ||
| 2. **Choose the Appropriate Storage Option**: | ||
| - Use machine-readable formats like NetCDF or HDF where possible. | ||
| - For simpler use cases, include metadata in file headers or spreadsheets, ensuring structure is clear. | ||
| 3. **Follow Metadata Conventions**: | ||
| - Adhere to standards for machine-readability. | ||
| - Consult metadata experts when encoding complex datasets. | ||
| 4. **Leverage LASP’s Tools**: | ||
| - Use the **LASP Extended Metadata Repository (LEMR)** for automated and dynamic metadata management if applicable. | ||
| - Work with LASP administrators to input metadata into LEMR. | ||
| 5. **Maintain Provenance**: | ||
| - Record dataset inputs, processing, calibration, and validation details. | ||
| - Provide descriptive prose or structured metadata to ensure provenance is clear and traceable. | ||
|
|
||
| ## Useful Links | ||
|
|
||
| - [CF Conventions for NetCDF](https://www.unidata.ucar.edu/software/netcdf/workshops/most-recent/cf/index.html) | ||
| - [The Importance of Data Set Provenance for Science](https://eos.org/opinions/the-importance-of-data-set-provenance-for-science) | ||
| - [NASA DOI Landing Page Requirements](https://wiki.earthdata.nasa.gov/display/DOIsforEOSDIS/DOI+Landing+Page) | ||
| - [CU Scholar Metadata Requirements](https://scholar.colorado.edu/faq) | ||
|
|
||
| ## Acronyms | ||
|
|
||
| - **CF** = Climate and Forecast | ||
| - **FAIR** = Findable, Accessible, Interoperable, and Reusable | ||
| - **ISO** = International Organization for Standardization | ||
| - **LEMR** = LASP Extended Metadata Repository | ||
| - **SPASE** = Space Physics Archive Search and Extract | ||
|
|
||
| Credit: Content taken from a Confluence guide written by Anne Wilson and Shawn Polson. | ||
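The "Temperature: 31.5" example in `metadata.md` can be made concrete by pairing the value with answers to the questions it raises. The sketch below is one possible way to do that, assuming a hypothetical `Measurement` record serialized to JSON. Every field name and every value other than 31.5 is an invented placeholder, and this is not the laspds schema or any LASP format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Measurement:
    """Hypothetical record pairing one value with the metadata needed to use it."""
    variable: str      # what quantity is reported
    value: float
    units: str         # what units?
    subject: str       # temperature of what?
    source: str        # according to whom or what?
    time_utc: str      # collected when?
    location: str      # collected where?
    method: str        # measured or calculated?
    precision: float   # to what precision, expressed in `units`


def to_json(measurement):
    """Serialize the record as machine-readable JSON with stable key order."""
    return json.dumps(asdict(measurement), indent=2, sort_keys=True)


# All values except 31.5 are placeholders invented for this sketch.
example = Measurement(
    variable="temperature",
    value=31.5,
    units="degC",
    subject="hypothetical instrument housing",
    source="hypothetical thermistor T1",
    time_utc="2019-01-01T00:00:00Z",
    location="hypothetical lab bench",
    method="measured",
    precision=0.1,
)

print(to_json(example))
```

JSON is used here only because it is readable by both people and common tools; a self-describing format such as NetCDF would carry the same information as variable and global attributes.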
Since metadata is mentioned in several places in this document, it might be good to cross-reference your `metadata.md` page somewhere in here.