diff --git a/.markdownlint.json b/.markdownlint.json new file mode 100644 index 0000000..8d238b9 --- /dev/null +++ b/.markdownlint.json @@ -0,0 +1,5 @@ +{ + "MD007": { "indent": 4 }, + "no-hard-tabs": false, + "MD013": false +} \ No newline at end of file diff --git a/docs/index.md b/docs/index.md index 8e3103f..a77dfa6 100644 --- a/docs/index.md +++ b/docs/index.md @@ -15,7 +15,7 @@ Check out our guides to get your project off on the right foot! - [The Hugging Face Repo Guide](wiki-guide/Hugging-Face-Repo-Guide.md): Analogous expected and suggested repository contents for Hugging Face repositories; there are notable differences from GitHub in both content and structure. -- [Metadata Guide](wiki-guide/Metadata-Guide.md): Guide to metadata collection and documentation. This closely follows our [HF Dataset Card Template](wiki-guide/HF_DatasetCard_Template_mkdocs.md) sections. +- [FAIR Guide](wiki-guide/FAIR-Guide.md): Guide to producing FAIR digital products, from metadata collection through product documentation and publication. This builds on the content in both the GitHub and Hugging Face Repository Guides, providing checklists to ensure [code](wiki-guide/Code-Checklist.md), [data](wiki-guide/Data-Checklist.md), and [model](wiki-guide/Model-Checklist.md) repositories are FAIR. The latter two closely follow our [HF Templates](wiki-guide/About-Templates.md). ### Project repo up, what's next? Check out our workflow guides for how to interact with your new repo: diff --git a/docs/wiki-guide/Code-Checklist.md b/docs/wiki-guide/Code-Checklist.md new file mode 100644 index 0000000..b3c3569 --- /dev/null +++ b/docs/wiki-guide/Code-Checklist.md @@ -0,0 +1,107 @@ +# Code Checklist + +This checklist provides an overview of essential and recommended elements to include in a GitHub repository to ensure that it conforms to FAIR principles and best practices for reproducibility. 
Along with the generation of a DOI (see [DOI Generation](DOI-Generation.md) and [Digital Products Release and Licensing Policy](Digital-products-release-licensing-policy.md)), following this checklist ensures compliance with the FAIR Principles for research software.[^1] +[^1]: Barker, M., Chue Hong, N. P., Katz, D. S., Lamprecht, A. L., Martinez-Ortiz, C., Psomopoulos, F., Harrow, J., Castro, L. J., Gruenpeter, M., Martinez, P. A., & Honeyman, T. (2022). Introducing the FAIR Principles for research software. _Scientific data_, 9(1), 622. [URL](https://doi.org/10.1038/s41597-022-01710-x). + +!!! tip "Pro tip" + + Use the eye icon at the top of this page to access the source and copy the markdown for the checklist below into an issue on your GitHub [Repo](GitHub-Repo-Guide.md) or [Project](Guide-to-GitHub-Projects.md) so you can check the boxes as you add each element to your GitHub repository. + +## Required Files + +- [ ] **License**: Verify and include an appropriate license (e.g., `MIT`, `CC0-1.0`, etc.). See discussion in the [Repo Guide](GitHub-Repo-Guide.md/#license). +- [ ] **README File**: Following the [Repo Guide](GitHub-Repo-Guide.md/#readme), provide a detailed `README.md` with: + - [ ] Overview of the project. + - [ ] Installation instructions. + - [ ] Basic usage examples. + - [ ] Links to related/created dataset(s). + - [ ] Links to related/created model(s). + - [ ] Acknowledge source code dependencies and contributors. + - [ ] Reference related datasets used in training or evaluation. +- [ ] **Requirements File**: Provide a [file detailing software requirements](GitHub-Repo-Guide.md/#software-requirements-file), such as a `requirements.txt` or `pyproject.toml` for Python dependencies. 
+- [ ] **Gitignore File**: GitHub has premade `.gitignore` files ([here](https://github.com/github/gitignore)) tailored to particular languages (e.g., [R](https://github.com/github/gitignore/blob/main/R.gitignore) or [Python](https://github.com/github/gitignore/blob/main/Python.gitignore)), operating systems, etc. +- [ ] **CITATION.cff File**: This facilitates citation of your work; follow the guidance provided in the [Repo Guide](GitHub-Repo-Guide.md/#citation). + +### Data-Related + +- [ ] Preprocessing code. +- [ ] Description of dataset(s), including description of training and testing sets (with links to relevant portions of the dataset card, which will have more information). + +### Model-Related + +- [ ] Training code. +- [ ] Inference/evaluation code. +- [ ] Model weights (if not in Hugging Face model repository). +- [ ] Description of model(s)/benchmark(s). +- [ ] Explanation of training and testing (with links to relevant portions of the model card, which will have more information). + +!!! note +    The [bioclip GitHub repository](https://github.com/Imageomics/bioclip) provides an example of incorporating data- and model-related code into a GitHub repository as published open-source code for both data and model development. + +## General Information + +- [ ] **Repository Structure**: Ensure the code repository follows a clear and logical directory structure. (See [Repo Guide](GitHub-Repo-Guide.md/#general-repository-structure).) +- [ ] **Code Comments**: Include meaningful inline comments and function descriptions for clarity. +- [ ] **Random Seed Control**: Save seed(s) for random number generator(s) to ensure reproducible results. + +## Security Considerations + +- [ ] **Sensitive Data Handling**: Ensure no hardcoded sensitive information (e.g., API keys, credentials) is included in your repository. These can be shared through a config file on OSC. + +!!! note +    The best practices described below will help you meet the above requirements.
The more advanced development practices noted further down are included for educational purposes and are highly recommended—though these may go beyond what is expected for a given project, we advise collaborators to at least have a discussion about the topics covered in [Code Quality](#code-quality) and whether other practices discussed would be appropriate for their project. + +--- + +## Best Practices + +The [Repo Guide](GitHub-Repo-Guide.md/) provides general guidance on repository structure, [collaborative workflow](The-GitHub-Workflow.md/), and [how to make and review pull requests (PRs)](The-GitHub-Pull-Request-Guide.md/). Below, we highlight some best practices in checklist form to help you meet the requirements described above for a FAIR and Reproducible project. + +### Reproducibility + +- **Version Control**: Use Git for version control and commit regularly. +- **Modularization**: Structure code into reusable and independent modules. +- **Code Execution**: Provide notebooks to demonstrate how to reproduce results. + +### Code Review & Maintenance + +- **Code Reviews**: Regular peer reviews for quality assurance. Refer to the [GitHub PR Review Guide](The-GitHub-Pull-Request-Guide.md/#2-review-a-pull-request). +- **Issue Tracking**: Use GitHub issues for tracking bugs and feature requests. +- **Versioning**: Tag releases; changelogs can be auto-generated and informative when PRs are appropriately scoped. + +### Installation and Dependencies + +- [ ] **Environment Setup**: Include setup instructions (e.g., `conda` environment file, `Dockerfile`). +- [ ] **Dependency Management**: Use virtual environments and the frameworks that manage them (e.g., `venv`, `conda`, `uv` for Python) to isolate dependencies. + +--- + +## More Advanced Development + +### Documentation + +- [ ] **API Documentation**: Generate API documentation (e.g., [`MkDocs`](https://www.mkdocs.org) for Python or wiki pages in the repo).
+- [ ] **Docstrings**: Add comprehensive docstrings for all functions, classes, and modules. These can be incorporated to help generate documentation. Note that generative AI tools with access to your code, such as GitHub Copilot, can be quite accurate in generating these, especially if you are using type annotations. +- [ ] **Example Scripts**: Include example scripts for common use cases. +- [ ] **Configuration Files**: Use `yaml`, `json`, or `ini` for configuration settings. + +### Code Quality + +- [ ] **Consistent Style**: Follow coding style guidelines (e.g., `PEP 8` for Python). +- [ ] **Linting**: Ensure the code passes a linter (e.g., `Ruff` for Python). +- [ ] **Logging**: Use logging instead of print statements for better debugging (e.g., `logging` in Python). +- [ ] **Error Handling**: Implement robust exception handling to avoid crashes or incorrect results from unexpected input. + +### Testing + +- [ ] **Unit Tests**: Write unit tests to validate core functionality. +- [ ] **Integration Tests**: Ensure components work together correctly. +- [ ] **Test Coverage**: Check test coverage, e.g., using [Coverage](https://coverage.readthedocs.io/). +- [ ] **Continuous Integration (CI)**: Set up CI/CD pipelines (e.g., [GitHub Actions](https://docs.github.com/en/actions)) for automated testing. + +### Code Distribution & Deployment + +- [ ] **Packaging**: Provide installation instructions (e.g., `setup.py`, `hatch`, `poetry`, `uv` for Python). +- [ ] **Deployment Guide**: Document deployment procedures. + +!!! 
question "[Questions, Comments, or Concerns?](https://github.com/Imageomics/Imageomics-guide/issues)" diff --git a/docs/wiki-guide/DOI-Generation.md b/docs/wiki-guide/DOI-Generation.md index 9430c9b..281f983 100644 --- a/docs/wiki-guide/DOI-Generation.md +++ b/docs/wiki-guide/DOI-Generation.md @@ -1,31 +1,28 @@ # DOI Generation This guide discusses DOI generation for digital artifacts that may be associated with publications, such as datasets, models, and software. -You are likely familiar with DOIs from citing (journal/arXiv/conference) papers, for which they are generated by the publisher and regularly used in citations. However, they are also invaluable for proper citation of code, models, and data. One may think of this in the manner they are handled on arXiv, where there are options for "Cite as:" or "for this version" (with the "v#" at the end) option when citing a preprint. +You are likely familiar with DOIs from citing (journal/arXiv/conference) papers, for which they are generated by the publisher and regularly used in citations. However, they are also invaluable for proper citation of code, models, and data. Similar to how DOIs help track different versions of preprints on repositories like arXiv, they can provide persistent identification and versioning for your research artifacts beyond traditional publications. ## What is a DOI? -A DOI (Digital Object Identifier) is a _persistent_ (permanent) digital identifier for any object (data, model, code, etc.) that _uniquely_ distinguishes it from other objects and links to information—metadata—about the object. The International DOI Foundation (IDF) is responsible for developing and administering the DOI system. See their [What is a DOI](https://www.doi.org/the-identifier/what-is-a-doi/) article for more information. - +A DOI (Digital Object Identifier) is a _persistent_ (permanent) digital identifier for any object (data, model, code, etc.) 
that _uniquely_ distinguishes it from other objects and links to information—metadata—about the object. The International DOI Foundation (IDF) is responsible for developing and administering the DOI system. See their [What is a DOI?](https://www.doi.org/the-identifier/what-is-a-doi/) article for more information. ## How do you generate a DOI? When publishing code, data, or models, there are various options for DOI generation, and selecting one is generally dependent on where the object of interest is published. We will go over the two standard methods used by the Institute here, and we mention a third option for completeness. A comparison of these three options is provided in the [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf). - ### 1. Generate a DOI on Hugging Face -This is the simplest method for generating a DOI for a model or dataset since [Hugging Face partnered with DataCite to offer this option](https://huggingface.co/blog/introducing-doi). +This is the simplest method for generating a DOI for a model or dataset since [Hugging Face partnered with DataCite to offer this option](https://huggingface.co/blog/introducing-doi). !!! warning "Warning" - Though it is a very simple process, it is not one to be taken lightly, as there is no removing data once this has been done--any changes require generation of a ***new*** DOI for the updated version: the old version will be maintained in perpetuity! + Though it is a very simple process, it is not one to be taken lightly, as there is no removing data once this has been done--any changes require generation of a _**new**_ DOI for the updated version: the old version will be maintained in perpetuity! !!! 
warning "Warning" As stated in the [Imageomics Digital Products Release and Licensing Policy](Digital-products-release-licensing-policy.md), DOIs are not to be generated for Imageomics Organization Repositories until approval has been granted by the Senior Data Scientist or Institute Leadership. Hugging Face allows for the generation of a DOI through the settings tab on the Model or Dataset. For details on _how_ to generate a DOI with Hugging Face, please see the [Hugging Face DOI Documentation](https://huggingface.co/docs/hub/doi). - ### 2. Generate a DOI with Zenodo This is the most common method used for generating a DOI for a GitHub repository, because [Zenodo](https://zenodo.org/) has a [GitHub integration](https://zenodo.org/account/settings/github/), which is accessed through your Zenodo account settings (for more information, please see [GitHub's associated Docs](https://docs.github.com/articles/referencing-and-citing-content)). Zenodo can also be used to generate DOIs for data, as is relatively common in biology. However, for direct use of ML models and datasets, there are many more advantages to using Hugging Face; please see the [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf) for more information.[^1] @@ -38,11 +35,11 @@ When your GitHub and Zenodo accounts are linked, there will be a list of availab ![Zenodo instructions and enabled repos](images/doi-generation/enabled_repos+intstructions.png){ loading=lazy, width="800" } !!! info "The Sync now button" - There is a "Sync now" button at the top right of the instructions, with information on when the last sync occurred. Observe that a badge appears for the enabled repository that _has_ a DOI, while the one without just shows up as enabled; this will also be true for repositories to which you have access but that you did not submit to Zenodo yourself. 
+    There is a "Sync now" button at the top right of the instructions, with information on when the last sync occurred. Observe that a badge appears for the enabled repository that **_has_** a DOI, while the one without just shows up as enabled; this will also be true for repositories to which you have access but that you did not submit to Zenodo yourself. #### Metadata Tracking -When automatically generating a DOI with Zenodo, it uses information provided in your `CITATION.cff` file to populate the metadata for the record. However, there is important information that is not supported through this integration despite its inclusion in the `CITATION.cff` format in some cases. +When Zenodo automatically generates a DOI, it uses the information provided in your `CITATION.cff` file to populate the metadata for the record. However, some important information is not supported through this integration, even when it is included in your `CITATION.cff`. If your repository is likely to be updated repeatedly (i.e., generating new releases), then you may consider adding a `.zenodo.json` to preserve the remaining metadata on release sync with Zenodo for DOI. This metadata includes grant (funding) information, references (which may be included in your `CITATION.cff`), and a description of your repository/code. @@ -70,8 +67,8 @@ Building on the alternate edit options, there is also the option to simply gener When creating a new record on Zenodo, please ensure that other members of your project have access, as appropriate. In particular, there should be at least one member of Institute leadership or the Senior Data Scientist added to the record with management permissions. This ensures the ability to maintain the metadata and address matters related to the record (which may extend beyond your tenure with the Institute) in a timely manner. - ### 3. 
Generate a DOI with Dryad [Dryad](https://datadryad.org/stash/about) is another research data repository, similar to Zenodo, through which one can archive digital objects (such as, but not limited to, data) supporting scholarly publications, and obtain a DOI. It has a review process when depositing data and requires dedication to the public domain (CC0) of all digital objects uploaded. Imageomics through OSU is a member organization of Dryad, reducing or eliminating data deposit charge(s). To determine whether Dryad is a suitable archive for Institute data products supporting your publication, please consider the [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf) for more information, and consult with the Institute's Senior Data Scientist.[^1] +!!! question "[Questions, Comments, or Concerns?](https://github.com/Imageomics/Imageomics-guide/issues)" diff --git a/docs/wiki-guide/Data-Checklist.md b/docs/wiki-guide/Data-Checklist.md new file mode 100644 index 0000000..15fb52f --- /dev/null +++ b/docs/wiki-guide/Data-Checklist.md @@ -0,0 +1,110 @@ +# Dataset Card Checklist + +Below is a checklist encompassing all sections of a dataset card. Review notes and guidance provided in the full [dataset card template](HF_DatasetCard_Template_mkdocs.md/) for more details. + +!!! tip "Pro tip" + + Use the eye icon at the top of this page to access the source and copy the markdown for the checklist below into an issue on your GitHub [Repo](GitHub-Repo-Guide.md) or [Project](Guide-to-GitHub-Projects.md) so you can check the boxes as you add each element to your [dataset card](HF_DatasetCard_Template_mkdocs.md). + +## General Information + +- [ ] **License**: Verify and specify the license type (e.g., `cc0-1.0`). +- [ ] **Language**: Indicate the language(s) (e.g., `en`). +- [ ] **Pretty Name**: Provide a descriptive name for the dataset. 
+- [ ] **Task Categories**: List relevant task categories (e.g., image-classification). Refer to [the coarse-grained taxonomy of task categories as well as subtasks in this file](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/pipelines.ts). +- [ ] **Tags**: Include relevant tags (e.g., `biology`, `image`, `animals`, `CV`). +- [ ] **Size Categories**: Specify dataset size (e.g., `n<1K`, `1K<n<10K`). + +--- + +## Dataset Structure + +- [ ] **Data Instances**: Describe the data instances, for example: images are named `<species>_<index>.png`, each within a folder named for the species. They are 1024 x 1024, and the color has been standardized using `<method>`. +- [ ] **Data Fields**: Describe the types of the data files or the columns in a CSV with metadata ([example](HF_DatasetCard_Template_mkdocs.md/#__codelineno-0-114)). +- [ ] **Data Splits**: Describe any splits (e.g., train, test, validation). + +--- + +## Dataset Creation + +Refer to examples and explanations provided in the full [dataset card template](HF_DatasetCard_Template_mkdocs.md/#__codelineno-0-129). + +- [ ] **Curation Rationale**: Explain why this dataset was created. +- [ ] **Source Data**: Describe the source data. +    - [ ] **Data Collection and Processing**: Describe data creation, selection, filtering, normalization, and tools used. +    - [ ] **Source Data Producers**: List original data producers or sources. +- [ ] **Annotations**: Include details on annotations. +    - [ ] **Annotation Process**: Describe the process and tools used. +    - [ ] **Annotators**: List the annotators if applicable. +- [ ] **Personal and Sensitive Information**: Indicate any sensitive information in the dataset. + +--- + +## Considerations for Using the Data + +There are several things to consider while working with the dataset that should be reported to users. For instance, if hybrids are labeled in a `hybrid_stat` column of the metadata file, users can obtain a hybrid-free subset by selecting all instances where `hybrid_stat` is _not_ "hybrid". 
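As a concrete sketch of the subsetting example above (the file names, species, and `hybrid_stat` values here are illustrative, standing in for a loaded metadata file):

```python
# Illustrative rows standing in for a loaded metadata file; the
# field names and values are hypothetical examples.
metadata = [
    {"filename": "img_001.png", "species": "erato", "hybrid_stat": "non-hybrid"},
    {"filename": "img_002.png", "species": "melpomene", "hybrid_stat": "hybrid"},
    {"filename": "img_003.png", "species": "erato", "hybrid_stat": "non-hybrid"},
]

# Keep every instance whose `hybrid_stat` is NOT "hybrid".
no_hybrids = [row for row in metadata if row["hybrid_stat"] != "hybrid"]
print([row["filename"] for row in no_hybrids])  # ['img_001.png', 'img_003.png']
```

With `pandas`, the equivalent filter is `df[df["hybrid_stat"] != "hybrid"]`.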
+- [ ] **Bias, Risks, and Limitations**: Describe any known issues with the dataset; for instance, note whether your data exhibits a long-tailed distribution (and why). +- [ ] **Recommendations**: Provide recommendations for using the dataset responsibly. +- [ ] **Reporting Issues**: Provide a link to the issue tracker or other mechanism for reporting problems (e.g., mislabeling, corrupted images, etc.). This can simply be the Community tab for the repository or Issues on the associated GitHub repository. + +--- + +## Licensing Information + +See the discussion and references in the [template](HF_DatasetCard_Template_mkdocs.md/#__codelineno-0-19); also keep in mind the [digital product release and licensing policy](Digital-products-release-licensing-policy.md/). + +- [ ] **Licensing Details**: Confirm and list all licensing details. + +--- + +## Citation + +- [ ] **Data Citation**: Provide a BibTeX citation for the dataset. +- [ ] **Associated Paper Citation**: Provide a BibTeX citation for any associated papers. + +--- + +## Acknowledgements + +- [ ] **Acknowledgements**: Include funding or support acknowledgments. + +--- + +## Glossary (Optional) + +- [ ] **Glossary**: Provide definitions for relevant terms or calculations. + +--- + +## More Information (Optional) + +- [ ] **Additional Information**: Add any other relevant information. + +--- + +## Dataset Card Authors + +- [ ] **Authors**: List the authors of the dataset card. + +--- + +## Dataset Card Contact + +- [ ] **Contact Information**: [OPTIONAL] We recommend people use HF discussions, but you may indicate a person to contact. + +!!! 
question "[Questions, Comments, or Concerns?](https://github.com/Imageomics/Imageomics-guide/issues)" diff --git a/docs/wiki-guide/Digital-products-release-licensing-policy.md b/docs/wiki-guide/Digital-products-release-licensing-policy.md index f16d10b..15316f6 100644 --- a/docs/wiki-guide/Digital-products-release-licensing-policy.md +++ b/docs/wiki-guide/Digital-products-release-licensing-policy.md @@ -24,7 +24,7 @@ This means the following policy applies for digital products of the Imageomics I - For ML-ready datasets, for storage, version control, and sharing we recommend using [Hugging Face Dataset Hub](https://huggingface.co/docs/hub/datasets-overview), which provides for rich metadata description in the form of a [Dataset Card](HF_DatasetCard_Template_mkdocs.md). (See [Imageomics datasets](https://huggingface.co/imageomics) published there as examples.) - - Refer to the Imageomics [Hugging Face](Hugging-Face-Repo-Guide.md) and [Metadata](Metadata-Guide.md) guides for best-practices and further guidance. + - Refer to the Imageomics [Hugging Face](Hugging-Face-Repo-Guide.md) and [FAIR](FAIR-Guide.md) guides for best-practices and further guidance. 4. ML models are to be released under an [OSI-approved open source license](https://opensource.org/licenses/) or to the public domain (for example, by applying a [CC-Zero](https://creativecommons.org/publicdomain/zero/1.0/) waiver). In the case of potentially sensitive models or data (e.g., endangered species information), an Open [Responsible AI License](https://www.licenses.ai/ai-licenses) ([Open RAIL-M](https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses)) may be considered. 
diff --git a/docs/wiki-guide/FAIR-Guide.md b/docs/wiki-guide/FAIR-Guide.md new file mode 100644 index 0000000..36ef567 --- /dev/null +++ b/docs/wiki-guide/FAIR-Guide.md @@ -0,0 +1,33 @@ +# FAIR Guide + +This section provides information and resources to help ensure that digital products are ***F***indable, ***A***ccessible, ***I***nteroperable, ***R***eusable, and Reproducible[^1]. A general [Metadata Checklist](Metadata-Checklist.md) is provided to stimulate thinking about the type of information to be collected. Additionally, we include checklists for [code](Code-Checklist.md), [data](Data-Checklist.md), and [model](Model-Checklist.md) repositories. The code checklist focuses on the contents of a well-documented GitHub repository, while the data and model checklists cover the content of the [data](HF_DatasetCard_Template_mkdocs.md/) and [model](HF_ModelCard_Template_mkdocs.md/) card templates, respectively. + +The checklists were developed following the FAIR principles (as defined by the [Go-FAIR Initiative](https://www.go-fair.org/fair-principles/)). They provide a detailed outline of tasks and files to include to ensure alignment with the FAIR principles, and are complementary to the descriptions provided within the [GitHub](GitHub-Repo-Guide.md) and [Hugging Face](Hugging-Face-Repo-Guide.md) Guides presented on this site. As with the contents of these Guides, these checklists are based on a combination of existing guides (e.g., [The Turing Way](https://book.the-turing-way.org/), the [Model Card Guidebook](https://huggingface.co/docs/hub/en/model-card-annotated), and the [Dataset Card Creation Guide](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md)) and the experiences of our team. Following these checklists ensures digital products align with FAIR principles and reflect a best effort toward reproducibility.[^2] + +!!! 
tip "Pro tip" + +    Use the eye icon at the top of any checklist page to access the source and copy the markdown for the checklist into an issue on your GitHub [Repo](GitHub-Repo-Guide.md) or [Project](Guide-to-GitHub-Projects.md) so you can check the boxes as you add each element. When added to the main description of the issue, the issue summary will show _x_ out of total components completed for that issue. +The last topic in this section discusses different methods of [DOI Generation](DOI-Generation.md) for digital products (code, data, and models). It focuses on our selected method for dataset publication: [Hugging Face](https://huggingface.co/), with some guidance on using [Zenodo](https://zenodo.org/) to archive code (specifically, a GitHub repository). For more information about other common data publication venues—and to see the thought process behind our selection—see the [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf).[^3] Generating a DOI for a digital product provides a globally unique and persistent identifier that can be used to reference and refer back to it—an important component of FAIR and Reproducible principles. + +!!! info "References and Background" +    If you want to learn more about FAIR and Reproducible principles, explore these resources that we used when developing this guide: + +    - [The Turing Way](https://book.the-turing-way.org/): an open-source, community data science handbook. 
It provides a strong foundation for the guiding principles of _this_ Guide, with accessible explanations and overviews of topics from [reproducibility](https://book.the-turing-way.org/reproducible-research/reproducible-research), to [collaboration](https://book.the-turing-way.org/collaboration/collaboration) and [communication](https://book.the-turing-way.org/communication/communication), to [project design](https://book.the-turing-way.org/project-design/project-design), to [ethical research](https://book.the-turing-way.org/ethical-research/ethical-research). + +    _This is a particularly good resource for those [just starting to use `git` and GitHub](https://book.the-turing-way.org/reproducible-research/vcs/vcs-git). It builds motivation for use of version control through the lens of reproducibility._ +    - Go-FAIR Initiative: [The FAIR Principles](https://www.go-fair.org/fair-principles/) +    - Ozoani, Ezi and Gerchick, Marissa and Mitchell, Margaret. Model Card Guidebook. Hugging Face, 2022. [https://huggingface.co/docs/hub/en/model-card-guidebook](https://huggingface.co/docs/hub/en/model-card-guidebook). + +    _The authors also provide a nice [summary of related work](https://huggingface.co/docs/hub/en/model-card-landscape-analysis), including [Datasheets for Datasets (Gebru, et al., 2018)](https://www.fatml.org/media/documents/datasheets_for_datasets.pdf) and The Dataset Nutrition Label ([label](https://datanutrition.org/labels/), [paper](https://arxiv.org/abs/1805.03677))._ +    - Wilkinson, M., Dumontier, M., Aalbersberg, I. _et al._ The FAIR Guiding Principles for scientific data management and stewardship. _Sci Data_ **3**, 160018 (2016). [10.1038/sdata.2016.18](https://doi.org/10.1038/sdata.2016.18) +    - Barker, M., Chue Hong, N.P., Katz, D.S. _et al._ Introducing the FAIR Principles for research software. _Sci Data_ **9**, 622 (2022). [10.1038/s41597-022-01710-x](https://doi.org/10.1038/s41597-022-01710-x) +    - Balk, M. 
A., Bradley, J., Maruf, M., Altintaş, B., Bakiş, Y., Bart, H. L. Jr, Breen, D., Florian, C. R., Greenberg, J., Karpatne, A., Karnani, K., Mabee, P., Pepper, J., Jebbia, D., Tabarin, T., Wang, X., & Lapp, H. (2024). A FAIR and modular image-based workflow for knowledge discovery in the emerging field of imageomics. _Methods in Ecology and Evolution_, 15, 1129–1145. [10.1111/2041-210X.14327](https://doi.org/10.1111/2041-210X.14327) +    - The [FARR Research Coordination Network](https://www.farr-rcn.org/) has a number of interesting resources and events. +    - The [Research Data Alliance for Interdisciplinary Research](https://www.rd-alliance.org/disciplines/rda-for-interdisciplinary-research/) also provides links to resources and events particularly focused on considerations in interdisciplinary research. + +!!! question "[Questions, Comments, or Concerns?](https://github.com/Imageomics/Imageomics-guide/issues)" + +[^1]: While "Reproducible" is not part of the original FAIR principles as defined by the [Go-FAIR Initiative](https://www.go-fair.org/fair-principles/), we include it here to emphasize the importance of computational reproducibility alongside data stewardship. This extension reflects emerging practice in data-intensive science, where code, models, and workflows must be reusable and verifiable to support robust scientific claims. It is not part of the formal FAIR acronym, but aligns with broader community goals for open and transparent research. +[^2]: Full reproducibility is difficult to achieve; this [presentation](https://drive.google.com/file/d/1BFqZ00zMuyVHaD9A8PvzRDEg7aV0kp3W/view?usp=drive_link) by Odd Erik Gundersen provides a discussion of the varying degrees of reproducibility and useful references when considering the level of reproducibility achieved by a given project. 
+[^3]: The [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf) was created in May 2023 as part of developing archive recommendations for the Institute, so it does not include information about newer features such as [Hugging Face's dataset viewer](https://huggingface.co/docs/hub/en/datasets-viewer), which greatly simplifies previewing datasets for downstream users. diff --git a/docs/wiki-guide/Metadata-Guide.md b/docs/wiki-guide/Metadata-Checklist.md similarity index 63% rename from docs/wiki-guide/Metadata-Guide.md rename to docs/wiki-guide/Metadata-Checklist.md index 0891c94..96a5dc7 100644 --- a/docs/wiki-guide/Metadata-Guide.md +++ b/docs/wiki-guide/Metadata-Checklist.md @@ -1,42 +1,43 @@ -# Metadata Guide +# Metadata Checklist -When collecting or compiling new data, there are generally questions one is _trying_ to answer. There are also often questions that will come up later—whether for yourself or others interested in using your data. +When collecting or compiling new data, there are generally questions one is _trying_ to answer. There are also often questions that will come up later—whether for yourself or others interested in using your data. To improve both the _**Findability**_ and _**Reusability**_ of your data (ensuring [FAIR principles](Glossary-for-Imageomics.md#fair-data-principles)) for yourself and others, be sure to note down the following information. !!! note "This is not an exhaustive list." - Be sure to include any other information that may be important to your particular project or field. + Be sure to include any other information that may be important to your particular project or field. For instance, see the [Code](Code-Checklist.md), [Data](Data-Checklist.md), and [Model](Model-Checklist.md) Checklists included in this section. 
## Checklist for Metadata to Record + - [ ] **Description:** Summary of your data, for instance: - - What are the contents of the data (images, text, type of animal)? - - Is it machine-ready? - - Where did it come from (Source)? + - What are the contents of the data (images, text, type of animal)? + - Is it machine-ready? + - Where did it come from (Source)? - [ ] **Data Sources:** Machine-readable sources of the data (links or other files). -- [ ] **License Information:** This is part of retaining records of a data source (eg., museum images, previous dataset). A record of licenses on the images must be retained to ensure they are respected. If dealing with CC licenses, please see this [OSU Library CC best practices guide](https://library.osu.edu/sites/default/files/2022-10/attributing_cc_license_flyer_2022_ac.pdf). +- [ ] **License Information:** This is part of retaining records of a data source (e.g., museum images, previous dataset). A record of licenses on the images must be retained to ensure they are respected. If dealing with CC licenses, please see this [OSU Library CC best practices guide](https://library.osu.edu/sites/default/files/2022-10/attributing_cc_license_flyer_2022_ac.pdf). - [ ] **Dataset Structure:** - - Organization of the full dataset (eg., file structure). + - Organization of the full dataset (e.g., file structure). - Feature information: Information available for each image, such as species and subspecies designations, location information, etc. - - Instance information: Image type (jpg, tiff, png), number of pixels per image, coloring (RGB, UV), presence of scale or color indicators (ruler or ColorChecker), etc. + - Instance information: Image type (jpg, tiff, png), number of pixels per image, color space (RGB, UV), presence of scale or color indicators (ruler or ColorChecker), etc. 
- [ ] **Processing Steps:** List modifications performed (as they're done) and include links to the code used (this _should_ be organized somewhere, like a GitHub repository). - Similarly, include any annotation process information. -- [ ] **Tasks:** What could this dataset be used for (eg., image classification, feature extraction, image segmentation, etc.). +- [ ] **Tasks:** What could this dataset be used for (e.g., image classification, feature extraction, image segmentation, etc.). - [ ] **Curation Rationale:** Why are you collecting and/or modifying this data? - - This ties into the question of tasks it could be applied to, both to help maintain the group focus, and increase the likelihood others interested in answering similar questions will be able to find and use your data. -- [ ] **Author:** The curator(s)/editor(s) of the data. Assumes sufficient modification of the data by you (and your team) or that you have collected it. - - If thinking about publishing the data, add ORCID to all Authors; these can be looked up on [orcid.org](https://orcid.org/). -- [ ] **Related Publication:** Any papers that are based on this dataset. + - This ties into the question of tasks it could be applied to, both to help maintain the group focus, and increase the likelihood others interested in answering similar questions will be able to find and use your data. +- [ ] **Author:** The curator(s)/editor(s) of the data. Assumes sufficient modification of the data by you (and your team) or that you have collected it. + - If thinking about publishing the data, add ORCID to all Authors; these can be looked up on [orcid.org](https://orcid.org/). +- [ ] **Related Publications:** Any papers that are based on this dataset. - [ ] **Related Datasets:** Provide links to any related datasets (may include previous/background research). - [ ] **Other References:** Links to any related/background articles. 
-- [ ] **Keywords/Tags:** Terms one might search to find this dataset, eg., type(s) of animals, type(s) of images, imbalanced (if not even distribution of species/subspecies/etc). +- [ ] **Keywords/Tags:** Terms one might search to find this dataset, e.g., type(s) of animals, type(s) of images, imbalanced (if not even distribution of species/subspecies/etc). - It helps to keep a running list. - [ ] **Notes:** Any other image/data information. -!!! warning "Remember" +!!! warning "Remember" Datasets **_cannot_** be redistributed without this information. -!!! tip "Pro tip" +!!! tip "Pro tip" Use the eye icon at the top of this page to access the source and copy the markdown for the checklist above into an issue on your GitHub [Repo](GitHub-Repo-Guide.md) or [Project](Guide-to-GitHub-Projects.md) so you can check the boxes as you add each. diff --git a/docs/wiki-guide/Model-Checklist.md b/docs/wiki-guide/Model-Checklist.md new file mode 100644 index 0000000..adc6f27 --- /dev/null +++ b/docs/wiki-guide/Model-Checklist.md @@ -0,0 +1,130 @@ +# Model Card Checklist + +Below is a checklist encompassing all sections of a model card. Review notes and guidance provided in the full [model card template](HF_ModelCard_Template_mkdocs.md/) for more details. + +!!! tip "Pro tip" + + Use the eye icon at the top of this page to access the source and copy the markdown for the checklist below into an issue on your GitHub [Repo](GitHub-Repo-Guide.md) or [Project](Guide-to-GitHub-Projects.md) so you can check the boxes as you add each element to your [model card](HF_ModelCard_Template_mkdocs.md). + +## General Information + +- [ ] **Model Name**: Provide the name of the model. +- [ ] **Model Summary**: Provide a quick summary of what the model is/does +- [ ] **License**: Choose an appropriate license (e.g., `cc0-1.0`). +- [ ] **Language(s)**: Specify the language(s) used (e.g., `en`). +- [ ] **Tags**: Include relevant tags (e.g., `biology`, `CV`, `images`, `animals`). 
+- [ ] **Datasets**: List datasets used for training, linking if hosted on Hugging Face. E.g., `imageomics/TreeOfLife-10M`.
+- [ ] **Metrics**: Specify key evaluation metrics (refer to [Hugging Face metrics list](https://hf.co/metrics)).
+
+---
+
+## Model Details
+
+- [ ] **Detailed Summary**: Provide a longer summary of what this model is.
+- [ ] **Developed by**: List the developers.
+- [ ] **Model Type**: Describe the model type.
+- [ ] **Fine-tuned from**: Specify the base model if fine-tuned.
+- [ ] **Version**: Indicate the model version.
+- [ ] **Repository**: Provide the link to the project repository (GitHub).
+- [ ] **Paper**: Link to any associated research papers (not expected at this point).
+- [ ] **Demo**: Link to an interactive demo (if available).
+
+---
+
+## Uses
+
+- [ ] **Direct Use**: Describe how the model can be used without fine-tuning or plugging into a larger ecosystem/app.
+- [ ] **Downstream Use**: List potential fine-tuned applications, or uses when plugged into a larger ecosystem/app.
+- [ ] **Out-of-Scope Use**: Indicate any misuse, malicious use, and uses that the model will not work well for.
+
+---
+
+## Bias, Risks, and Limitations
+
+- [ ] **Bias, Risks, and Limitations**: Discuss potential biases and limitations in the model, along with possible mitigations.
+- [ ] **Recommendations**: Provide responsible usage recommendations with respect to the bias, risks, and technical limitations.
+
+---
+
+## Getting Started
+
+- [ ] **Usage Instructions**: Provide example code for using the model.
+- [ ] **Installation Guide**: List dependencies and installation steps.
+
+---
+
+## Training Details
+
+- [ ] **Training Data**: Describe the dataset used for training. This should link to a Dataset Card where possible, otherwise link to the original source with more info.
+- [ ] **Preprocessing**: Detail data preprocessing techniques.
+- [ ] **Training Procedure**: Describe the training approach.
+- [ ] **Training Hyperparameters**: List key hyperparameters used.
+- [ ] **Speeds, Sizes, Times**: Provide information about throughput, start/end time, checkpoint size if relevant, etc.
+
+---
+
+## Evaluation
+
+This section describes the evaluation protocols and provides the results.
+
+- [ ] **Testing Data**: Describe the dataset used for testing. This should link to a Dataset Card if possible, otherwise link to the original source with more info.
+- [ ] **Factors**: Describe evaluation criteria (e.g., subpopulations, domains).
+- [ ] **Metrics**: Specify evaluation metrics and reasoning.
+- [ ] **Results**: Summarize model performance on testing data.
+- [ ] **Benchmark Comparisons**: Compare with existing baselines.
+
+---
+
+## Model Examination
+
+- [ ] **Interpretability**: Provide information on model explainability.
+- [ ] **Visualization**: Include any relevant visualizations.
+
+---
+
+## Environmental Impact
+
+- [ ] **Compute Region**: Specify cloud provider and region.
+- [ ] **Hardware Type**: List GPUs and CPUs used.
+- [ ] **Training Hours**: Estimate the total training time.
+- [ ] **Carbon Emissions**: Calculate emissions using the [ML Impact calculator](https://mlco2.github.io/impact#compute).
+
+---
+
+## Technical Specifications
+
+- [ ] **Model Architecture**: Provide a detailed architecture description and the choices behind its selection.
+- [ ] **Performance Metrics**: List performance metrics and their significance.
+- [ ] **Model Size**: Specify the model size in MB.
+- [ ] **Compute Requirements**: List hardware and software requirements.
+
+---
+
+## Licensing and Citation
+
+See the discussion and references in the [template](HF_ModelCard_Template_mkdocs.md/#__codelineno-0-19), and remember the [digital product release and licensing policy](Digital-products-release-licensing-policy.md/).
+
+- [ ] **License**: Confirm licensing details.
+- [ ] **Citation**: Provide a BibTeX citation for the model and associated paper.
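+    - A hypothetical sketch of the shape such an entry might take (every name, field, and URL below is a placeholder, not a real record):
+
+        ```bibtex
+        @software{example_model_2025,
+          author  = {Lastname, Firstname and Collaborator, Name},
+          title   = {Example Model Name},
+          year    = {2025},
+          url     = {https://huggingface.co/imageomics/example-model},
+          version = {1.0}
+        }
+        ```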
+ +--- + +## Acknowledgements + +- [ ] **Funding and Support**: List sources of funding and institutional support. + +--- + +## Glossary (Optional) + +- [ ] **Definitions**: Provide explanations for technical terms. + +--- + +## Additional Information + +- [ ] **Notes**: Include any other relevant details. +- [ ] **Model Card Authors**: List contributors to the model card. +- [ ] **Model Card Contact**: [OPTIONAL] We recommend people use HF discussions, but you may indicate a person to contact. + +!!! question "[Questions, Comments, or Concerns?](https://github.com/Imageomics/Imageomics-guide/issues)" diff --git a/mkdocs.yaml b/mkdocs.yaml index ea00471..11bd1e2 100644 --- a/mkdocs.yaml +++ b/mkdocs.yaml @@ -102,8 +102,12 @@ nav: - "Workflow": wiki-guide/The-Hugging-Face-Workflow.md - "Dataset Upload Guide": wiki-guide/The-Hugging-Face-Dataset-Upload-Guide.md - "Why Use the Institute Hugging Face": wiki-guide/Why-use-the-Institute-Hugging-Face.md - - Metadata Guide: - - "Metadata Guide": wiki-guide/Metadata-Guide.md + - FAIR Guide: + - "About FAIR Principles": wiki-guide/FAIR-Guide.md + - "Metadata Checklist": wiki-guide/Metadata-Checklist.md + - "Code Repo Checklist": wiki-guide/Code-Checklist.md + - "Data Card Checklist": wiki-guide/Data-Checklist.md + - "Model Card Checklist": wiki-guide/Model-Checklist.md - "DOI Generation": wiki-guide/DOI-Generation.md - Templates: - "About Templates": wiki-guide/About-Templates.md
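Note for reviewers: the General Information items in the Model Card Checklist above (License, Language, Tags, Datasets, Metrics) correspond to the YAML metadata block at the top of a Hugging Face model card. A minimal sketch using the example values from the checklist — the `accuracy` metric is an illustrative assumption, not prescribed by the checklist:

```yaml
---
license: cc0-1.0
language:
  - en
tags:
  - biology
  - CV
  - images
  - animals
datasets:
  - imageomics/TreeOfLife-10M
metrics:
  - accuracy
---
```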