From 67da3a10bb7b7341d0e00de4613450deed5c0088 Mon Sep 17 00:00:00 2001
From: egrace479
Date: Fri, 14 Feb 2025 18:02:54 -0500
Subject: [PATCH 01/33] Add checklists for GH and HF expectations

As written for AI and Ecology Course 2025

Co-authored-by: Net Zhang
---
 docs/wiki-guide/Code-Checklist.md  | 104 +++++++++++++++++++++++++
 docs/wiki-guide/Data-Checklist.md  |  99 ++++++++++++++++++++++++
 docs/wiki-guide/Model-Checklist.md | 120 +++++++++++++++++++++++++++++
 3 files changed, 323 insertions(+)
 create mode 100644 docs/wiki-guide/Code-Checklist.md
 create mode 100644 docs/wiki-guide/Data-Checklist.md
 create mode 100644 docs/wiki-guide/Model-Checklist.md

diff --git a/docs/wiki-guide/Code-Checklist.md b/docs/wiki-guide/Code-Checklist.md
new file mode 100644
index 0000000..7def419
--- /dev/null
+++ b/docs/wiki-guide/Code-Checklist.md
@@ -0,0 +1,104 @@
+# Code Checklist
+
+This checklist provides expectations for the code repositories created for the Experiential AI & Ecology Course (Spring Semester 2025).
+
+## Required Files
+- [ ] **License**: Verify and include an appropriate license (e.g., `MIT`, `CC0-1.0`, etc.). See discussion in the [guide](https://imageomics.github.io/Imageomics-guide/wiki-guide/GitHub-Repo-Guide/#license).
+- [ ] **README File**: Following the [guide](https://imageomics.github.io/Imageomics-guide/wiki-guide/GitHub-Repo-Guide/#readme), provide a detailed `README.md` with:
+    - [ ] Overview of the project.
+    - [ ] Installation instructions.
+    - [ ] Basic usage examples.
+    - [ ] Links to related/created dataset(s).
+    - [ ] Links to related/created model(s).
+    - [ ] Acknowledge source code dependencies and contributors.
+    - [ ] Reference related datasets used in training or evaluation.
+- [ ] **Requirements File**: Provide a [file detailing software requirements](https://imageomics.github.io/Imageomics-guide/wiki-guide/GitHub-Repo-Guide/#software-requirements-file), such as a `requirements.txt` or `pyproject.toml` for Python dependencies.
+- [ ] **Gitignore File**: GitHub has premade `.gitignore` files ([here](https://github.com/github/gitignore)) tailored to particular languages (eg., [R](https://github.com/github/gitignore/blob/main/R.gitignore) or [Python](https://github.com/github/gitignore/blob/main/Python.gitignore)), operating systems, etc.
+- [ ] **CITATION CFF**: This facilitates citation of your work, follow guidance provided in the [guide](https://imageomics.github.io/Imageomics-guide/wiki-guide/GitHub-Repo-Guide/#citation).
+
+### Data-Related
+- [ ] Preprocessing code.
+- [ ] Description of dataset(s), including description of training and testing sets (with links to relevant portions of dataset card, which will have more information).
+
+### Model-Related
+- [ ] Training code.
+- [ ] Inference/evaluation code.
+- [ ] Model weights (if not in Hugging Face model repository).
+- [ ] Description of model(s)/benchmark(s).
+- [ ] Explanation of training and testing (with links to relevant portions of model card, which will have more information).
+
+> [!NOTE]
+> The [bioclip GitHub repository](https://github.com/Imageomics/bioclip) provides an example of incorporating data-and model-related code into a GitHub repository as published open-source code for both data and model development.
+
+## General Information
+
+- [ ] **Repository Structure**: Ensure the code repository follows a clear and logical directory structure. (See [guide](https://imageomics.github.io/Imageomics-guide/wiki-guide/GitHub-Repo-Guide/#general-repository-structure).)
+- [ ] **Code Comments**: Include meaningful inline comments and function descriptions for clarity.
+- [ ] **Random Seed Control**: Save random seeds to ensure reproducible results.
+
+
+## Security Considerations
+
+- [ ] **Sensitive Data Handling**: Ensure no hardcoded sensitive information (e.g., API keys, credentials) are included in your repository. These can be shared through a config file on OSC.
+
+
+> [!NOTE]
+> The best practices described below will help you meet the above requirements. The more advanced development practices noted further down are included for educational purposes and since some groups have chosen to use linters. They are not required for this course; however, we recommend having a group discussion about the topics covered in [Code Quality](#code-quality).
+
+---
+
+# Best Practices
+
+The [Repo Guide](https://imageomics.github.io/Imageomics-guide/wiki-guide/GitHub-Repo-Guide/) provides general guidance on repository structure, [collaborative workflow](https://imageomics.github.io/Imageomics-guide/wiki-guide/The-GitHub-Workflow/), and [how to make and review pull requests (PR)](https://imageomics.github.io/Imageomics-guide/wiki-guide/The-GitHub-Pull-Request-Guide/). Below, we highlight some best practices in checklist form that are recommended for this course and beyond.
+
+## Reproducibility
+
+- **Version Control**: Use Git for version control and commit regularly.
+- **Modularization**: Structure code into reusable and independent modules.
+- **Code Execution**: Provide Notebooks to demonstrate how to reproduce results.
+
+
+## Code Review & Maintenance
+
+- **Code Reviews**: Regular peer reviews for quality assurance. Refer to the [GitHub Repo Guide](https://imageomics.github.io/Imageomics-guide/wiki-guide/GitHub-Repo-Guide/).
+- **Issue Tracking**: Use GitHub issues for tracking bugs and feature requests.
+- **Versioning**: Tag releases, changelogs can be auto-generated and informative when PRs are appropriately scoped.
+
+
+## Installation and Dependencies
+
+- [ ] **Environment Setup**: Include setup instructions (e.g., `conda` environment file, `Dockerfile`).
+- [ ] **Dependency Management**: Use virtual environments (e.g., `venv`, `conda`, `uv` for Python) to isolate dependencies.
+
+---
+
+# More Advanced Development
+
+## Documentation
+
+- [ ] **API Documentation**: Generate API documentation (e.g., `MkDocs` for Python or wiki pages in the repo).
+- [ ] **Docstrings**: Add comprehensive docstrings for all functions, classes, and modules. These can be incorporated to help generate documentation.
+- [ ] **Example Scripts**: Include example scripts for common use cases.
+- [ ] **Configuration Files**: Use `yaml`, `json`, or `ini` for configuration settings.
+
+
+## Code Quality
+
+- [ ] **Consistent Style**: Follow coding style guidelines (e.g., `PEP 8` for Python).
+- [ ] **Linting**: Ensure the code passes a linter (e.g., `Ruff` for Python).
+- [ ] **Logging**: Use logging instead of print statements for better debugging (e.g., `logging` in Python).
+- [ ] **Error Handling**: Implement robust exception handling to avoid crashes.
+
+
+## Testing
+
+- [ ] **Unit Tests**: Write unit tests to validate core functionality.
+- [ ] **Integration Tests**: Ensure components work together correctly.
+- [ ] **Test Coverage**: Check test coverage
+- [ ] **Continuous Integration (CI)**: Set up CI/CD pipelines (e.g., GitHub Actions) for automated testing.
+
+
+## Code Distribution & Deployment
+
+- [ ] **Packaging**: Provide installation instructions (e.g., `setup.py`, `hatch`, `poetry`, `uv` for Python).
+- [ ] **Deployment Guide**: Document deployment procedures
diff --git a/docs/wiki-guide/Data-Checklist.md b/docs/wiki-guide/Data-Checklist.md
new file mode 100644
index 0000000..e8ac7aa
--- /dev/null
+++ b/docs/wiki-guide/Data-Checklist.md
@@ -0,0 +1,99 @@
+# Dataset Card Checklist
+Below is a checklist encompassing all sections of a dataset card. Review notes and guidance provided in the full [datatset card template](https://imageomics.github.io/Imageomics-guide/wiki-guide/HF_DatasetCard_Template_mkdocs/) for more details.
+
+## General Information
+
+- [ ] **License**: Verify and specify the license type (e.g., `cc0-1.0`).
+- [ ] **Language**: Indicate the language(s) (e.g., `en`).
+- [ ] **Pretty Name**: Provide a descriptive name for the dataset.
+- [ ] **Task Categories**: List relevant task categories (e.g., image-classification). Refer to [task categories](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/pipelines.ts).
+- [ ] **Tags**: Include relevant tags (e.g., `biology`, `image`, `animals`, `CV`).
+- [ ] **Size Categories**: Specify dataset size (e.g., `n<1K`, `1K.png`, each within a folder named for the species. They are 1024 x 1024, and the color has been standardized using ``.
+- [ ] **Data Fields**: Describe the types of the data files or the columns in a CSV with metadata.
+- [ ] **Data Splits**: Describe any splits (e.g., train, test, validation).
+
+---
+
+## Dataset Creation
+Refer to examples and explanations provided in the full [dataset card template](https://imageomics.github.io/Imageomics-guide/wiki-guide/HF_DatasetCard_Template_mkdocs/). Much of this should have been filled out before leaving Hawaii.
+
+- [ ] **Curation Rationale**: Explain why this dataset was created.
+- [ ] **Source Data**: Describe the source data.
+    - [ ] **Data Collection and Processing**: Describe data creation, selection, filtering, normalization, and tools used.
+    - [ ] **Source Data Producers**: List original data producers or sources.
+- [ ] **Annotations**: Include details on annotations.
+    - [ ] **Annotation Process**: Describe the process and tools used.
+    - [ ] **Annotators**: List the annotators if applicable.
+- [ ] **Personal and Sensitive Information**: Indicate any sensitive information in the dataset.
+
+---
+
+## Considerations for Using the Data
+Things to consider while working with the dataset. For instance, maybe there are hybrids and they are labeled in the `hybrid_stat` column, so to get a subset without hybrids, subset to all instances in the metadata file such that `hybrid_stat` is _not_ "hybrid".
+
+- [ ] **Bias, Risks, and Limitations**: Describe any known issues with the dataset. For instance, if your data exhibits a long-tailed distribution (and why).
+- [ ] **Recommendations**: Provide recommendations for using the dataset responsibly.
+
+---
+
+## Licensing Information
+See discussion and references in [template](https://imageomics.github.io/Imageomics-guide/wiki-guide/About-Templates/), also remember the [digital product release and licensing policy](https://imageomics.github.io/Imageomics-guide/wiki-guide/Digital-products-release-licensing-policy/).
+
+- [ ] **Licensing Details**: Confirm and list all licensing details.
+
+---
+
+## Citation
+
+- [ ] **Data Citation**: Provide a BibTeX citation for the dataset.
+- [ ] **Associated Paper Citation**: Provide a BibTeX citation for any associated papers.
+
+---
+
+## Acknowledgements
+
+- [ ] **Acknowledgements**: Include funding or support acknowledgments.
+
+---
+
+## Glossary (Optional)
+
+- [ ] **Glossary**: Provide definitions for relevant terms or calculations.
+
+---
+
+## More Information (Optional)
+
+- [ ] **Additional Information**: Add any other relevant information.
+
+---
+
+## Dataset Card Authors
+
+- [ ] **Authors**: List the authors of the dataset card.
+
+---
+
+## Dataset Card Contact
+
+- [ ] **Contact Information**: [OPTIONAL] We recommend people use HF discussions, but you may indicate a person to contact.
diff --git a/docs/wiki-guide/Model-Checklist.md b/docs/wiki-guide/Model-Checklist.md
new file mode 100644
index 0000000..c421f79
--- /dev/null
+++ b/docs/wiki-guide/Model-Checklist.md
@@ -0,0 +1,120 @@
+# Model Card Checklist
+
+Below is a checklist encompassing all sections of a model card. Review notes and guidance provided in the full [model card template](https://imageomics.github.io/Imageomics-guide/wiki-guide/HF_ModelCard_Template_mkdocs/) for more details.
+
+## General Information
+
+- [ ] **Model Name**: Provide the name of the model.
+- [ ] **Model Summary**: Provide a quick summary of what the model is/does
+- [ ] **License**: Choose an appropriate license (e.g., `cc0-1.0`).
+- [ ] **Language(s)**: Specify the language(s) used (e.g., `en`).
+- [ ] **Tags**: Include relevant tags (e.g., `biology`, `CV`, `images`, `animals`).
+- [ ] **Datasets**: List datasets used for training, linking if hosted on Hugging Face. Ex: imageomics/TreeOfLife-10M
+- [ ] **Metrics**: Specify key evaluation metrics (refer to [Hugging Face metrics list](https://hf.co/metrics)).
+
+---
+
+## Model Details
+
+- [ ] **Detailed Summary**: Provide a longer summary of what this model is.
+- [ ] **Developed by**: List the developers.
+- [ ] **Model Type**: Describe the model type.
+- [ ] **Fine-tuned from**: Specify the base model if fine-tuned.
+- [ ] **Version**: Indicate the model version.
+- [ ] **Repository**: Provide the link to the project repository (GitHub).
+- [ ] **Paper**: Link to any associated research papers (not expected at this point).
+- [ ] **Demo**: Link to an interactive demo (if available).
+
+---
+
+## Uses
+
+- [ ] **Direct Use**: Describe how the model can be used without fine-tuning or plugging into a larger ecosystem/app.
+- [ ] **Downstream Use**: List potential fine-tuned applications for a task, or plugging into a larger ecosystem/app.
+- [ ] **Out-of-Scope Use**: Indicate any misuse, malicious use, and uses that the model will not work well for.
+
+---
+
+## Bias, Risks, and Limitations
+
+- [ ] **Bias, Risks, and Limitations**: Discuss potential biases and in the model, along with possible mitigations.
+- [ ] **Recommendations**: Provide responsible usage recommendations with respect to the bias, risk, and technical limitations.
+
+---
+
+## Getting Started
+
+- [ ] **Usage Instructions**: Provide example code for using the model.
+- [ ] **Installation Guide**: List dependencies and installation steps.
+
+---
+
+## Training Details
+
+- [ ] **Training Data**: Describe the dataset used for training. This should link to a Dataset Card where possible, otherwise link to the original source with more info.
+- [ ] **Preprocessing**: Detail data preprocessing techniques.
+- [ ] **Training Procedure**: Describe the training approach.
+- [ ] **Training Hyperparameters**: List key hyperparameters used.
+- [ ] **Speeds, Sizes, Times**: Provide information about throughput, start/end time, checkpoint size if relevant, etc.
+
+---
+
+## Evaluation
+This section describes the evaluation protocols and provides the results.
+
+- [ ] **Testing Data**: Describe the dataset used for testing. This should link to a Dataset Card if possible, otherwise link to the original source with more info.
+- [ ] **Factors**: Describe evaluation criteria (e.g., subpopulations, domains).
+- [ ] **Metrics**: Specify evaluation metrics and reasoning.
+- [ ] **Results**: Summarize model performance on testing data
+- [ ] **Benchmark Comparisons**: Compare with existing baselines.
+
+---
+
+## Model Examination
+
+- [ ] **Interpretability**: Provide information on model explainability.
+- [ ] **Visualization**: Include any relevant visualizations.
+
+---
+
+## Environmental Impact
+
+- [ ] **Compute Region**: Specify cloud provider and region.
+- [ ] **Hardware Type**: List GPUs and CPUs used.
+- [ ] **Training Hours**: Estimate the total training time.
+- [ ] **Carbon Emissions**: Calculate emissions using the [ML Impact calculator](https://mlco2.github.io/impact#compute).
+
+---
+
+## Technical Specifications
+
+- [ ] **Model Architecture**: Provide a detailed architecture description and the objective behind.
+- [ ] **Compute Requirements**: List hardware and software requirements.
+
+---
+
+## Licensing and Citation
+See discussion and references in [template](https://imageomics.github.io/Imageomics-guide/wiki-guide/About-Templates/), also remember the [digital product release and licensing policy](https://imageomics.github.io/Imageomics-guide/wiki-guide/Digital-products-release-licensing-policy/).
+
+- [ ] **License**: Confirm licensing details.
+- [ ] **Citation**: Provide a BibTeX citation for the model and associated paper.
+
+---
+
+## Acknowledgements
+
+- [ ] **Funding and Support**: List sources of funding and institutional support.
+
+---
+
+## Glossary (Optional)
+
+- [ ] **Definitions**: Provide explanations for technical terms.
+
+---
+
+## Additional Information
+
+- [ ] **Notes**: Include any other relevant details.
+- [ ] **Model Card Authors**: List contributors to the model card.
+- [ ] **Model Card Contact**: [OPTIONAL] We recommend people use HF discussions, but you may indicate a person to contact.

From 9ac2681f9bf264ec54fc4caf4e575926ad837023 Mon Sep 17 00:00:00 2001
From: egrace479
Date: Fri, 14 Feb 2025 18:33:32 -0500
Subject: [PATCH 02/33] Update formatting for Mkdocs

---
 docs/wiki-guide/Code-Checklist.md  | 30 +++++++++++++++---------------
 docs/wiki-guide/Data-Checklist.md  |  8 ++++----
 docs/wiki-guide/Model-Checklist.md |  4 ++--
 3 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/docs/wiki-guide/Code-Checklist.md b/docs/wiki-guide/Code-Checklist.md
index 7def419..3f30c54 100644
--- a/docs/wiki-guide/Code-Checklist.md
+++ b/docs/wiki-guide/Code-Checklist.md
@@ -3,8 +3,8 @@
 This checklist provides expectations for the code repositories created for the Experiential AI & Ecology Course (Spring Semester 2025).
 
 ## Required Files
-- [ ] **License**: Verify and include an appropriate license (e.g., `MIT`, `CC0-1.0`, etc.). See discussion in the [guide](https://imageomics.github.io/Imageomics-guide/wiki-guide/GitHub-Repo-Guide/#license).
-- [ ] **README File**: Following the [guide](https://imageomics.github.io/Imageomics-guide/wiki-guide/GitHub-Repo-Guide/#readme), provide a detailed `README.md` with:
+- [ ] **License**: Verify and include an appropriate license (e.g., `MIT`, `CC0-1.0`, etc.). See discussion in the [guide](GitHub-Repo-Guide.md/#license).
+- [ ] **README File**: Following the [guide](GitHub-Repo-Guide.md/#readme), provide a detailed `README.md` with:
     - [ ] Overview of the project.
     - [ ] Installation instructions.
     - [ ] Basic usage examples.
@@ -12,9 +12,9 @@ This checklist provides expectations for the code repositories created for the E
     - [ ] Links to related/created model(s).
     - [ ] Acknowledge source code dependencies and contributors.
     - [ ] Reference related datasets used in training or evaluation.
-- [ ] **Requirements File**: Provide a [file detailing software requirements](https://imageomics.github.io/Imageomics-guide/wiki-guide/GitHub-Repo-Guide/#software-requirements-file), such as a `requirements.txt` or `pyproject.toml` for Python dependencies.
+- [ ] **Requirements File**: Provide a [file detailing software requirements](GitHub-Repo-Guide.md/#software-requirements-file), such as a `requirements.txt` or `pyproject.toml` for Python dependencies.
 - [ ] **Gitignore File**: GitHub has premade `.gitignore` files ([here](https://github.com/github/gitignore)) tailored to particular languages (eg., [R](https://github.com/github/gitignore/blob/main/R.gitignore) or [Python](https://github.com/github/gitignore/blob/main/Python.gitignore)), operating systems, etc.
-- [ ] **CITATION CFF**: This facilitates citation of your work, follow guidance provided in the [guide](https://imageomics.github.io/Imageomics-guide/wiki-guide/GitHub-Repo-Guide/#citation).
+- [ ] **CITATION CFF**: This facilitates citation of your work, follow guidance provided in the [guide](GitHub-Repo-Guide.md/#citation).
 
 ### Data-Related
 - [ ] Preprocessing code.
@@ -27,12 +27,12 @@ This checklist provides expectations for the code repositories created for the E
 - [ ] Description of model(s)/benchmark(s).
 - [ ] Explanation of training and testing (with links to relevant portions of model card, which will have more information).
 
-> [!NOTE]
-> The [bioclip GitHub repository](https://github.com/Imageomics/bioclip) provides an example of incorporating data-and model-related code into a GitHub repository as published open-source code for both data and model development.
+!!! note
+    The [bioclip GitHub repository](https://github.com/Imageomics/bioclip) provides an example of incorporating data-and model-related code into a GitHub repository as published open-source code for both data and model development.
 
 ## General Information
 
-- [ ] **Repository Structure**: Ensure the code repository follows a clear and logical directory structure. (See [guide](https://imageomics.github.io/Imageomics-guide/wiki-guide/GitHub-Repo-Guide/#general-repository-structure).)
+- [ ] **Repository Structure**: Ensure the code repository follows a clear and logical directory structure. (See [guide](GitHub-Repo-Guide.md/#general-repository-structure).)
 - [ ] **Code Comments**: Include meaningful inline comments and function descriptions for clarity.
 - [ ] **Random Seed Control**: Save random seeds to ensure reproducible results.
 
@@ -42,30 +42,30 @@ This checklist provides expectations for the code repositories created for the E
 - [ ] **Sensitive Data Handling**: Ensure no hardcoded sensitive information (e.g., API keys, credentials) are included in your repository. These can be shared through a config file on OSC.
 
 
-> [!NOTE]
-> The best practices described below will help you meet the above requirements. The more advanced development practices noted further down are included for educational purposes and since some groups have chosen to use linters. They are not required for this course; however, we recommend having a group discussion about the topics covered in [Code Quality](#code-quality).
+!!! note
+    The best practices described below will help you meet the above requirements. The more advanced development practices noted further down are included for educational purposes and since some groups have chosen to use linters. They are not required for this course; however, we recommend having a group discussion about the topics covered in [Code Quality](#code-quality).
 
 ---
 
-# Best Practices
+## Best Practices
 
-The [Repo Guide](https://imageomics.github.io/Imageomics-guide/wiki-guide/GitHub-Repo-Guide/) provides general guidance on repository structure, [collaborative workflow](https://imageomics.github.io/Imageomics-guide/wiki-guide/The-GitHub-Workflow/), and [how to make and review pull requests (PR)](https://imageomics.github.io/Imageomics-guide/wiki-guide/The-GitHub-Pull-Request-Guide/). Below, we highlight some best practices in checklist form that are recommended for this course and beyond.
+The [Repo Guide](GitHub-Repo-Guide.md/) provides general guidance on repository structure, [collaborative workflow](The-GitHub-Workflow.md/), and [how to make and review pull requests (PR)](The-GitHub-Pull-Request-Guide.md/). Below, we highlight some best practices in checklist form that are recommended for this course and beyond.
 
-## Reproducibility
+### Reproducibility
 
 - **Version Control**: Use Git for version control and commit regularly.
 - **Modularization**: Structure code into reusable and independent modules.
 - **Code Execution**: Provide Notebooks to demonstrate how to reproduce results.
 
 
-## Code Review & Maintenance
+### Code Review & Maintenance
 
-- **Code Reviews**: Regular peer reviews for quality assurance. Refer to the [GitHub Repo Guide](https://imageomics.github.io/Imageomics-guide/wiki-guide/GitHub-Repo-Guide/).
+- **Code Reviews**: Regular peer reviews for quality assurance. Refer to the [GitHub Repo Guide](GitHub-Repo-Guide.md/).
 - **Issue Tracking**: Use GitHub issues for tracking bugs and feature requests.
 - **Versioning**: Tag releases, changelogs can be auto-generated and informative when PRs are appropriately scoped.
 
 
-## Installation and Dependencies
+### Installation and Dependencies
 
 - [ ] **Environment Setup**: Include setup instructions (e.g., `conda` environment file, `Dockerfile`).
 - [ ] **Dependency Management**: Use virtual environments (e.g., `venv`, `conda`, `uv` for Python) to isolate dependencies.
diff --git a/docs/wiki-guide/Data-Checklist.md b/docs/wiki-guide/Data-Checklist.md
index e8ac7aa..b5bc922 100644
--- a/docs/wiki-guide/Data-Checklist.md
+++ b/docs/wiki-guide/Data-Checklist.md
@@ -1,5 +1,5 @@
 # Dataset Card Checklist
-Below is a checklist encompassing all sections of a dataset card. Review notes and guidance provided in the full [datatset card template](https://imageomics.github.io/Imageomics-guide/wiki-guide/HF_DatasetCard_Template_mkdocs/) for more details.
+Below is a checklist encompassing all sections of a dataset card. Review notes and guidance provided in the full [datatset card template](HF_DatasetCard_Template_mkdocs.md/) for more details.
 
 ## General Information
 
@@ -26,7 +26,7 @@ Below is a checklist encompassing all sections of a dataset card. Review notes a
 
 ## Dataset Structure
 
-- [ ] **Data Format**: Describe the structure of the dataset. See guidance on formatting in the [full dataset card template](https://imageomics.github.io/Imageomics-guide/wiki-guide/HF_DatasetCard_Template_mkdocs/).
+- [ ] **Data Format**: Describe the structure of the dataset. See guidance on formatting in the [full dataset card template](HF_DatasetCard_Template_mkdocs.md/).
 - [ ] **Data Instances**: Describe data files.
 Ex: All images are named `.png`, each within a folder named for the species. They are 1024 x 1024, and the color has been standardized using ``.
 - [ ] **Data Fields**: Describe the types of the data files or the columns in a CSV with metadata.
@@ -35,7 +35,7 @@ Ex: All images are named `.png`, each within a folder named for the spec
 ---
 
 ## Dataset Creation
-Refer to examples and explanations provided in the full [dataset card template](https://imageomics.github.io/Imageomics-guide/wiki-guide/HF_DatasetCard_Template_mkdocs/). Much of this should have been filled out before leaving Hawaii.
+Refer to examples and explanations provided in the full [dataset card template](HF_DatasetCard_Template_mkdocs.md/). Much of this should have been filled out before leaving Hawaii.
 
 - [ ] **Curation Rationale**: Explain why this dataset was created.
 - [ ] **Source Data**: Describe the source data.
@@ -57,7 +57,7 @@ Things to consider while working with the dataset. For instance, maybe there are
 ---
 
 ## Licensing Information
-See discussion and references in [template](https://imageomics.github.io/Imageomics-guide/wiki-guide/About-Templates/), also remember the [digital product release and licensing policy](https://imageomics.github.io/Imageomics-guide/wiki-guide/Digital-products-release-licensing-policy/).
+See discussion and references in [template](About-Templates.md/), also remember the [digital product release and licensing policy](Digital-products-release-licensing-policy.md/).
 
 - [ ] **Licensing Details**: Confirm and list all licensing details.
 
diff --git a/docs/wiki-guide/Model-Checklist.md b/docs/wiki-guide/Model-Checklist.md
index c421f79..f2c065d 100644
--- a/docs/wiki-guide/Model-Checklist.md
+++ b/docs/wiki-guide/Model-Checklist.md
@@ -1,6 +1,6 @@
 # Model Card Checklist
 
-Below is a checklist encompassing all sections of a model card. Review notes and guidance provided in the full [model card template](https://imageomics.github.io/Imageomics-guide/wiki-guide/HF_ModelCard_Template_mkdocs/) for more details.
+Below is a checklist encompassing all sections of a model card. Review notes and guidance provided in the full [model card template](HF_ModelCard_Template_mkdocs.md/) for more details.
 
 ## General Information
 
@@ -94,7 +94,7 @@ This section describes the evaluation protocols and provides the results.
 ---
 
 ## Licensing and Citation
-See discussion and references in [template](https://imageomics.github.io/Imageomics-guide/wiki-guide/About-Templates/), also remember the [digital product release and licensing policy](https://imageomics.github.io/Imageomics-guide/wiki-guide/Digital-products-release-licensing-policy/).
+See discussion and references in [template](About-Templates.md/), also remember the [digital product release and licensing policy](Digital-products-release-licensing-policy.md/).
 
 - [ ] **License**: Confirm licensing details.
 - [ ] **Citation**: Provide a BibTeX citation for the model and associated paper.

From 532bb036df0d80e05717e940c4a815d336a0a0b6 Mon Sep 17 00:00:00 2001
From: egrace479
Date: Mon, 17 Feb 2025 12:51:30 -0500
Subject: [PATCH 03/33] update URLs to more precise references

link to relevant lines of templates and specific section of GH guide
---
 docs/wiki-guide/Code-Checklist.md  | 2 +-
 docs/wiki-guide/Data-Checklist.md  | 8 ++++----
 docs/wiki-guide/Model-Checklist.md | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/wiki-guide/Code-Checklist.md b/docs/wiki-guide/Code-Checklist.md
index 3f30c54..78482fc 100644
--- a/docs/wiki-guide/Code-Checklist.md
+++ b/docs/wiki-guide/Code-Checklist.md
@@ -60,7 +60,7 @@ The [Repo Guide](GitHub-Repo-Guide.md/) provides general guidance on repository
 
 ### Code Review & Maintenance
 
-- **Code Reviews**: Regular peer reviews for quality assurance. Refer to the [GitHub Repo Guide](GitHub-Repo-Guide.md/).
+- **Code Reviews**: Regular peer reviews for quality assurance. Refer to the [GitHub PR Review Guide](The-GitHub-Pull-Request-Guide.md/#2-review-a-pull-request).
 - **Issue Tracking**: Use GitHub issues for tracking bugs and feature requests.
 - **Versioning**: Tag releases, changelogs can be auto-generated and informative when PRs are appropriately scoped.
 
diff --git a/docs/wiki-guide/Data-Checklist.md b/docs/wiki-guide/Data-Checklist.md
index b5bc922..504509a 100644
--- a/docs/wiki-guide/Data-Checklist.md
+++ b/docs/wiki-guide/Data-Checklist.md
@@ -26,16 +26,16 @@ Below is a checklist encompassing all sections of a dataset card. Review notes a
 
 ## Dataset Structure
 
-- [ ] **Data Format**: Describe the structure of the dataset. See guidance on formatting in the [full dataset card template](HF_DatasetCard_Template_mkdocs.md/).
+- [ ] **Data Format**: Describe the structure of the dataset. See guidance on formatting in the [full dataset card template](HF_DatasetCard_Template_mkdocs.md/#__codelineno-0-71).
 - [ ] **Data Instances**: Describe data files.
 Ex: All images are named `.png`, each within a folder named for the species. They are 1024 x 1024, and the color has been standardized using ``.
-- [ ] **Data Fields**: Describe the types of the data files or the columns in a CSV with metadata.
+- [ ] **Data Fields**: Describe the types of the data files or the columns in a CSV with metadata ([example](HF_DatasetCard_Template_mkdocs.md/#__codelineno-0-114)).
 - [ ] **Data Splits**: Describe any splits (e.g., train, test, validation).
 
 ---
 
 ## Dataset Creation
-Refer to examples and explanations provided in the full [dataset card template](HF_DatasetCard_Template_mkdocs.md/). Much of this should have been filled out before leaving Hawaii.
+Refer to examples and explanations provided in the full [dataset card template](HF_DatasetCard_Template_mkdocs.md/#__codelineno-0-129). Much of this should have been filled out before leaving Hawaii.
 
 - [ ] **Curation Rationale**: Explain why this dataset was created.
 - [ ] **Source Data**: Describe the source data.
@@ -57,7 +57,7 @@ Things to consider while working with the dataset. For instance, maybe there are
 ---
 
 ## Licensing Information
-See discussion and references in [template](About-Templates.md/), also remember the [digital product release and licensing policy](Digital-products-release-licensing-policy.md/).
+See discussion and references in the [template](HF_DatasetCard_Template_mkdocs.md/#__codelineno-0-19), also remember the [digital product release and licensing policy](Digital-products-release-licensing-policy.md/).
 
 - [ ] **Licensing Details**: Confirm and list all licensing details.
 
diff --git a/docs/wiki-guide/Model-Checklist.md b/docs/wiki-guide/Model-Checklist.md
index f2c065d..55464eb 100644
--- a/docs/wiki-guide/Model-Checklist.md
+++ b/docs/wiki-guide/Model-Checklist.md
@@ -94,7 +94,7 @@ This section describes the evaluation protocols and provides the results.
 ---
 
 ## Licensing and Citation
-See discussion and references in [template](About-Templates.md/), also remember the [digital product release and licensing policy](Digital-products-release-licensing-policy.md/).
+See discussion and references in the [template](HF_ModelCard_Template_mkdocs.md/#__codelineno-0-19), also remember the [digital product release and licensing policy](Digital-products-release-licensing-policy.md/).
 
 - [ ] **License**: Confirm licensing details.
 - [ ] **Citation**: Provide a BibTeX citation for the model and associated paper.
From c04c618cd89397882e4bc966af66feb8ab7f50fd Mon Sep 17 00:00:00 2001 From: egrace479 Date: Mon, 17 Feb 2025 12:53:59 -0500 Subject: [PATCH 04/33] Add references to other checklists --- docs/wiki-guide/Metadata-Guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/wiki-guide/Metadata-Guide.md b/docs/wiki-guide/Metadata-Guide.md index 0891c94..d0851b5 100644 --- a/docs/wiki-guide/Metadata-Guide.md +++ b/docs/wiki-guide/Metadata-Guide.md @@ -5,7 +5,7 @@ When collecting or compiling new data, there are generally questions one is _try To improve both the _**Findability**_ and _**Reusability**_ of your data (ensuring [FAIR principles](Glossary-for-Imageomics.md#fair-data-principles)) for yourself and others, be sure to note down the following information. !!! note "This is not an exhaustive list." - Be sure to include any other information that may be important to your particular project or field. + Be sure to include any other information that may be important to your particular project or field. See, for instance, the [Code](Code-Checklist.md), [Data](Data-Checklist.md), and [Model](Model-Checklist.md) Checklists included in this section. 
## Checklist for Metadata to Record - [ ] **Description:** Summary of your data, for instance: From 8a4c1382d202c55937bf563e2e5283f63a2a7752 Mon Sep 17 00:00:00 2001 From: egrace479 Date: Mon, 17 Feb 2025 12:59:14 -0500 Subject: [PATCH 05/33] update checklist descriptions at tops of pages --- docs/wiki-guide/Code-Checklist.md | 6 +++++- docs/wiki-guide/Data-Checklist.md | 4 ++++ docs/wiki-guide/Model-Checklist.md | 4 ++++ 3 files changed, 13 insertions(+), 1 deletion(-) diff --git a/docs/wiki-guide/Code-Checklist.md b/docs/wiki-guide/Code-Checklist.md index 78482fc..4df9add 100644 --- a/docs/wiki-guide/Code-Checklist.md +++ b/docs/wiki-guide/Code-Checklist.md @@ -1,6 +1,10 @@ # Code Checklist -This checklist provides expectations for the code repositories created for the Experiential AI & Ecology Course (Spring Semester 2025). +This checklist provides an overview of essential and recommended elements to include in a GitHub repository to ensure that it conforms to FAIR principles and best practices for reproducibility. + +!!! tip "Pro tip" + + Use the eye icon at the top of this page to access the source and copy the markdown for the checklist above into an issue on your GitHub [Repo](GitHub-Repo-Guide.md) or [Project](Guide-to-GitHub-Projects.md) so you can check the boxes as you add each element to your GitHub repository. ## Required Files - [ ] **License**: Verify and include an appropriate license (e.g., `MIT`, `CC0-1.0`, etc.). See discussion in the [guide](GitHub-Repo-Guide.md/#license). diff --git a/docs/wiki-guide/Data-Checklist.md b/docs/wiki-guide/Data-Checklist.md index 504509a..0ba1f62 100644 --- a/docs/wiki-guide/Data-Checklist.md +++ b/docs/wiki-guide/Data-Checklist.md @@ -1,6 +1,10 @@ # Dataset Card Checklist Below is a checklist encompassing all sections of a dataset card. Review notes and guidance provided in the full [datatset card template](HF_DatasetCard_Template_mkdocs.md/) for more details. +!!! 
tip "Pro tip" + + Use the eye icon at the top of this page to access the source and copy the markdown for the checklist above into an issue on your GitHub [Repo](GitHub-Repo-Guide.md) or [Project](Guide-to-GitHub-Projects.md) so you can check the boxes as you add each element to your [dataset card](HF_DatasetCard_Template_mkdocs.md). + ## General Information - [ ] **License**: Verify and specify the license type (e.g., `cc0-1.0`). diff --git a/docs/wiki-guide/Model-Checklist.md b/docs/wiki-guide/Model-Checklist.md index 55464eb..27f9aa8 100644 --- a/docs/wiki-guide/Model-Checklist.md +++ b/docs/wiki-guide/Model-Checklist.md @@ -2,6 +2,10 @@ Below is a checklist encompassing all sections of a model card. Review notes and guidance provided in the full [model card template](HF_ModelCard_Template_mkdocs.md/) for more details. +!!! tip "Pro tip" + + Use the eye icon at the top of this page to access the source and copy the markdown for the checklist above into an issue on your GitHub [Repo](GitHub-Repo-Guide.md) or [Project](Guide-to-GitHub-Projects.md) so you can check the boxes as you add each element to your [model card](HF_ModelCard_Template_mkdocs.md). + ## General Information - [ ] **Model Name**: Provide the name of the model. 
From 86f99bb29b24233bc59809fae0d8bc7d3ced1d6b Mon Sep 17 00:00:00 2001 From: egrace479 Date: Wed, 19 Mar 2025 18:50:48 -0400 Subject: [PATCH 06/33] Add link to repo issues to encourage dialog in case of questions/comments --- docs/wiki-guide/Code-Checklist.md | 2 ++ docs/wiki-guide/Data-Checklist.md | 4 +++- docs/wiki-guide/Model-Checklist.md | 2 ++ 3 files changed, 7 insertions(+), 1 deletion(-) diff --git a/docs/wiki-guide/Code-Checklist.md b/docs/wiki-guide/Code-Checklist.md index 4df9add..614cac5 100644 --- a/docs/wiki-guide/Code-Checklist.md +++ b/docs/wiki-guide/Code-Checklist.md @@ -106,3 +106,5 @@ The [Repo Guide](GitHub-Repo-Guide.md/) provides general guidance on repository - [ ] **Packaging**: Provide installation instructions (e.g., `setup.py`, `hatch`, `poetry`, `uv` for Python). - [ ] **Deployment Guide**: Document deployment procedures + +!!! question "[Questions, Comments, or Concerns?](https://github.com/Imageomics/Imageomics-guide/issues)" diff --git a/docs/wiki-guide/Data-Checklist.md b/docs/wiki-guide/Data-Checklist.md index 0ba1f62..9351c0a 100644 --- a/docs/wiki-guide/Data-Checklist.md +++ b/docs/wiki-guide/Data-Checklist.md @@ -39,7 +39,7 @@ Ex: All images are named `.png`, each within a folder named for the spec --- ## Dataset Creation -Refer to examples and explanations provided in the full [dataset card template](HF_DatasetCard_Template_mkdocs.md/#__codelineno-0-129). Much of this should have been filled out before leaving Hawaii. +Refer to examples and explanations provided in the full [dataset card template](HF_DatasetCard_Template_mkdocs.md/#__codelineno-0-129). - [ ] **Curation Rationale**: Explain why this dataset was created. - [ ] **Source Data**: Describe the source data. @@ -101,3 +101,5 @@ See discussion and references in the [template](HF_DatasetCard_Template_mkdocs.m ## Dataset Card Contact - [ ] **Contact Information**: [OPTIONAL] We recommend people use HF discussions, but you may indicate a person to contact. + +!!! 
question "[Questions, Comments, or Concerns?](https://github.com/Imageomics/Imageomics-guide/issues)" diff --git a/docs/wiki-guide/Model-Checklist.md b/docs/wiki-guide/Model-Checklist.md index 27f9aa8..f792c9b 100644 --- a/docs/wiki-guide/Model-Checklist.md +++ b/docs/wiki-guide/Model-Checklist.md @@ -122,3 +122,5 @@ See discussion and references in the [template](HF_ModelCard_Template_mkdocs.md/ - [ ] **Notes**: Include any other relevant details. - [ ] **Model Card Authors**: List contributors to the model card. - [ ] **Model Card Contact**: [OPTIONAL] We recommend people use HF discussions, but you may indicate a person to contact. + +!!! question "[Questions, Comments, or Concerns?](https://github.com/Imageomics/Imageomics-guide/issues)" From 6a0ba1ebe7a8e16ead2780b7a995ff0b2ca6c339 Mon Sep 17 00:00:00 2001 From: egrace479 Date: Tue, 8 Apr 2025 12:38:50 -0400 Subject: [PATCH 07/33] Add description of what to expect from using checklist --- docs/wiki-guide/Code-Checklist.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/wiki-guide/Code-Checklist.md b/docs/wiki-guide/Code-Checklist.md index 614cac5..cc85111 100644 --- a/docs/wiki-guide/Code-Checklist.md +++ b/docs/wiki-guide/Code-Checklist.md @@ -1,6 +1,7 @@ # Code Checklist -This checklist provides an overview of essential and recommended elements to include in a GitHub repository to ensure that it conforms to FAIR principles and best practices for reproducibility. +This checklist provides an overview of essential and recommended elements to include in a GitHub repository to ensure that it conforms to FAIR principles and best practices for reproducibility. Along with the generation of a DOI (see [DOI Generation](DOI-Generation.md) and [Digital Products Release and Licensing Policy](Digital-products-release-licensing-policy.md)), following this checklist ensures compliance with the FAIR Principles for research software.[^1] +[^1]: Barker, M., Chue Hong, N. P., Katz, D. S., Lamprecht, A. 
L., Martinez-Ortiz, C., Psomopoulos, F., Harrow, J., Castro, L. J., Gruenpeter, M., Martinez, P. A., & Honeyman, T. (2022). Introducing the FAIR Principles for research software. _Scientific data_, 9(1), 622. https://doi.org/10.1038/s41597-022-01710-x. !!! tip "Pro tip" From 6039addfb200104fedc9424455201d905e98383f Mon Sep 17 00:00:00 2001 From: egrace479 Date: Tue, 8 Apr 2025 12:39:40 -0400 Subject: [PATCH 08/33] Rename to checklist retain filename for consistency of links --- docs/wiki-guide/Metadata-Guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/wiki-guide/Metadata-Guide.md b/docs/wiki-guide/Metadata-Guide.md index d0851b5..1615001 100644 --- a/docs/wiki-guide/Metadata-Guide.md +++ b/docs/wiki-guide/Metadata-Guide.md @@ -1,4 +1,4 @@ -# Metadata Guide +# Metadata Checklist When collecting or compiling new data, there are generally questions one is _trying_ to answer. There are also often questions that will come up later—whether for yourself or others interested in using your data. From 9f4facf6b0ea5a1ebc96b2c982da7bd45a579240 Mon Sep 17 00:00:00 2001 From: egrace479 Date: Tue, 8 Apr 2025 12:40:32 -0400 Subject: [PATCH 09/33] Add help link at bottom of page keep consistent with other pages in section --- docs/wiki-guide/DOI-Generation.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/wiki-guide/DOI-Generation.md b/docs/wiki-guide/DOI-Generation.md index 9430c9b..08f1348 100644 --- a/docs/wiki-guide/DOI-Generation.md +++ b/docs/wiki-guide/DOI-Generation.md @@ -75,3 +75,4 @@ When creating a new record on Zenodo, please ensure that other members of your p [Dryad](https://datadryad.org/stash/about) is another research data repository, similar to Zenodo, through which one can archive digital objects (such as, but not limited to, data) supporting scholarly publications, and obtain a DOI. It has a review process when depositing data and requires dedication to the public domain (CC0) of all digital objects uploaded. 
Imageomics through OSU is a member organization of Dryad, reducing or eliminating data deposit charge(s). To determine whether Dryad is a suitable archive for Institute data products supporting your publication, please consider the [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf) for more information, and consult with the Institute's Senior Data Scientist.[^1] +!!! question "[Questions, Comments, or Concerns?](https://github.com/Imageomics/Imageomics-guide/issues)" From 868ba65ed2995fc8196f00d543851bb6adfcff6d Mon Sep 17 00:00:00 2001 From: egrace479 Date: Tue, 8 Apr 2025 12:41:27 -0400 Subject: [PATCH 10/33] Add page explaining FAIR and providing context for checklists --- docs/wiki-guide/FAIR-Guide.md | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) create mode 100644 docs/wiki-guide/FAIR-Guide.md diff --git a/docs/wiki-guide/FAIR-Guide.md b/docs/wiki-guide/FAIR-Guide.md new file mode 100644 index 0000000..1ec8968 --- /dev/null +++ b/docs/wiki-guide/FAIR-Guide.md @@ -0,0 +1,33 @@ +# FAIR Guide + +This section provides information and resources to help ensure that digital products are ***F***indable ***A***ccessible ***I***nteroperable ***R***eusable and Reproducible. A general [Metadata Checklist](Metadata-Guide.md) is provided to start one thinking about the type of information to be collected. Additionally, we include checklists for [code](Code-Checklist.md), [data](Data-Checklist.md), and [model](Model-Checklist.md) repositories. The code checklist focuses on the contents of a well-documented GitHub repository, while the data and model checklists cover the content of the [data](HF_DatasetCard_Template_mkdocs.md/) and [model](HF_ModelCard_Template_mkdocs.md/) card templates, respectively. + +Each checklist was developed following the FAIR principles (as defined by the [Go-FAIR Initiative](https://www.go-fair.org/fair-principles/)). 
They provide a detailed outline of tasks and files to include to ensure alignment with the FAIR principles, and are complementary to the descriptions provided within the [GitHub](GitHub-Repo-Guide.md) and [Hugging Face](Hugging-Face-Repo-Guide.md) Guides presented on this site. As with the contents of these Guides, these checklists are based on a combination of existing guides (e.g., [The Turing Way](https://book.the-turing-way.org/), the [Model Card Guidebook](https://huggingface.co/docs/hub/en/model-card-annotated), and the [Dataset Card Creation Guide](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md)) and the experiences of our team. Following these checklists ensures digital products are aligned with FAIR principles and a best-effort toward reproducibility.[^1] + +!!! tip "Pro tip" + + Use the eye icon at the top of any checklist page to access the source and copy the markdown for the checklist into an issue on your GitHub [Repo](GitHub-Repo-Guide.md) or [Project](Guide-to-GitHub-Projects.md) so you can check the boxes as you add each. When added to the main description of the issue, the issue summary will show _x_ out of total components completed for that issue. + +The last topic in this section discusses different methods of [DOI Generation](DOI-Generation.md) for digital products (code, data, and models). It focuses on our selected method for dataset publication: [Hugging Face](https://huggingface.co/), with some guidance on using [Zenodo](https://zenodo.org/) to archive code (specifically, a GitHub repository). 
For more information about other common data publication venues—and to see the thought process behind our selection—see the [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf) for more information.[^2] Generating a DOI for a digital product is part of ensuring a globally unique and persistent identifier that can be used to reference and refer back to a digital product—an important component of FAIR and Reproducible principles. + +!!! info "References and Background" + If you want to learn more about FAIR and Reproducible principles, explore these resources that we used when creating this guide: + + - [The Turing Way](https://book.the-turing-way.org/): an open-source, community data science handbook. It provides a strong foundation on the guiding principles for _this_ Guide, providing accessible explanations and overviews of topics from [reproducibility](https://book.the-turing-way.org/reproducible-research/reproducible-research), to [collaboration](https://book.the-turing-way.org/collaboration/collaboration) and [communication](https://book.the-turing-way.org/communication/communication), to [project design](https://book.the-turing-way.org/project-design/project-design), to [ethical research](https://book.the-turing-way.org/ethical-research/ethical-research). + - This is a particularly good resource for those [just starting to use `git` and GitHub](https://book.the-turing-way.org/reproducible-research/vcs/vcs-git). It builds motivation for use of version control through the lens of reproducibility. + - Go-FAIR Initiative: [The FAIR Principles](https://www.go-fair.org/fair-principles/) + - Ozoani, Ezi and Gerchick, Marissa and Mitchell, Margaret. Model Card Guidebook. Hugging Face, 2022. [https://huggingface.co/docs/hub/en/model-card-guidebook](https://huggingface.co/docs/hub/en/model-card-guidebook). 
+ - They also provide a nice [summary of related work](https://huggingface.co/docs/hub/en/model-card-landscape-analysis), including [Datasheets for Datasets (Gebru, et al., 2018)](https://www.fatml.org/media/documents/datasheets_for_datasets.pdf) and The Dataset Nutrition Label ([label](https://datanutrition.org/labels/), [paper](https://arxiv.org/abs/1805.03677)). + - Wilkinson, M., Dumontier, M., Aalbersberg, I. _et al._ The FAIR Guiding Principles for scientific data management and stewardship. _Sci Data_ **3**, 160018 (2016). [10.1038/sdata.2016.18](https://doi.org/10.1038/sdata.2016.18) + - Barker, M., Chue Hong, N.P., Katz, D.S. _et al._ Introducing the FAIR Principles for research software. _Sci Data_ **9**, 622 (2022). [10.1038/s41597-022-01710-x](https://doi.org/10.1038/s41597-022-01710-x) + - Balk, M. A., Bradley, J., Maruf, M., Altintaş, B., Bakiş, Y., Bart, H. L. Jr, Breen, D., Florian, C. R., Greenberg, J., Karpatne, A., Karnani, K., Mabee, P., Pepper, J., Jebbia, D., Tabarin, T., Wang, X., & Lapp, H. (2024). A FAIR and modular image-based workflow for knowledge discovery in the emerging field of imageomics. _Methods in Ecology and Evolution_, 15, 1129–1145. [10.1111/2041-210X.14327](https://doi.org/10.1111/2041-210X.14327) + - The [FARR Research Coordination Network](https://www.farr-rcn.org/) has a number of interesting resources and events. + - The [Research Data Alliance for Interdisciplinary Research](https://www.rd-alliance.org/disciplines/rda-for-interdisciplinary-research/) also provides links to resources and events particularly focused on considerations in interdisciplinary research. + + + + +!!! 
question "[Questions, Comments, or Concerns?](https://github.com/Imageomics/Imageomics-guide/issues)" + +[^1]: Full reproducibility is difficult to achieve; this [presentation](https://drive.google.com/file/d/1BFqZ00zMuyVHaD9A8PvzRDEg7aV0kp3W/view?usp=drive_link) by Odd Erik Gundersen provides a discussion of the varying degrees of reproducibility and useful references when considering the level of reproducibility achieved by a given project. +[^2]: The [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf) was created in May 2023 when we were deciding Institute archive recommendations, so it does not include information about newer features such as [Hugging Face's dataset viewer](https://huggingface.co/docs/hub/en/datasets-viewer), which greatly simplifies previewing datasets for downstream users. From caf6e544a359303745dfeeeda7892d45e8c06383 Mon Sep 17 00:00:00 2001 From: egrace479 Date: Tue, 8 Apr 2025 12:42:25 -0400 Subject: [PATCH 11/33] Reformulate Metadata Guide as FAIR Guide include all checklists and new FAIR page in navigation renames Metadata Guide Page as Metadata Checklist for navigation bar --- mkdocs.yaml | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/mkdocs.yaml b/mkdocs.yaml index ea00471..b02b64f 100644 --- a/mkdocs.yaml +++ b/mkdocs.yaml @@ -102,8 +102,12 @@ nav: - "Workflow": wiki-guide/The-Hugging-Face-Workflow.md - "Dataset Upload Guide": wiki-guide/The-Hugging-Face-Dataset-Upload-Guide.md - "Why Use the Institute Hugging Face": wiki-guide/Why-use-the-Institute-Hugging-Face.md - - Metadata Guide: - - "Metadata Guide": wiki-guide/Metadata-Guide.md + - FAIR Guide: + - "FAIR Guide": wiki-guide/FAIR-Guide.md + - "Metadata Checklist": wiki-guide/Metadata-Guide.md + - "Code Repo Checklist": wiki-guide/Code-Checklist.md + - "Data Card Checklist": wiki-guide/Data-Checklist.md + - "Model Card Checklist": wiki-guide/Model-Checklist.md - "DOI Generation": 
wiki-guide/DOI-Generation.md - Templates: - "About Templates": wiki-guide/About-Templates.md From ef6701edc08c41e786e307e3bcb1720c771863cd Mon Sep 17 00:00:00 2001 From: egrace479 Date: Wed, 9 Apr 2025 11:07:42 -0400 Subject: [PATCH 12/33] Rename Metadata-Guide file to Metadata-Checklist for consistency with other checklists Adjust descriptions when referencing it accordingly --- docs/index.md | 2 +- docs/wiki-guide/Digital-products-release-licensing-policy.md | 2 +- docs/wiki-guide/FAIR-Guide.md | 2 +- docs/wiki-guide/{Metadata-Guide.md => Metadata-Checklist.md} | 0 mkdocs.yaml | 2 +- 5 files changed, 4 insertions(+), 4 deletions(-) rename docs/wiki-guide/{Metadata-Guide.md => Metadata-Checklist.md} (100%) diff --git a/docs/index.md b/docs/index.md index 8e3103f..a77dfa6 100644 --- a/docs/index.md +++ b/docs/index.md @@ -15,7 +15,7 @@ Check out our guides to get your project off on the right foot! - [The Hugging Face Repo Guide](wiki-guide/Hugging-Face-Repo-Guide.md): Analogous expected and suggested repository contents for Hugging Face repositories; there are notable differences from GitHub in both content and structure. -- [Metadata Guide](wiki-guide/Metadata-Guide.md): Guide to metadata collection and documentation. This closely follows our [HF Dataset Card Template](wiki-guide/HF_DatasetCard_Template_mkdocs.md) sections. +- [FAIR Guide](wiki-guide/FAIR-Guide.md): Guide to producing FAIR digital products, from metadata collection through product documentation and publication. This builds on the content in both the GitHub and Hugging Face Repository Guides, providing checklists to ensure [code](wiki-guide/Code-Checklist.md), [data](wiki-guide/Data-Checklist.md), and [model](wiki-guide/Model-Checklist.md) repositories are FAIR. The latter two closely follow our [HF Templates](wiki-guide/About-Templates.md). ### Project repo up, what's next? 
Check out our workflow guides for how to interact with your new repo: diff --git a/docs/wiki-guide/Digital-products-release-licensing-policy.md b/docs/wiki-guide/Digital-products-release-licensing-policy.md index f16d10b..15316f6 100644 --- a/docs/wiki-guide/Digital-products-release-licensing-policy.md +++ b/docs/wiki-guide/Digital-products-release-licensing-policy.md @@ -24,7 +24,7 @@ This means the following policy applies for digital products of the Imageomics I - For ML-ready datasets, for storage, version control, and sharing we recommend using [Hugging Face Dataset Hub](https://huggingface.co/docs/hub/datasets-overview), which provides for rich metadata description in the form of a [Dataset Card](HF_DatasetCard_Template_mkdocs.md). (See [Imageomics datasets](https://huggingface.co/imageomics) published there as examples.) - - Refer to the Imageomics [Hugging Face](Hugging-Face-Repo-Guide.md) and [Metadata](Metadata-Guide.md) guides for best-practices and further guidance. + - Refer to the Imageomics [Hugging Face](Hugging-Face-Repo-Guide.md) and [FAIR](FAIR-Guide.md) guides for best-practices and further guidance. 4. ML models are to be released under an [OSI-approved open source license](https://opensource.org/licenses/) or to the public domain (for example, by applying a [CC-Zero](https://creativecommons.org/publicdomain/zero/1.0/) waiver). In the case of potentially sensitive models or data (e.g., endangered species information), an Open [Responsible AI License](https://www.licenses.ai/ai-licenses) ([Open RAIL-M](https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses)) may be considered. 
diff --git a/docs/wiki-guide/FAIR-Guide.md b/docs/wiki-guide/FAIR-Guide.md index 1ec8968..88b5801 100644 --- a/docs/wiki-guide/FAIR-Guide.md +++ b/docs/wiki-guide/FAIR-Guide.md @@ -1,6 +1,6 @@ # FAIR Guide -This section provides information and resources to help ensure that digital products are ***F***indable ***A***ccessible ***I***nteroperable ***R***eusable and Reproducible. A general [Metadata Checklist](Metadata-Guide.md) is provided to start one thinking about the type of information to be collected. Additionally, we include checklists for [code](Code-Checklist.md), [data](Data-Checklist.md), and [model](Model-Checklist.md) repositories. The code checklist focuses on the contents of a well-documented GitHub repository, while the data and model checklists cover the content of the [data](HF_DatasetCard_Template_mkdocs.md/) and [model](HF_ModelCard_Template_mkdocs.md/) card templates, respectively. +This section provides information and resources to help ensure that digital products are ***F***indable ***A***ccessible ***I***nteroperable ***R***eusable and Reproducible. A general [Metadata Checklist](Metadata-Checklist.md) is provided to start one thinking about the type of information to be collected. Additionally, we include checklists for [code](Code-Checklist.md), [data](Data-Checklist.md), and [model](Model-Checklist.md) repositories. The code checklist focuses on the contents of a well-documented GitHub repository, while the data and model checklists cover the content of the [data](HF_DatasetCard_Template_mkdocs.md/) and [model](HF_ModelCard_Template_mkdocs.md/) card templates, respectively. Each checklist was developed following the FAIR principles (as defined by the [Go-FAIR Initiative](https://www.go-fair.org/fair-principles/)). 
They provide a detailed outline of tasks and files to include to ensure alignment with the FAIR principles, and are complementary to the descriptions provided within the [GitHub](GitHub-Repo-Guide.md) and [Hugging Face](Hugging-Face-Repo-Guide.md) Guides presented on this site. As with the contents of these Guides, these checklists are based on a combination of existing guides (e.g., [The Turing Way](https://book.the-turing-way.org/), the [Model Card Guidebook](https://huggingface.co/docs/hub/en/model-card-annotated), and the [Dataset Card Creation Guide](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md)) and the experiences of our team. Following these checklists ensures digital products are aligned with FAIR principles and a best-effort toward reproducibility.[^1] diff --git a/docs/wiki-guide/Metadata-Guide.md b/docs/wiki-guide/Metadata-Checklist.md similarity index 100% rename from docs/wiki-guide/Metadata-Guide.md rename to docs/wiki-guide/Metadata-Checklist.md diff --git a/mkdocs.yaml b/mkdocs.yaml index b02b64f..344996a 100644 --- a/mkdocs.yaml +++ b/mkdocs.yaml @@ -104,7 +104,7 @@ nav: - "Why Use the Institute Hugging Face": wiki-guide/Why-use-the-Institute-Hugging-Face.md - FAIR Guide: - "FAIR Guide": wiki-guide/FAIR-Guide.md - - "Metadata Checklist": wiki-guide/Metadata-Guide.md + - "Metadata Checklist": wiki-guide/Metadata-Checklist.md - "Code Repo Checklist": wiki-guide/Code-Checklist.md - "Data Card Checklist": wiki-guide/Data-Checklist.md - "Model Card Checklist": wiki-guide/Model-Checklist.md From e211756ffe8a5ef4ecee48cd875bfe4ecd5b1a64 Mon Sep 17 00:00:00 2001 From: egrace479 Date: Wed, 9 Apr 2025 12:58:57 -0400 Subject: [PATCH 13/33] Remove course-specific aspect of note --- docs/wiki-guide/Code-Checklist.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/wiki-guide/Code-Checklist.md b/docs/wiki-guide/Code-Checklist.md index cc85111..3381342 100644 --- a/docs/wiki-guide/Code-Checklist.md +++ 
b/docs/wiki-guide/Code-Checklist.md @@ -48,13 +48,13 @@ This checklist provides an overview of essential and recommended elements to inc !!! note - The best practices described below will help you meet the above requirements. The more advanced development practices noted further down are included for educational purposes and since some groups have chosen to use linters. They are not required for this course; however, we recommend having a group discussion about the topics covered in [Code Quality](#code-quality). + The best practices described below will help you meet the above requirements. The more advanced development practices noted further down are included for educational purposes and are highly recommended—though these may go beyond what is expected for a given project, we advise collaborators to at least have a discussion about the topics covered in [Code Quality](#code-quality) and whether other practices discussed would be appropriate for their project. --- ## Best Practices -The [Repo Guide](GitHub-Repo-Guide.md/) provides general guidance on repository structure, [collaborative workflow](The-GitHub-Workflow.md/), and [how to make and review pull requests (PR)](The-GitHub-Pull-Request-Guide.md/). Below, we highlight some best practices in checklist form that are recommended for this course and beyond. +The [Repo Guide](GitHub-Repo-Guide.md/) provides general guidance on repository structure, [collaborative workflow](The-GitHub-Workflow.md/), and [how to make and review pull requests (PR)](The-GitHub-Pull-Request-Guide.md/). Below, we highlight some best practices in checklist form to help you meet the requirements described above for a FAIR and Reproducible project. 
### Reproducibility From 0eba37ddfdbead903b10fea2a599d215c241826c Mon Sep 17 00:00:00 2001 From: Graham Taylor Date: Fri, 11 Apr 2025 11:01:15 -0400 Subject: [PATCH 14/33] chore: linting --- docs/wiki-guide/FAIR-Guide.md | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/docs/wiki-guide/FAIR-Guide.md b/docs/wiki-guide/FAIR-Guide.md index 88b5801..4359acc 100644 --- a/docs/wiki-guide/FAIR-Guide.md +++ b/docs/wiki-guide/FAIR-Guide.md @@ -4,7 +4,7 @@ This section provides information and resources to help ensure that digital prod Each checklist was developed following the FAIR principles (as defined by the [Go-FAIR Initiative](https://www.go-fair.org/fair-principles/)). They provide a detailed outline of tasks and files to include to ensure alignment with the FAIR principles, and are complementary to the descriptions provided within the [GitHub](GitHub-Repo-Guide.md) and [Hugging Face](Hugging-Face-Repo-Guide.md) Guides presented on this site. As with the contents of these Guides, these checklists are based on a combination of existing guides (e.g., [The Turing Way](https://book.the-turing-way.org/), the [Model Card Guidebook](https://huggingface.co/docs/hub/en/model-card-annotated), and the [Dataset Card Creation Guide](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md)) and the experiences of our team. Following these checklists ensures digital products are aligned with FAIR principles and a best-effort toward reproducibility.[^1] -!!! tip "Pro tip" +!!! tip "Pro tip" Use the eye icon at the top of any checklist page to access the source and copy the markdown for the checklist into an issue on your GitHub [Repo](GitHub-Repo-Guide.md) or [Project](Guide-to-GitHub-Projects.md) so you can check the boxes as you add each. When added to the main description of the issue, the issue summary will show _x_ out of total components completed for that issue. 
@@ -24,9 +24,6 @@ The last topic in this section discusses different methods of [DOI Generation](D - The [FARR Research Coordination Network](https://www.farr-rcn.org/) has a number of interesting resources and events. - The [Research Data Alliance for Interdisciplinary Research](https://www.rd-alliance.org/disciplines/rda-for-interdisciplinary-research/) also provides links to resources and events particularly focused on considerations in interdisciplinary research. - - - !!! question "[Questions, Comments, or Concerns?](https://github.com/Imageomics/Imageomics-guide/issues)" [^1]: Full reproducibility is difficult to achieve; this [presentation](https://drive.google.com/file/d/1BFqZ00zMuyVHaD9A8PvzRDEg7aV0kp3W/view?usp=drive_link) by Odd Erik Gundersen provides a discussion of the varying degrees of reproducibility and useful references when considering the level of reproducibility achieved by a given project. From cec31d894959f30ebae84a57595d6e585c84697b Mon Sep 17 00:00:00 2001 From: Graham Taylor Date: Fri, 11 Apr 2025 13:44:53 -0400 Subject: [PATCH 15/33] Update FAIR Guide navigation to clarify section title --- mkdocs.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mkdocs.yaml b/mkdocs.yaml index 344996a..11bd1e2 100644 --- a/mkdocs.yaml +++ b/mkdocs.yaml @@ -103,7 +103,7 @@ nav: - "Dataset Upload Guide": wiki-guide/The-Hugging-Face-Dataset-Upload-Guide.md - "Why Use the Institute Hugging Face": wiki-guide/Why-use-the-Institute-Hugging-Face.md - FAIR Guide: - - "FAIR Guide": wiki-guide/FAIR-Guide.md + - "About FAIR Principles": wiki-guide/FAIR-Guide.md - "Metadata Checklist": wiki-guide/Metadata-Checklist.md - "Code Repo Checklist": wiki-guide/Code-Checklist.md - "Data Card Checklist": wiki-guide/Data-Checklist.md - "Model Card Checklist": wiki-guide/Model-Checklist.md From 71cdd3bdcd7ba0812036c849dc6da055cd47105a Mon Sep 17 00:00:00 2001 From: Graham Taylor Date: Fri, 11 Apr 2025 13:45:08 -0400 Subject: [PATCH 16/33] Clarify reproducibility context in FAIR Guide and update references --- 
docs/wiki-guide/FAIR-Guide.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/docs/wiki-guide/FAIR-Guide.md b/docs/wiki-guide/FAIR-Guide.md index 4359acc..22b0c9f 100644 --- a/docs/wiki-guide/FAIR-Guide.md +++ b/docs/wiki-guide/FAIR-Guide.md @@ -1,14 +1,14 @@ # FAIR Guide -This section provides information and resources to help ensure that digital products are ***F***indable ***A***ccessible ***I***nteroperable ***R***eusable and Reproducible. A general [Metadata Checklist](Metadata-Checklist.md) is provided to start one thinking about the type of information to be collected. Additionally, we include checklists for [code](Code-Checklist.md), [data](Data-Checklist.md), and [model](Model-Checklist.md) repositories. The code checklist focuses on the contents of a well-documented GitHub repository, while the data and model checklists cover the content of the [data](HF_DatasetCard_Template_mkdocs.md/) and [model](HF_ModelCard_Template_mkdocs.md/) card templates, respectively. +This section provides information and resources to help ensure that digital products are ***F***indable ***A***ccessible ***I***nteroperable ***R***eusable and Reproducible[^1]. A general [Metadata Checklist](Metadata-Checklist.md) is provided to start one thinking about the type of information to be collected. Additionally, we include checklists for [code](Code-Checklist.md), [data](Data-Checklist.md), and [model](Model-Checklist.md) repositories. The code checklist focuses on the contents of a well-documented GitHub repository, while the data and model checklists cover the content of the [data](HF_DatasetCard_Template_mkdocs.md/) and [model](HF_ModelCard_Template_mkdocs.md/) card templates, respectively. -Each checklist was developed following the FAIR principles (as defined by the [Go-FAIR Initiative](https://www.go-fair.org/fair-principles/)). 
They provide a detailed outline of tasks and files to include to ensure alignment with the FAIR principles, and are complementary to the descriptions provided within the [GitHub](GitHub-Repo-Guide.md) and [Hugging Face](Hugging-Face-Repo-Guide.md) Guides presented on this site. As with the contents of these Guides, these checklists are based on a combination of existing guides (e.g., [The Turing Way](https://book.the-turing-way.org/), the [Model Card Guidebook](https://huggingface.co/docs/hub/en/model-card-annotated), and the [Dataset Card Creation Guide](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md)) and the experiences of our team. Following these checklists ensures digital products are aligned with FAIR principles and a best-effort toward reproducibility.[^1] +Each checklist was developed following the FAIR principles (as defined by the [Go-FAIR Initiative](https://www.go-fair.org/fair-principles/)). They provide a detailed outline of tasks and files to include to ensure alignment with the FAIR principles, and are complementary to the descriptions provided within the [GitHub](GitHub-Repo-Guide.md) and [Hugging Face](Hugging-Face-Repo-Guide.md) Guides presented on this site. As with the contents of these Guides, these checklists are based on a combination of existing guides (e.g., [The Turing Way](https://book.the-turing-way.org/), the [Model Card Guidebook](https://huggingface.co/docs/hub/en/model-card-annotated), and the [Dataset Card Creation Guide](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md)) and the experiences of our team. Following these checklists ensures digital products are aligned with FAIR principles and a best-effort toward reproducibility.[^2] !!! 
tip "Pro tip" Use the eye icon at the top of any checklist page to access the source and copy the markdown for the checklist into an issue on your GitHub [Repo](GitHub-Repo-Guide.md) or [Project](Guide-to-GitHub-Projects.md) so you can check the boxes as you add each. When added to the main description of the issue, the issue summary will show _x_ out of total components completed for that issue. -The last topic in this section discusses different methods of [DOI Generation](DOI-Generation.md) for digital products (code, data, and models). It focuses on our selected method for dataset publication: [Hugging Face](https://huggingface.co/), with some guidance on using [Zenodo](https://zenodo.org/) to archive code (specifically, a GitHub repository). For more information about other common data publication venues—and to see the thought process behind our selection—see the [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf) for more information.[^2] Generating a DOI for a digital product is part of ensuring a globally unique and persistent identifier that can be used to reference and refer back to a digital product—an important component of FAIR and Reproducible principles. +The last topic in this section discusses different methods of [DOI Generation](DOI-Generation.md) for digital products (code, data, and models). It focuses on our selected method for dataset publication: [Hugging Face](https://huggingface.co/), with some guidance on using [Zenodo](https://zenodo.org/) to archive code (specifically, a GitHub repository). 
For more information about other common data publication venues—and to see the thought process behind our selection—see the [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf) for more information.[^3] Generating a DOI for a digital product is part of ensuring a globally unique and persistent identifier that can be used to reference and refer back to a digital product—an important component of FAIR and Reproducible principles. !!! info "References and Background" If you want to learn more about FAIR and Reproducible principles, explore these resources that we used when creating this guide: @@ -26,5 +26,6 @@ The last topic in this section discusses different methods of [DOI Generation](D !!! question "[Questions, Comments, or Concerns?](https://github.com/Imageomics/Imageomics-guide/issues)" -[^1]: Full reproducibility is difficult to achieve; this [presentation](https://drive.google.com/file/d/1BFqZ00zMuyVHaD9A8PvzRDEg7aV0kp3W/view?usp=drive_link) by Odd Erik Gundersen provides a discussion of the varying degrees of reproducibilityand useful references when considering the level of reproducibility achieved by a given project. -[^2]: The [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf) was created in May 2023 when we were deciding Institute archive recommendations, so it does not include information about newer features such as [Hugging Face's dataset viewer](https://huggingface.co/docs/hub/en/datasets-viewer), which greatly simplifies previewing datasets for downstream users. +[^1]: While "Reproducible" is not part of the original FAIR principles as defined by the [Go-FAIR Initiative](https://www.go-fair.org/fair-principles/), we include it here to emphasize the importance of computational reproducibility alongside data stewardship. 
This extension reflects emerging practice in data-intensive science, where code, models, and workflows must be reusable and verifiable to support robust scientific claims. It is not part of the formal FAIR acronym, but aligns with broader community goals for open and transparent research. +[^2]: Full reproducibility is difficult to achieve; this [presentation](https://drive.google.com/file/d/1BFqZ00zMuyVHaD9A8PvzRDEg7aV0kp3W/view?usp=drive_link) by Odd Erik Gundersen provides a discussion of the varying degrees of reproducibilityand useful references when considering the level of reproducibility achieved by a given project. +[^3]: The [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf) was created in May 2023 when we were deciding Institute archive recommendations, so it does not include information about newer features such as [Hugging Face's dataset viewer](https://huggingface.co/docs/hub/en/datasets-viewer), which greatly simplifies previewing datasets for downstream users. From 5382a0f81a9cfe966c2b35604782cfc0594a8db1 Mon Sep 17 00:00:00 2001 From: Graham Taylor Date: Fri, 11 Apr 2025 14:15:27 -0400 Subject: [PATCH 17/33] lint Metadata-Checklist.md --- docs/wiki-guide/Metadata-Checklist.md | 31 ++++++++++++++------------- 1 file changed, 16 insertions(+), 15 deletions(-) diff --git a/docs/wiki-guide/Metadata-Checklist.md b/docs/wiki-guide/Metadata-Checklist.md index 1615001..36cca85 100644 --- a/docs/wiki-guide/Metadata-Checklist.md +++ b/docs/wiki-guide/Metadata-Checklist.md @@ -1,6 +1,6 @@ # Metadata Checklist -When collecting or compiling new data, there are generally questions one is _trying_ to answer. There are also often questions that will come up later—whether for yourself or others interested in using your data. +When collecting or compiling new data, there are generally questions one is _trying_ to answer. 
There are also often questions that will come up later—whether for yourself or others interested in using your data. To improve both the _**Findability**_ and _**Reusability**_ of your data (ensuring [FAIR principles](Glossary-for-Imageomics.md#fair-data-principles)) for yourself and others, be sure to note down the following information. @@ -8,35 +8,36 @@ To improve both the _**Findability**_ and _**Reusability**_ of your data (ensuri Be sure to include any other information that may be important to your particular project or field. See, for instance, the [Code](Code-Checklist.md), [Data](Data-Checklist.md), and [Model](Model-Checklist.md) Checklists included in this section. ## Checklist for Metadata to Record + - [ ] **Description:** Summary of your data, for instance: - - What are the contents of the data (images, text, type of animal)? - - Is it machine-ready? - - Where did it come from (Source)? + - What are the contents of the data (images, text, type of animal)? + - Is it machine-ready? + - Where did it come from (Source)? - [ ] **Data Sources:** Machine-readable sources of the data (links or other files). - [ ] **License Information:** This is part of retaining records of a data source (eg., museum images, previous dataset). A record of licenses on the images must be retained to ensure they are respected. If dealing with CC licenses, please see this [OSU Library CC best practices guide](https://library.osu.edu/sites/default/files/2022-10/attributing_cc_license_flyer_2022_ac.pdf). - [ ] **Dataset Structure:** - - Organization of the full dataset (eg., file structure). - - Feature information: Information available for each image, such as species and subspecies designations, location information, etc. - - Instance information: Image type (jpg, tiff, png), number of pixels per image, coloring (RGB, UV), presence of scale or color indicators (ruler or ColorChecker), etc. + - Organization of the full dataset (eg., file structure). 
+ - Feature information: Information available for each image, such as species and subspecies designations, location information, etc. + - Instance information: Image type (jpg, tiff, png), number of pixels per image, coloring (RGB, UV), presence of scale or color indicators (ruler or ColorChecker), etc. - [ ] **Processing Steps:** List modifications performed (as they're done) and include links to the code used (this _should_ be organized somewhere, like a GitHub repository). - - Similarly, include any annotation process information. + - Similarly, include any annotation process information. - [ ] **Tasks:** What could this dataset be used for (eg., image classification, feature extraction, image segmentation, etc.). - [ ] **Curation Rationale:** Why are you collecting and/or modifying this data? - - This ties into the question of tasks it could be applied to, both to help maintain the group focus, and increase the likelihood others interested in answering similar questions will be able to find and use your data. -- [ ] **Author:** The curator(s)/editor(s) of the data. Assumes sufficient modification of the data by you (and your team) or that you have collected it. - - If thinking about publishing the data, add ORCID to all Authors; these can be looked up on [orcid.org](https://orcid.org/). -- [ ] **Related Publication:** Any papers that are based on this dataset. + - This ties into the question of tasks it could be applied to, both to help maintain the group focus, and increase the likelihood others interested in answering similar questions will be able to find and use your data. +- [ ] **Author:** The curator(s)/editor(s) of the data. Assumes sufficient modification of the data by you (and your team) or that you have collected it. + - If thinking about publishing the data, add ORCID to all Authors; these can be looked up on [orcid.org](https://orcid.org/). +- [ ] **Related Publication:** Any papers that are based on this dataset. 
- [ ] **Related Datasets:** Provide links to any related datasets (may include previous/background research). - [ ] **Other References:** Links to any related/background articles. - [ ] **Keywords/Tags:** Terms one might search to find this dataset, eg., type(s) of animals, type(s) of images, imbalanced (if not even distribution of species/subspecies/etc). - - It helps to keep a running list. + - It helps to keep a running list. - [ ] **Notes:** Any other image/data information. -!!! warning "Remember" +!!! warning "Remember" Datasets **_cannot_** be redistributed without this information. -!!! tip "Pro tip" +!!! tip "Pro tip" Use the eye icon at the top of this page to access the source and copy the markdown for the checklist above into an issue on your GitHub [Repo](GitHub-Repo-Guide.md) or [Project](Guide-to-GitHub-Projects.md) so you can check the boxes as you add each. From 5ea502ff6e7374dcc89128fb5a39e9717076d885 Mon Sep 17 00:00:00 2001 From: Graham Taylor Date: Fri, 11 Apr 2025 14:16:51 -0400 Subject: [PATCH 18/33] minor edits FAIR-Guide.md and Metadata-Checklist.md --- docs/wiki-guide/FAIR-Guide.md | 2 +- docs/wiki-guide/Metadata-Checklist.md | 12 ++++++------ 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/wiki-guide/FAIR-Guide.md b/docs/wiki-guide/FAIR-Guide.md index 22b0c9f..f182e50 100644 --- a/docs/wiki-guide/FAIR-Guide.md +++ b/docs/wiki-guide/FAIR-Guide.md @@ -17,7 +17,7 @@ The last topic in this section discusses different methods of [DOI Generation](D - This is a particularly good resource for those [just starting to use `git` and GitHub](https://book.the-turing-way.org/reproducible-research/vcs/vcs-git). It builds motivation for use of version control through the lens of reproducibility. - Go-FAIR Initiative: [The FAIR Principles](https://www.go-fair.org/fair-principles/) - Ozoani, Ezi and Gerchick, Marissa and Mitchell, Margaret. Model Card Guidebook. Hugging Face, 2022. 
[https://huggingface.co/docs/hub/en/model-card-guidebook](https://huggingface.co/docs/hub/en/model-card-guidebook). - - They also provide a nice [summary of related work](https://huggingface.co/docs/hub/en/model-card-landscape-analysis), including [Datasheets for Datasets (Gebru, et al., 2018)](https://www.fatml.org/media/documents/datasheets_for_datasets.pdf) and The Dataset Nutrition Label ([label](https://datanutrition.org/labels/), [paper](https://arxiv.org/abs/1805.03677)). + - The authors also provide a nice [summary of related work](https://huggingface.co/docs/hub/en/model-card-landscape-analysis), including [Datasheets for Datasets (Gebru, et al., 2018)](https://www.fatml.org/media/documents/datasheets_for_datasets.pdf) and The Dataset Nutrition Label ([label](https://datanutrition.org/labels/), [paper](https://arxiv.org/abs/1805.03677)). - Wilkinson, M., Dumontier, M., Aalbersberg, I. _et al._ The FAIR Guiding Principles for scientific data management and stewardship. _Sci Data_ **3**, 160018 (2016). [10.1038/sdata.2016.18](https://doi.org/10.1038/sdata.2016.18) - Barker, M., Chue Hong, N.P., Katz, D.S. _et al._ Introducing the FAIR Principles for research software. _Sci Data_ **9**, 622 (2022). [10.1038/s41597-022-01710-x](https://doi.org/10.1038/s41597-022-01710-x) - Balk, M. A., Bradley, J., Maruf, M., Altintaş, B., Bakiş, Y., Bart, H. L. Jr, Breen, D., Florian, C. R., Greenberg, J., Karpatne, A., Karnani, K., Mabee, P., Pepper, J., Jebbia, D., Tabarin, T., Wang, X., & Lapp, H. (2024). A FAIR and modular image-based workflow for knowledge discovery in the emerging field of imageomics. _Methods in Ecology and Evolution_, 15, 1129–1145. 
[10.1111/2041-210X.14327](https://doi.org/10.1111/2041-210X.14327) diff --git a/docs/wiki-guide/Metadata-Checklist.md b/docs/wiki-guide/Metadata-Checklist.md index 36cca85..2d9a1a5 100644 --- a/docs/wiki-guide/Metadata-Checklist.md +++ b/docs/wiki-guide/Metadata-Checklist.md @@ -14,22 +14,22 @@ To improve both the _**Findability**_ and _**Reusability**_ of your data (ensuri - Is it machine-ready? - Where did it come from (Source)? - [ ] **Data Sources:** Machine-readable sources of the data (links or other files). -- [ ] **License Information:** This is part of retaining records of a data source (eg., museum images, previous dataset). A record of licenses on the images must be retained to ensure they are respected. If dealing with CC licenses, please see this [OSU Library CC best practices guide](https://library.osu.edu/sites/default/files/2022-10/attributing_cc_license_flyer_2022_ac.pdf). +- [ ] **License Information:** This is part of retaining records of a data source (e.g., museum images, previous dataset). A record of licenses on the images must be retained to ensure they are respected. If dealing with CC licenses, please see this [OSU Library CC best practices guide](https://library.osu.edu/sites/default/files/2022-10/attributing_cc_license_flyer_2022_ac.pdf). - [ ] **Dataset Structure:** - - Organization of the full dataset (eg., file structure). + - Organization of the full dataset (e.g., file structure). - Feature information: Information available for each image, such as species and subspecies designations, location information, etc. - - Instance information: Image type (jpg, tiff, png), number of pixels per image, coloring (RGB, UV), presence of scale or color indicators (ruler or ColorChecker), etc. + - Instance information: Image type (jpg, tiff, png), number of pixels per image, color space (RGB, UV), presence of scale or color indicators (ruler or ColorChecker), etc. 
- [ ] **Processing Steps:** List modifications performed (as they're done) and include links to the code used (this _should_ be organized somewhere, like a GitHub repository). - Similarly, include any annotation process information. -- [ ] **Tasks:** What could this dataset be used for (eg., image classification, feature extraction, image segmentation, etc.). +- [ ] **Tasks:** What could this dataset be used for (e.g., image classification, feature extraction, image segmentation, etc.). - [ ] **Curation Rationale:** Why are you collecting and/or modifying this data? - This ties into the question of tasks it could be applied to, both to help maintain the group focus, and increase the likelihood others interested in answering similar questions will be able to find and use your data. - [ ] **Author:** The curator(s)/editor(s) of the data. Assumes sufficient modification of the data by you (and your team) or that you have collected it. - If thinking about publishing the data, add ORCID to all Authors; these can be looked up on [orcid.org](https://orcid.org/). -- [ ] **Related Publication:** Any papers that are based on this dataset. +- [ ] **Related Publications:** Any papers that are based on this dataset. - [ ] **Related Datasets:** Provide links to any related datasets (may include previous/background research). - [ ] **Other References:** Links to any related/background articles. -- [ ] **Keywords/Tags:** Terms one might search to find this dataset, eg., type(s) of animals, type(s) of images, imbalanced (if not even distribution of species/subspecies/etc). +- [ ] **Keywords/Tags:** Terms one might search to find this dataset, e.g., type(s) of animals, type(s) of images, imbalanced (if not even distribution of species/subspecies/etc). - It helps to keep a running list. - [ ] **Notes:** Any other image/data information. 
From 0bb6d90a5c6f56bfeaab2dd1ea015471f7d04001 Mon Sep 17 00:00:00 2001 From: Graham Taylor Date: Fri, 11 Apr 2025 14:28:59 -0400 Subject: [PATCH 19/33] lint Code-Checklist.md --- docs/wiki-guide/Code-Checklist.md | 42 ++++++++++++++----------------- 1 file changed, 19 insertions(+), 23 deletions(-) diff --git a/docs/wiki-guide/Code-Checklist.md b/docs/wiki-guide/Code-Checklist.md index 3381342..d60b643 100644 --- a/docs/wiki-guide/Code-Checklist.md +++ b/docs/wiki-guide/Code-Checklist.md @@ -1,31 +1,34 @@ # Code Checklist This checklist provides an overview of essential and recommended elements to include in a GitHub repository to ensure that it conforms to FAIR principles and best practices for reproducibility. Along with the generation of a DOI (see [DOI Generation](DOI-Generation.md) and [Digital Products Release and Licensing Policy](Digital-products-release-licensing-policy.md)), following this checklist ensures compliance with the FAIR Principles for research software.[^1] -[^1]: Barker, M., Chue Hong, N. P., Katz, D. S., Lamprecht, A. L., Martinez-Ortiz, C., Psomopoulos, F., Harrow, J., Castro, L. J., Gruenpeter, M., Martinez, P. A., & Honeyman, T. (2022). Introducing the FAIR Principles for research software. _Scientific data_, 9(1), 622. https://doi.org/10.1038/s41597-022-01710-x. +[^1]: Barker, M., Chue Hong, N. P., Katz, D. S., Lamprecht, A. L., Martinez-Ortiz, C., Psomopoulos, F., Harrow, J., Castro, L. J., Gruenpeter, M., Martinez, P. A., & Honeyman, T. (2022). Introducing the FAIR Principles for research software. _Scientific data_, 9(1), 622. [URL](https://doi.org/10.1038/s41597-022-01710-x). -!!! tip "Pro tip" +!!! tip "Pro tip" Use the eye icon at the top of this page to access the source and copy the markdown for the checklist above into an issue on your GitHub [Repo](GitHub-Repo-Guide.md) or [Project](Guide-to-GitHub-Projects.md) so you can check the boxes as you add each element to your GitHub repository. 
## Required Files + - [ ] **License**: Verify and include an appropriate license (e.g., `MIT`, `CC0-1.0`, etc.). See discussion in the [guide](GitHub-Repo-Guide.md/#license). - [ ] **README File**: Following the [guide](GitHub-Repo-Guide.md/#readme), provide a detailed `README.md` with: - - [ ] Overview of the project. - - [ ] Installation instructions. - - [ ] Basic usage examples. - - [ ] Links to related/created dataset(s). - - [ ] Links to related/created model(s). - - [ ] Acknowledge source code dependencies and contributors. - - [ ] Reference related datasets used in training or evaluation. + - [ ] Overview of the project. + - [ ] Installation instructions. + - [ ] Basic usage examples. + - [ ] Links to related/created dataset(s). + - [ ] Links to related/created model(s). + - [ ] Acknowledge source code dependencies and contributors. + - [ ] Reference related datasets used in training or evaluation. - [ ] **Requirements File**: Provide a [file detailing software requirements](GitHub-Repo-Guide.md/#software-requirements-file), such as a `requirements.txt` or `pyproject.toml` for Python dependencies. - [ ] **Gitignore File**: GitHub has premade `.gitignore` files ([here](https://github.com/github/gitignore)) tailored to particular languages (eg., [R](https://github.com/github/gitignore/blob/main/R.gitignore) or [Python](https://github.com/github/gitignore/blob/main/Python.gitignore)), operating systems, etc. - [ ] **CITATION CFF**: This facilitates citation of your work, follow guidance provided in the [guide](GitHub-Repo-Guide.md/#citation). ### Data-Related + - [ ] Preprocessing code. - [ ] Description of dataset(s), including description of training and testing sets (with links to relevant portions of dataset card, which will have more information). ### Model-Related + - [ ] Training code. - [ ] Inference/evaluation code. - [ ] Model weights (if not in Hugging Face model repository). 
@@ -41,12 +44,10 @@ This checklist provides an overview of essential and recommended elements to inc - [ ] **Code Comments**: Include meaningful inline comments and function descriptions for clarity. - [ ] **Random Seed Control**: Save random seeds to ensure reproducible results. - ## Security Considerations - [ ] **Sensitive Data Handling**: Ensure no hardcoded sensitive information (e.g., API keys, credentials) are included in your repository. These can be shared through a config file on OSC. - !!! note The best practices described below will help you meet the above requirements. The more advanced development practices noted further down are included for educational purposes and are highly recommended—though these may go beyond what is expected for a given project, we advise collaborators to at least have a discussion about the topics covered in [Code Quality](#code-quality) and whether other practices discussed would be appropriate for their project. @@ -62,14 +63,12 @@ The [Repo Guide](GitHub-Repo-Guide.md/) provides general guidance on repository - **Modularization**: Structure code into reusable and independent modules. - **Code Execution**: Provide Notebooks to demonstrate how to reproduce results. - ### Code Review & Maintenance - **Code Reviews**: Regular peer reviews for quality assurance. Refer to the [GitHub PR Review Guide](The-GitHub-Pull-Request-Guide.md/#2-review-a-pull-request). -- **Issue Tracking**: Use GitHub issues for tracking bugs and feature requests. +- **Issue Tracking**: Use GitHub issues for tracking bugs and feature requests. - **Versioning**: Tag releases, changelogs can be auto-generated and informative when PRs are appropriately scoped. - ### Installation and Dependencies - [ ] **Environment Setup**: Include setup instructions (e.g., `conda` environment file, `Dockerfile`). 
@@ -77,33 +76,30 @@ The [Repo Guide](GitHub-Repo-Guide.md/) provides general guidance on repository --- -# More Advanced Development +## More Advanced Development -## Documentation +### Documentation -- [ ] **API Documentation**: Generate API documentation (e.g., `MkDocs` for Python or wiki pages in the repo). +- [ ] **API Documentation**: Generate API documentation (e.g., `MkDocs` for Python or wiki pages in the repo). - [ ] **Docstrings**: Add comprehensive docstrings for all functions, classes, and modules. These can be incorporated to help generate documentation. - [ ] **Example Scripts**: Include example scripts for common use cases. - [ ] **Configuration Files**: Use `yaml`, `json`, or `ini` for configuration settings. - -## Code Quality +### Code Quality - [ ] **Consistent Style**: Follow coding style guidelines (e.g., `PEP 8` for Python). - [ ] **Linting**: Ensure the code passes a linter (e.g., `Ruff` for Python). - [ ] **Logging**: Use logging instead of print statements for better debugging (e.g., `logging` in Python). - [ ] **Error Handling**: Implement robust exception handling to avoid crashes. - -## Testing +### Testing - [ ] **Unit Tests**: Write unit tests to validate core functionality. - [ ] **Integration Tests**: Ensure components work together correctly. - [ ] **Test Coverage**: Check test coverage - [ ] **Continuous Integration (CI)**: Set up CI/CD pipelines (e.g., GitHub Actions) for automated testing. - -## Code Distribution & Deployment +### Code Distribution & Deployment - [ ] **Packaging**: Provide installation instructions (e.g., `setup.py`, `hatch`, `poetry`, `uv` for Python). 
- [ ] **Deployment Guide**: Document deployment procedures From 15e9bb749ef949a37856391c42a1cf3e9af7262f Mon Sep 17 00:00:00 2001 From: Graham Taylor Date: Fri, 11 Apr 2025 15:07:38 -0400 Subject: [PATCH 20/33] fix: move nested checklist to 4 spaces indent for Python-Markdown compatibility This seems to be related to [this issue](https://github.com/mkdocs/mkdocs/issues/545), [this issue](https://github.com/Python-Markdown/markdown/issues/3) and [this issue](https://github.com/Python-Markdown/markdown/issues/451) --- docs/wiki-guide/Code-Checklist.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/wiki-guide/Code-Checklist.md b/docs/wiki-guide/Code-Checklist.md index d60b643..fa33bc3 100644 --- a/docs/wiki-guide/Code-Checklist.md +++ b/docs/wiki-guide/Code-Checklist.md @@ -11,13 +11,13 @@ This checklist provides an overview of essential and recommended elements to inc - [ ] **License**: Verify and include an appropriate license (e.g., `MIT`, `CC0-1.0`, etc.). See discussion in the [guide](GitHub-Repo-Guide.md/#license). - [ ] **README File**: Following the [guide](GitHub-Repo-Guide.md/#readme), provide a detailed `README.md` with: - - [ ] Overview of the project. - - [ ] Installation instructions. - - [ ] Basic usage examples. - - [ ] Links to related/created dataset(s). - - [ ] Links to related/created model(s). - - [ ] Acknowledge source code dependencies and contributors. - - [ ] Reference related datasets used in training or evaluation. + - [ ] Overview of the project. + - [ ] Installation instructions. + - [ ] Basic usage examples. + - [ ] Links to related/created dataset(s). + - [ ] Links to related/created model(s). + - [ ] Acknowledge source code dependencies and contributors. + - [ ] Reference related datasets used in training or evaluation. 
- [ ] **Requirements File**: Provide a [file detailing software requirements](GitHub-Repo-Guide.md/#software-requirements-file), such as a `requirements.txt` or `pyproject.toml` for Python dependencies. - [ ] **Gitignore File**: GitHub has premade `.gitignore` files ([here](https://github.com/github/gitignore)) tailored to particular languages (eg., [R](https://github.com/github/gitignore/blob/main/R.gitignore) or [Python](https://github.com/github/gitignore/blob/main/Python.gitignore)), operating systems, etc. - [ ] **CITATION CFF**: This facilitates citation of your work, follow guidance provided in the [guide](GitHub-Repo-Guide.md/#citation). From 9485411b031d38bc45027790c2826c53e1df9cea Mon Sep 17 00:00:00 2001 From: Graham Taylor Date: Fri, 11 Apr 2025 15:49:29 -0400 Subject: [PATCH 21/33] revert to 4 space indentation due to issues with Python-Markdown and nested lists --- .markdownlint.json | 5 +++++ docs/wiki-guide/Metadata-Checklist.md | 20 ++++++++++---------- 2 files changed, 15 insertions(+), 10 deletions(-) create mode 100644 .markdownlint.json diff --git a/.markdownlint.json b/.markdownlint.json new file mode 100644 index 0000000..8d238b9 --- /dev/null +++ b/.markdownlint.json @@ -0,0 +1,5 @@ +{ + "MD007": { "indent": 4 }, + "no-hard-tabs": false, + "MD013": false +} \ No newline at end of file diff --git a/docs/wiki-guide/Metadata-Checklist.md b/docs/wiki-guide/Metadata-Checklist.md index 2d9a1a5..e82bfab 100644 --- a/docs/wiki-guide/Metadata-Checklist.md +++ b/docs/wiki-guide/Metadata-Checklist.md @@ -10,27 +10,27 @@ To improve both the _**Findability**_ and _**Reusability**_ of your data (ensuri ## Checklist for Metadata to Record - [ ] **Description:** Summary of your data, for instance: - - What are the contents of the data (images, text, type of animal)? - - Is it machine-ready? - - Where did it come from (Source)? + - What are the contents of the data (images, text, type of animal)? + - Is it machine-ready? + - Where did it come from (Source)? 
- [ ] **Data Sources:** Machine-readable sources of the data (links or other files). - [ ] **License Information:** This is part of retaining records of a data source (e.g., museum images, previous dataset). A record of licenses on the images must be retained to ensure they are respected. If dealing with CC licenses, please see this [OSU Library CC best practices guide](https://library.osu.edu/sites/default/files/2022-10/attributing_cc_license_flyer_2022_ac.pdf). - [ ] **Dataset Structure:** - - Organization of the full dataset (e.g., file structure). - - Feature information: Information available for each image, such as species and subspecies designations, location information, etc. - - Instance information: Image type (jpg, tiff, png), number of pixels per image, color space (RGB, UV), presence of scale or color indicators (ruler or ColorChecker), etc. + - Organization of the full dataset (e.g., file structure). + - Feature information: Information available for each image, such as species and subspecies designations, location information, etc. + - Instance information: Image type (jpg, tiff, png), number of pixels per image, color space (RGB, UV), presence of scale or color indicators (ruler or ColorChecker), etc. - [ ] **Processing Steps:** List modifications performed (as they're done) and include links to the code used (this _should_ be organized somewhere, like a GitHub repository). - - Similarly, include any annotation process information. + - Similarly, include any annotation process information. - [ ] **Tasks:** What could this dataset be used for (e.g., image classification, feature extraction, image segmentation, etc.). - [ ] **Curation Rationale:** Why are you collecting and/or modifying this data? - - This ties into the question of tasks it could be applied to, both to help maintain the group focus, and increase the likelihood others interested in answering similar questions will be able to find and use your data. 
+ - This ties into the question of tasks it could be applied to, both to help maintain the group focus, and increase the likelihood others interested in answering similar questions will be able to find and use your data. - [ ] **Author:** The curator(s)/editor(s) of the data. Assumes sufficient modification of the data by you (and your team) or that you have collected it. - - If thinking about publishing the data, add ORCID to all Authors; these can be looked up on [orcid.org](https://orcid.org/). + - If thinking about publishing the data, add ORCID to all Authors; these can be looked up on [orcid.org](https://orcid.org/). - [ ] **Related Publications:** Any papers that are based on this dataset. - [ ] **Related Datasets:** Provide links to any related datasets (may include previous/background research). - [ ] **Other References:** Links to any related/background articles. - [ ] **Keywords/Tags:** Terms one might search to find this dataset, e.g., type(s) of animals, type(s) of images, imbalanced (if not even distribution of species/subspecies/etc). - - It helps to keep a running list. + - It helps to keep a running list. - [ ] **Notes:** Any other image/data information. !!! warning "Remember" From 4126ec3f0fabbaa0d76be553a920c366cca5af28 Mon Sep 17 00:00:00 2001 From: Graham Taylor Date: Fri, 11 Apr 2025 15:54:13 -0400 Subject: [PATCH 22/33] minor edits to Code Repo Checklist --- docs/wiki-guide/Code-Checklist.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/wiki-guide/Code-Checklist.md b/docs/wiki-guide/Code-Checklist.md index fa33bc3..34cc323 100644 --- a/docs/wiki-guide/Code-Checklist.md +++ b/docs/wiki-guide/Code-Checklist.md @@ -9,8 +9,8 @@ This checklist provides an overview of essential and recommended elements to inc ## Required Files -- [ ] **License**: Verify and include an appropriate license (e.g., `MIT`, `CC0-1.0`, etc.). See discussion in the [guide](GitHub-Repo-Guide.md/#license). 
-- [ ] **README File**: Following the [guide](GitHub-Repo-Guide.md/#readme), provide a detailed `README.md` with: +- [ ] **License**: Verify and include an appropriate license (e.g., `MIT`, `CC0-1.0`, etc.). See discussion in the [Repo Guide](GitHub-Repo-Guide.md/#license). +- [ ] **README File**: Following the [Repo Guide](GitHub-Repo-Guide.md/#readme), provide a detailed `README.md` with: - [ ] Overview of the project. - [ ] Installation instructions. - [ ] Basic usage examples. @@ -20,7 +20,7 @@ This checklist provides an overview of essential and recommended elements to inc - [ ] Reference related datasets used in training or evaluation. - [ ] **Requirements File**: Provide a [file detailing software requirements](GitHub-Repo-Guide.md/#software-requirements-file), such as a `requirements.txt` or `pyproject.toml` for Python dependencies. - [ ] **Gitignore File**: GitHub has premade `.gitignore` files ([here](https://github.com/github/gitignore)) tailored to particular languages (eg., [R](https://github.com/github/gitignore/blob/main/R.gitignore) or [Python](https://github.com/github/gitignore/blob/main/Python.gitignore)), operating systems, etc. -- [ ] **CITATION CFF**: This facilitates citation of your work, follow guidance provided in the [guide](GitHub-Repo-Guide.md/#citation). +- [ ] **CITATION CFF**: This facilitates citation of your work, follow guidance provided in the [Repo Guide](GitHub-Repo-Guide.md/#citation). ### Data-Related @@ -40,7 +40,7 @@ This checklist provides an overview of essential and recommended elements to inc ## General Information -- [ ] **Repository Structure**: Ensure the code repository follows a clear and logical directory structure. (See [guide](GitHub-Repo-Guide.md/#general-repository-structure).) +- [ ] **Repository Structure**: Ensure the code repository follows a clear and logical directory structure. (See [Repo Guide](GitHub-Repo-Guide.md/#general-repository-structure).) 
- [ ] **Code Comments**: Include meaningful inline comments and function descriptions for clarity. - [ ] **Random Seed Control**: Save random seeds to ensure reproducible results. @@ -72,7 +72,7 @@ The [Repo Guide](GitHub-Repo-Guide.md/) provides general guidance on repository ### Installation and Dependencies - [ ] **Environment Setup**: Include setup instructions (e.g., `conda` environment file, `Dockerfile`). -- [ ] **Dependency Management**: Use virtual environments (e.g., `venv`, `conda`, `uv` for Python) to isolate dependencies. +- [ ] **Dependency Management**: Use virtual environments and the frameworks that manage them (e.g., `venv`, `conda`, `uv` for Python) to isolate dependencies. --- From 729648f488c7f5390df3121e1b74f3db4d03ea9b Mon Sep 17 00:00:00 2001 From: Graham Taylor Date: Fri, 11 Apr 2025 15:58:27 -0400 Subject: [PATCH 23/33] fix: lint Data-Checklist.md --- docs/wiki-guide/Data-Checklist.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/docs/wiki-guide/Data-Checklist.md b/docs/wiki-guide/Data-Checklist.md index 9351c0a..36c1f0c 100644 --- a/docs/wiki-guide/Data-Checklist.md +++ b/docs/wiki-guide/Data-Checklist.md @@ -1,7 +1,8 @@ # Dataset Card Checklist + Below is a checklist encompassing all sections of a dataset card. Review notes and guidance provided in the full [datatset card template](HF_DatasetCard_Template_mkdocs.md/) for more details. -!!! tip "Pro tip" +!!! tip "Pro tip" Use the eye icon at the top of this page to access the source and copy the markdown for the checklist above into an issue on your GitHub [Repo](GitHub-Repo-Guide.md) or [Project](Guide-to-GitHub-Projects.md) so you can check the boxes as you add each element to your [dataset card](HF_DatasetCard_Template_mkdocs.md). 
@@ -39,6 +40,7 @@ Ex: All images are named `.png`, each within a folder named for the spec --- ## Dataset Creation + Refer to examples and explanations provided in the full [dataset card template](HF_DatasetCard_Template_mkdocs.md/#__codelineno-0-129). - [ ] **Curation Rationale**: Explain why this dataset was created. @@ -53,14 +55,16 @@ Refer to examples and explanations provided in the full [dataset card template]( --- ## Considerations for Using the Data + Things to consider while working with the dataset. For instance, maybe there are hybrids and they are labeled in the `hybrid_stat` column, so to get a subset without hybrids, subset to all instances in the metadata file such that `hybrid_stat` is _not_ "hybrid". -- [ ] **Bias, Risks, and Limitations**: Describe any known issues with the dataset. For instance, if your data exhibits a long-tailed distribution (and why). +- [ ] **Bias, Risks, and Limitations**: Describe any known issues with the dataset. For instance, if your data exhibits a long-tailed distribution (and why). - [ ] **Recommendations**: Provide recommendations for using the dataset responsibly. --- ## Licensing Information + See discussion and references in the [template](HF_DatasetCard_Template_mkdocs.md/#__codelineno-0-19), also remember the [digital product release and licensing policy](Digital-products-release-licensing-policy.md/). - [ ] **Licensing Details**: Confirm and list all licensing details. 
From 8249e0255ce0aee37f1330e9b33f226f664a508b Mon Sep 17 00:00:00 2001 From: Graham Taylor Date: Fri, 11 Apr 2025 16:06:34 -0400 Subject: [PATCH 24/33] minor edits to Data-Checklist.md --- docs/wiki-guide/Data-Checklist.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/docs/wiki-guide/Data-Checklist.md b/docs/wiki-guide/Data-Checklist.md index 36c1f0c..84b8a3f 100644 --- a/docs/wiki-guide/Data-Checklist.md +++ b/docs/wiki-guide/Data-Checklist.md @@ -1,6 +1,6 @@ # Dataset Card Checklist -Below is a checklist encompassing all sections of a dataset card. Review notes and guidance provided in the full [datatset card template](HF_DatasetCard_Template_mkdocs.md/) for more details. +Below is a checklist encompassing all sections of a dataset card. Review notes and guidance provided in the full [dataset card template](HF_DatasetCard_Template_mkdocs.md/) for more details. !!! tip "Pro tip" @@ -8,12 +8,12 @@ Below is a checklist encompassing all sections of a dataset card. Review notes a ## General Information -- [ ] **License**: Verify and specify the license type (e.g., `cc0-1.0`). +- [ ] **License**: Verify and specify the license type (e.g., `CC0-1.0`). - [ ] **Language**: Indicate the language(s) (e.g., `en`). - [ ] **Pretty Name**: Provide a descriptive name for the dataset. -- [ ] **Task Categories**: List relevant task categories (e.g., image-classification). Refer to [task categories](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/pipelines.ts). +- [ ] **Task Categories**: List relevant task categories (e.g., image-classification). Refer to [the coarse-grained taxonomy of task categories as well as subtasks in this file](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/pipelines.ts). - [ ] **Tags**: Include relevant tags (e.g., `biology`, `image`, `animals`, `CV`). -- [ ] **Size Categories**: Specify dataset size (e.g., `n<1K`, `1K.png`, each within a folder named for the species. 
They are 1024 x 1024, and the color has been standardized using ``. +E.g.: All images are named `.png`, each within a folder named for the species. They are 1024 x 1024, and the color has been standardized using ``. - [ ] **Data Fields**: Describe the types of the data files or the columns in a CSV with metadata ([example](HF_DatasetCard_Template_mkdocs.md/#__codelineno-0-114)). - [ ] **Data Splits**: Describe any splits (e.g., train, test, validation). @@ -56,10 +56,11 @@ Refer to examples and explanations provided in the full [dataset card template]( ## Considerations for Using the Data -Things to consider while working with the dataset. For instance, maybe there are hybrids and they are labeled in the `hybrid_stat` column, so to get a subset without hybrids, subset to all instances in the metadata file such that `hybrid_stat` is _not_ "hybrid". +There are several things to consider while working with the dataset that should be reported to users. For instance, maybe there are hybrids and they are labeled in the `hybrid_stat` column, so to get a subset without hybrids, subset to all instances in the metadata file such that `hybrid_stat` is _not_ "hybrid". - [ ] **Bias, Risks, and Limitations**: Describe any known issues with the dataset. For instance, if your data exhibits a long-tailed distribution (and why). - [ ] **Recommendations**: Provide recommendations for using the dataset responsibly. +- [ ] **Reporting issues**: Provide a link to the issue tracker or other mechanism for reporting problems (e.g. mislabeling, corrupted images, etc.). 
--- From 7140de0abe6d907a35d567fe209982c909db762c Mon Sep 17 00:00:00 2001 From: Graham Taylor Date: Fri, 11 Apr 2025 16:08:45 -0400 Subject: [PATCH 25/33] fix: lint Model-Checklist.md --- docs/wiki-guide/Model-Checklist.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/wiki-guide/Model-Checklist.md b/docs/wiki-guide/Model-Checklist.md index f792c9b..6727194 100644 --- a/docs/wiki-guide/Model-Checklist.md +++ b/docs/wiki-guide/Model-Checklist.md @@ -2,7 +2,7 @@ Below is a checklist encompassing all sections of a model card. Review notes and guidance provided in the full [model card template](HF_ModelCard_Template_mkdocs.md/) for more details. -!!! tip "Pro tip" +!!! tip "Pro tip" Use the eye icon at the top of this page to access the source and copy the markdown for the checklist above into an issue on your GitHub [Repo](GitHub-Repo-Guide.md) or [Project](Guide-to-GitHub-Projects.md) so you can check the boxes as you add each element to your [model card](HF_ModelCard_Template_mkdocs.md). @@ -42,7 +42,7 @@ Below is a checklist encompassing all sections of a model card. Review notes and ## Bias, Risks, and Limitations - [ ] **Bias, Risks, and Limitations**: Discuss potential biases and in the model, along with possible mitigations. -- [ ] **Recommendations**: Provide responsible usage recommendations with respect to the bias, risk, and technical limitations. +- [ ] **Recommendations**: Provide responsible usage recommendations with respect to the bias, risk, and technical limitations. --- @@ -64,6 +64,7 @@ Below is a checklist encompassing all sections of a model card. Review notes and --- ## Evaluation + This section describes the evaluation protocols and provides the results. - [ ] **Testing Data**: Describe the dataset used for testing. This should link to a Dataset Card if possible, otherwise link to the original source with more info. @@ -98,6 +99,7 @@ This section describes the evaluation protocols and provides the results. 
--- ## Licensing and Citation + See discussion and references in the [template](HF_ModelCard_Template_mkdocs.md/#__codelineno-0-19), also remember the [digital product release and licensing policy](Digital-products-release-licensing-policy.md/). - [ ] **License**: Confirm licensing details. From 4de950622ddc7c43b8eb7d9a72502ef242fb0ee5 Mon Sep 17 00:00:00 2001 From: Graham Taylor Date: Fri, 11 Apr 2025 16:15:27 -0400 Subject: [PATCH 26/33] minor edits to Model-Checklist.md --- docs/wiki-guide/Model-Checklist.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/docs/wiki-guide/Model-Checklist.md b/docs/wiki-guide/Model-Checklist.md index 6727194..7b90bce 100644 --- a/docs/wiki-guide/Model-Checklist.md +++ b/docs/wiki-guide/Model-Checklist.md @@ -10,10 +10,10 @@ Below is a checklist encompassing all sections of a model card. Review notes and - [ ] **Model Name**: Provide the name of the model. - [ ] **Model Summary**: Provide a quick summary of what the model is/does -- [ ] **License**: Choose an appropriate license (e.g., `cc0-1.0`). +- [ ] **License**: Choose an appropriate license (e.g., `CC0-1.0`). - [ ] **Language(s)**: Specify the language(s) used (e.g., `en`). - [ ] **Tags**: Include relevant tags (e.g., `biology`, `CV`, `images`, `animals`). -- [ ] **Datasets**: List datasets used for training, linking if hosted on Hugging Face. Ex: imageomics/TreeOfLife-10M +- [ ] **Datasets**: List datasets used for training, linking if hosted on Hugging Face. E.g.: imageomics/TreeOfLife-10M - [ ] **Metrics**: Specify key evaluation metrics (refer to [Hugging Face metrics list](https://hf.co/metrics)). --- @@ -93,7 +93,9 @@ This section describes the evaluation protocols and provides the results. ## Technical Specifications -- [ ] **Model Architecture**: Provide a detailed architecture description and the objective behind. +- [ ] **Model Architecture**: Provide a detailed architecture description and the choices behind its selection. 
+- [ ] **Performance Metrics**: List performance metrics and their significance. +- [ ] **Model Size**: Specify the model size in MB. - [ ] **Compute Requirements**: List hardware and software requirements. --- From f626ad69611520cbdbb6ea0fe77badf9422079f3 Mon Sep 17 00:00:00 2001 From: Graham Taylor Date: Fri, 11 Apr 2025 16:22:04 -0400 Subject: [PATCH 27/33] fix: lint DOI-Generation.md --- docs/wiki-guide/DOI-Generation.md | 12 ++++-------- 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/docs/wiki-guide/DOI-Generation.md b/docs/wiki-guide/DOI-Generation.md index 08f1348..883f3b0 100644 --- a/docs/wiki-guide/DOI-Generation.md +++ b/docs/wiki-guide/DOI-Generation.md @@ -7,25 +7,22 @@ You are likely familiar with DOIs from citing (journal/arXiv/conference) papers, A DOI (Digital Object Identifier) is a _persistent_ (permanent) digital identifier for any object (data, model, code, etc.) that _uniquely_ distinguishes it from other objects and links to information—metadata—about the object. The International DOI Foundation (IDF) is responsible for developing and administering the DOI system. See their [What is a DOI](https://www.doi.org/the-identifier/what-is-a-doi/) article for more information. - ## How do you generate a DOI? When publishing code, data, or models, there are various options for DOI generation, and selecting one is generally dependent on where the object of interest is published. We will go over the two standard methods used by the Institute here, and we mention a third option for completeness. A comparison of these three options is provided in the [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf). - ### 1. Generate a DOI on Hugging Face -This is the simplest method for generating a DOI for a model or dataset since [Hugging Face partnered with DataCite to offer this option](https://huggingface.co/blog/introducing-doi). 
+This is the simplest method for generating a DOI for a model or dataset since [Hugging Face partnered with DataCite to offer this option](https://huggingface.co/blog/introducing-doi). !!! warning "Warning" - Though it is a very simple process, it is not one to be taken lightly, as there is no removing data once this has been done--any changes require generation of a ***new*** DOI for the updated version: the old version will be maintained in perpetuity! + Though it is a very simple process, it is not one to be taken lightly, as there is no removing data once this has been done--any changes require generation of a _**new**_ DOI for the updated version: the old version will be maintained in perpetuity! !!! warning "Warning" As stated in the [Imageomics Digital Products Release and Licensing Policy](Digital-products-release-licensing-policy.md), DOIs are not to be generated for Imageomics Organization Repositories until approval has been granted by the Senior Data Scientist or Institute Leadership. Hugging Face allows for the generation of a DOI through the settings tab on the Model or Dataset. For details on _how_ to generate a DOI with Hugging Face, please see the [Hugging Face DOI Documentation](https://huggingface.co/docs/hub/doi). - ### 2. Generate a DOI with Zenodo This is the most common method used for generating a DOI for a GitHub repository, because [Zenodo](https://zenodo.org/) has a [GitHub integration](https://zenodo.org/account/settings/github/), which is accessed through your Zenodo account settings (for more information, please see [GitHub's associated Docs](https://docs.github.com/articles/referencing-and-citing-content)). Zenodo can also be used to generate DOIs for data, as is relatively common in biology. 
However, for direct use of ML models and datasets, there are many more advantages to using Hugging Face; please see the [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf) for more information.[^1] @@ -38,11 +35,11 @@ When your GitHub and Zenodo accounts are linked, there will be a list of availab ![Zenodo instructions and enabled repos](images/doi-generation/enabled_repos+intstructions.png){ loading=lazy, width="800" } !!! info "The Sync now button" - There is a "Sync now" button at the top right of the instructions, with information on when the last sync occurred. Observe that a badge appears for the enabled repository that _has_ a DOI, while the one without just shows up as enabled; this will also be true for repositories to which you have access but that you did not submit to Zenodo yourself. + There is a "Sync now" button at the top right of the instructions, with information on when the last sync occurred. Observe that a badge appears for the enabled repository that **_has_** a DOI, while the one without just shows up as enabled; this will also be true for repositories to which you have access but that you did not submit to Zenodo yourself. #### Metadata Tracking -When automatically generating a DOI with Zenodo, it uses information provided in your `CITATION.cff` file to populate the metadata for the record. However, there is important information that is not supported through this integration despite its inclusion in the `CITATION.cff` format in some cases. +When automatically generating a DOI with Zenodo, it uses information provided in your `CITATION.cff` file to populate the metadata for the record. However, there is important information that is not supported through this integration despite its inclusion in the `CITATION.cff` format in some cases. 
If your repository is likely to be updated repeatedly (i.e., generating new releases), then you may consider adding a `.zenodo.json` to preserve the remaining metadata on release sync with Zenodo for DOI. This metadata includes grant (funding) information, references (which may be included in your `CITATION.cff`), and a description of your repository/code. @@ -70,7 +67,6 @@ Building on the alternate edit options, there is also the option to simply gener When creating a new record on Zenodo, please ensure that other members of your project have access, as appropriate. In particular, there should be at least one member of Institute leadership or the Senior Data Scientist added to the record with management permissions. This ensures the ability to maintain the metadata and address matters related to the record (which may extend beyond your tenure with the Institute) in a timely manner. - ### 3. Generate a DOI with Dryad [Dryad](https://datadryad.org/stash/about) is another research data repository, similar to Zenodo, through which one can archive digital objects (such as, but not limited to, data) supporting scholarly publications, and obtain a DOI. It has a review process when depositing data and requires dedication to the public domain (CC0) of all digital objects uploaded. Imageomics through OSU is a member organization of Dryad, reducing or eliminating data deposit charge(s). 
To determine whether Dryad is a suitable archive for Institute data products supporting your publication, please consider the [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf) for more information, and consult with the Institute's Senior Data Scientist.[^1] From 37b4ceb7648d760596d5569a4c1243991a5a832b Mon Sep 17 00:00:00 2001 From: Graham Taylor Date: Fri, 11 Apr 2025 16:26:21 -0400 Subject: [PATCH 28/33] fix: improve flow of intro to DOI guide --- docs/wiki-guide/DOI-Generation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/wiki-guide/DOI-Generation.md b/docs/wiki-guide/DOI-Generation.md index 883f3b0..54aea60 100644 --- a/docs/wiki-guide/DOI-Generation.md +++ b/docs/wiki-guide/DOI-Generation.md @@ -1,7 +1,7 @@ # DOI Generation This guide discusses DOI generation for digital artifacts that may be associated with publications, such as datasets, models, and software. -You are likely familiar with DOIs from citing (journal/arXiv/conference) papers, for which they are generated by the publisher and regularly used in citations. However, they are also invaluable for proper citation of code, models, and data. One may think of this in the manner they are handled on arXiv, where there are options for "Cite as:" or "for this version" (with the "v#" at the end) option when citing a preprint. +You are likely familiar with DOIs from citing (journal/arXiv/conference) papers, for which they are generated by the publisher and regularly used in citations. However, they are also invaluable for proper citation of code, models, and data. Similar to how DOIs help track different versions of preprints on repositories like arXiv, they can provide persistent identification and versioning for your research artifacts beyond traditional publications. ## What is a DOI? 
From dfddaa027e0d1a3a5eb4921ea8d5fac9846a2b81 Mon Sep 17 00:00:00 2001 From: Graham Taylor Date: Fri, 11 Apr 2025 16:33:32 -0400 Subject: [PATCH 29/33] minor edits to DOI-Generation.md --- docs/wiki-guide/DOI-Generation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/wiki-guide/DOI-Generation.md b/docs/wiki-guide/DOI-Generation.md index 54aea60..281f983 100644 --- a/docs/wiki-guide/DOI-Generation.md +++ b/docs/wiki-guide/DOI-Generation.md @@ -5,7 +5,7 @@ You are likely familiar with DOIs from citing (journal/arXiv/conference) papers, ## What is a DOI? -A DOI (Digital Object Identifier) is a _persistent_ (permanent) digital identifier for any object (data, model, code, etc.) that _uniquely_ distinguishes it from other objects and links to information—metadata—about the object. The International DOI Foundation (IDF) is responsible for developing and administering the DOI system. See their [What is a DOI](https://www.doi.org/the-identifier/what-is-a-doi/) article for more information. +A DOI (Digital Object Identifier) is a _persistent_ (permanent) digital identifier for any object (data, model, code, etc.) that _uniquely_ distinguishes it from other objects and links to information—metadata—about the object. The International DOI Foundation (IDF) is responsible for developing and administering the DOI system. See their [What is a DOI?](https://www.doi.org/the-identifier/what-is-a-doi/) article for more information. ## How do you generate a DOI? From a9ce38ca3d3ff89e0ec6e2f51c2ce475e7d36a2f Mon Sep 17 00:00:00 2001 From: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com> Date: Thu, 24 Apr 2025 12:57:02 -0400 Subject: [PATCH 30/33] Ensure examples are viable format for HF cards The Hugging Face interface specifically requires lowercase for licenses in the yaml portion. 
--- docs/wiki-guide/Data-Checklist.md | 2 +- docs/wiki-guide/Model-Checklist.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/wiki-guide/Data-Checklist.md b/docs/wiki-guide/Data-Checklist.md index 84b8a3f..e8326b9 100644 --- a/docs/wiki-guide/Data-Checklist.md +++ b/docs/wiki-guide/Data-Checklist.md @@ -8,7 +8,7 @@ Below is a checklist encompassing all sections of a dataset card. Review notes a ## General Information -- [ ] **License**: Verify and specify the license type (e.g., `CC0-1.0`). +- [ ] **License**: Verify and specify the license type (e.g., `cc0-1.0`). - [ ] **Language**: Indicate the language(s) (e.g., `en`). - [ ] **Pretty Name**: Provide a descriptive name for the dataset. - [ ] **Task Categories**: List relevant task categories (e.g., image-classification). Refer to [the coarse-grained taxonomy of task categories as well as subtasks in this file](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/pipelines.ts). diff --git a/docs/wiki-guide/Model-Checklist.md b/docs/wiki-guide/Model-Checklist.md index 7b90bce..a6c7530 100644 --- a/docs/wiki-guide/Model-Checklist.md +++ b/docs/wiki-guide/Model-Checklist.md @@ -10,7 +10,7 @@ Below is a checklist encompassing all sections of a model card. Review notes and - [ ] **Model Name**: Provide the name of the model. - [ ] **Model Summary**: Provide a quick summary of what the model is/does -- [ ] **License**: Choose an appropriate license (e.g., `CC0-1.0`). +- [ ] **License**: Choose an appropriate license (e.g., `cc0-1.0`). - [ ] **Language(s)**: Specify the language(s) used (e.g., `en`). - [ ] **Tags**: Include relevant tags (e.g., `biology`, `CV`, `images`, `animals`). - [ ] **Datasets**: List datasets used for training, linking if hosted on Hugging Face. 
E.g.: imageomics/TreeOfLife-10M From 7abbf6ddc3079a733fb148011f9fc60e65f9ca59 Mon Sep 17 00:00:00 2001 From: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com> Date: Thu, 24 Apr 2025 12:57:36 -0400 Subject: [PATCH 31/33] Add clarification on what constitutes an issue tracker for HF --- docs/wiki-guide/Data-Checklist.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/wiki-guide/Data-Checklist.md b/docs/wiki-guide/Data-Checklist.md index e8326b9..4948f5a 100644 --- a/docs/wiki-guide/Data-Checklist.md +++ b/docs/wiki-guide/Data-Checklist.md @@ -60,7 +60,7 @@ There are several things to consider while working with the dataset that should - [ ] **Bias, Risks, and Limitations**: Describe any known issues with the dataset. For instance, if your data exhibits a long-tailed distribution (and why). - [ ] **Recommendations**: Provide recommendations for using the dataset responsibly. -- [ ] **Reporting issues**: Provide a link to the issue tracker or other mechanism for reporting problems (e.g. mislabeling, corrupted images, etc.). +- [ ] **Reporting issues**: Provide a link to the issue tracker or other mechanism for reporting problems (e.g. mislabeling, corrupted images, etc.). This can simply be the Community tab for the repository or Issues on the associated GitHub repository. 
--- From eddffa92a4d182f90ce4e052eaa8ba795b115934 Mon Sep 17 00:00:00 2001 From: egrace479 Date: Fri, 25 Apr 2025 11:10:56 -0400 Subject: [PATCH 32/33] Adjust sub-bullet formatting for clearer comment on the source --- docs/wiki-guide/FAIR-Guide.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/wiki-guide/FAIR-Guide.md b/docs/wiki-guide/FAIR-Guide.md index f182e50..80b9161 100644 --- a/docs/wiki-guide/FAIR-Guide.md +++ b/docs/wiki-guide/FAIR-Guide.md @@ -14,10 +14,12 @@ The last topic in this section discusses different methods of [DOI Generation](D If you want to learn more about FAIR and Reproducible principles, explore these resources that we used when creating this guide: - [The Turing Way](https://book.the-turing-way.org/): an open-source, community data science handbook. It provides a strong foundation on the guiding principles for _this_ Guide, providing accessible explanations and overviews of topics from [reproducibility](https://book.the-turing-way.org/reproducible-research/reproducible-research), to [collaboration](https://book.the-turing-way.org/collaboration/collaboration) and [communication](https://book.the-turing-way.org/communication/communication), to [project design](https://book.the-turing-way.org/project-design/project-design), to [ethical research](https://book.the-turing-way.org/ethical-research/ethical-research). - - This is a particularly good resource for those [just starting to use `git` and GitHub](https://book.the-turing-way.org/reproducible-research/vcs/vcs-git). It builds motivation for use of version control through the lens of reproducibility. + + _This is a particularly good resource for those [just starting to use `git` and GitHub](https://book.the-turing-way.org/reproducible-research/vcs/vcs-git). 
It builds motivation for use of version control through the lens of reproducibility._ - Go-FAIR Initiative: [The FAIR Principles](https://www.go-fair.org/fair-principles/) - Ozoani, Ezi and Gerchick, Marissa and Mitchell, Margaret. Model Card Guidebook. Hugging Face, 2022. [https://huggingface.co/docs/hub/en/model-card-guidebook](https://huggingface.co/docs/hub/en/model-card-guidebook). - - The authors also provide a nice [summary of related work](https://huggingface.co/docs/hub/en/model-card-landscape-analysis), including [Datasheets for Datasets (Gebru, et al., 2018)](https://www.fatml.org/media/documents/datasheets_for_datasets.pdf) and The Dataset Nutrition Label ([label](https://datanutrition.org/labels/), [paper](https://arxiv.org/abs/1805.03677)). + + _The authors also provide a nice [summary of related work](https://huggingface.co/docs/hub/en/model-card-landscape-analysis), including [Datasheets for Datasets (Gebru, et al., 2018)](https://www.fatml.org/media/documents/datasheets_for_datasets.pdf) and The Dataset Nutrition Label ([label](https://datanutrition.org/labels/), [paper](https://arxiv.org/abs/1805.03677))._ - Wilkinson, M., Dumontier, M., Aalbersberg, I. _et al._ The FAIR Guiding Principles for scientific data management and stewardship. _Sci Data_ **3**, 160018 (2016). [10.1038/sdata.2016.18](https://doi.org/10.1038/sdata.2016.18) - Barker, M., Chue Hong, N.P., Katz, D.S. _et al._ Introducing the FAIR Principles for research software. _Sci Data_ **9**, 622 (2022). [10.1038/s41597-022-01710-x](https://doi.org/10.1038/s41597-022-01710-x) - Balk, M. A., Bradley, J., Maruf, M., Altintaş, B., Bakiş, Y., Bart, H. L. Jr, Breen, D., Florian, C. R., Greenberg, J., Karpatne, A., Karnani, K., Mabee, P., Pepper, J., Jebbia, D., Tabarin, T., Wang, X., & Lapp, H. (2024). A FAIR and modular image-based workflow for knowledge discovery in the emerging field of imageomics. _Methods in Ecology and Evolution_, 15, 1129–1145. 
[10.1111/2041-210X.14327](https://doi.org/10.1111/2041-210X.14327) From bb676d7eeee64fbd5054f57d9fcb2d4732e2bf70 Mon Sep 17 00:00:00 2001 From: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com> Date: Tue, 3 Jun 2025 16:41:54 -0400 Subject: [PATCH 33/33] Add more helpful references fix ref relative location improve some phrasing Co-authored-by: Hilmar Lapp --- docs/wiki-guide/Code-Checklist.md | 14 +++++++------- docs/wiki-guide/Data-Checklist.md | 2 +- docs/wiki-guide/FAIR-Guide.md | 6 +++--- docs/wiki-guide/Metadata-Checklist.md | 2 +- docs/wiki-guide/Model-Checklist.md | 2 +- 5 files changed, 13 insertions(+), 13 deletions(-) diff --git a/docs/wiki-guide/Code-Checklist.md b/docs/wiki-guide/Code-Checklist.md index 34cc323..b3c3569 100644 --- a/docs/wiki-guide/Code-Checklist.md +++ b/docs/wiki-guide/Code-Checklist.md @@ -5,7 +5,7 @@ This checklist provides an overview of essential and recommended elements to inc !!! tip "Pro tip" - Use the eye icon at the top of this page to access the source and copy the markdown for the checklist above into an issue on your GitHub [Repo](GitHub-Repo-Guide.md) or [Project](Guide-to-GitHub-Projects.md) so you can check the boxes as you add each element to your GitHub repository. + Use the eye icon at the top of this page to access the source and copy the markdown for the checklist below into an issue on your GitHub [Repo](GitHub-Repo-Guide.md) or [Project](Guide-to-GitHub-Projects.md) so you can check the boxes as you add each element to your GitHub repository. ## Required Files @@ -42,7 +42,7 @@ This checklist provides an overview of essential and recommended elements to inc - [ ] **Repository Structure**: Ensure the code repository follows a clear and logical directory structure. (See [Repo Guide](GitHub-Repo-Guide.md/#general-repository-structure).) - [ ] **Code Comments**: Include meaningful inline comments and function descriptions for clarity. 
-- [ ] **Random Seed Control**: Save random seeds to ensure reproducible results. +- [ ] **Random Seed Control**: Save seed(s) for random number generator(s) to ensure reproducible results. ## Security Considerations @@ -80,8 +80,8 @@ The [Repo Guide](GitHub-Repo-Guide.md/) provides general guidance on repository ### Documentation -- [ ] **API Documentation**: Generate API documentation (e.g., `MkDocs` for Python or wiki pages in the repo). -- [ ] **Docstrings**: Add comprehensive docstrings for all functions, classes, and modules. These can be incorporated to help generate documentation. +- [ ] **API Documentation**: Generate API documentation (e.g., [`MkDocs`](https://www.mkdocs.org) for Python or wiki pages in the repo). +- [ ] **Docstrings**: Add comprehensive docstrings for all functions, classes, and modules. These can be incorporated to help generate documentation. Note that generative AI tools with access to your code, such as GitHub Copilot, can be quite accurate in generating these, especially if you are using type annotations. - [ ] **Example Scripts**: Include example scripts for common use cases. - [ ] **Configuration Files**: Use `yaml`, `json`, or `ini` for configuration settings. @@ -90,14 +90,14 @@ The [Repo Guide](GitHub-Repo-Guide.md/) provides general guidance on repository - [ ] **Consistent Style**: Follow coding style guidelines (e.g., `PEP 8` for Python). - [ ] **Linting**: Ensure the code passes a linter (e.g., `Ruff` for Python). - [ ] **Logging**: Use logging instead of print statements for better debugging (e.g., `logging` in Python). -- [ ] **Error Handling**: Implement robust exception handling to avoid crashes. +- [ ] **Error Handling**: Implement robust exception handling to avoid crashes or bogus results from input outside of code expectations. ### Testing - [ ] **Unit Tests**: Write unit tests to validate core functionality. - [ ] **Integration Tests**: Ensure components work together correctly. 
-- [ ] **Test Coverage**: Check test coverage -- [ ] **Continuous Integration (CI)**: Set up CI/CD pipelines (e.g., GitHub Actions) for automated testing. +- [ ] **Test Coverage**: Check test coverage, e.g., using [Coverage](https://coverage.readthedocs.io/). +- [ ] **Continuous Integration (CI)**: Set up CI/CD pipelines (e.g., [GitHub Actions](https://docs.github.com/en/actions)) for automated testing. ### Code Distribution & Deployment diff --git a/docs/wiki-guide/Data-Checklist.md b/docs/wiki-guide/Data-Checklist.md index 4948f5a..15fb52f 100644 --- a/docs/wiki-guide/Data-Checklist.md +++ b/docs/wiki-guide/Data-Checklist.md @@ -4,7 +4,7 @@ Below is a checklist encompassing all sections of a dataset card. Review notes a !!! tip "Pro tip" - Use the eye icon at the top of this page to access the source and copy the markdown for the checklist above into an issue on your GitHub [Repo](GitHub-Repo-Guide.md) or [Project](Guide-to-GitHub-Projects.md) so you can check the boxes as you add each element to your [dataset card](HF_DatasetCard_Template_mkdocs.md). + Use the eye icon at the top of this page to access the source and copy the markdown for the checklist below into an issue on your GitHub [Repo](GitHub-Repo-Guide.md) or [Project](Guide-to-GitHub-Projects.md) so you can check the boxes as you add each element to your [dataset card](HF_DatasetCard_Template_mkdocs.md). ## General Information diff --git a/docs/wiki-guide/FAIR-Guide.md b/docs/wiki-guide/FAIR-Guide.md index 80b9161..36ef567 100644 --- a/docs/wiki-guide/FAIR-Guide.md +++ b/docs/wiki-guide/FAIR-Guide.md @@ -1,6 +1,6 @@ # FAIR Guide -This section provides information and resources to help ensure that digital products are ***F***indable ***A***ccessible ***I***nteroperable ***R***eusable and Reproducible[^1]. A general [Metadata Checklist](Metadata-Checklist.md) is provided to start one thinking about the type of information to be collected. 
Additionally, we include checklists for [code](Code-Checklist.md), [data](Data-Checklist.md), and [model](Model-Checklist.md) repositories. The code checklist focuses on the contents of a well-documented GitHub repository, while the data and model checklists cover the content of the [data](HF_DatasetCard_Template_mkdocs.md/) and [model](HF_ModelCard_Template_mkdocs.md/) card templates, respectively. +This section provides information and resources to help ensure that digital products are ***F***indable, ***A***ccessible, ***I***nteroperable, ***R***eusable, and Reproducible[^1]. A general [Metadata Checklist](Metadata-Checklist.md) is provided to stimulate thinking about the type of information to be collected. Additionally, we include checklists for [code](Code-Checklist.md), [data](Data-Checklist.md), and [model](Model-Checklist.md) repositories. The code checklist focuses on the contents of a well-documented GitHub repository, while the data and model checklists cover the content of the [data](HF_DatasetCard_Template_mkdocs.md/) and [model](HF_ModelCard_Template_mkdocs.md/) card templates, respectively. Each checklist was developed following the FAIR principles (as defined by the [Go-FAIR Initiative](https://www.go-fair.org/fair-principles/)). They provide a detailed outline of tasks and files to include to ensure alignment with the FAIR principles, and are complementary to the descriptions provided within the [GitHub](GitHub-Repo-Guide.md) and [Hugging Face](Hugging-Face-Repo-Guide.md) Guides presented on this site. As with the contents of these Guides, these checklists are based on a combination of existing guides (e.g., [The Turing Way](https://book.the-turing-way.org/), the [Model Card Guidebook](https://huggingface.co/docs/hub/en/model-card-annotated), and the [Dataset Card Creation Guide](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md)) and the experiences of our team. 
Following these checklists ensures digital products are aligned with FAIR principles and a best-effort toward reproducibility.[^2] @@ -11,7 +11,7 @@ Each checklist was developed following the FAIR principles (as defined by the [G The last topic in this section discusses different methods of [DOI Generation](DOI-Generation.md) for digital products (code, data, and models). It focuses on our selected method for dataset publication: [Hugging Face](https://huggingface.co/), with some guidance on using [Zenodo](https://zenodo.org/) to archive code (specifically, a GitHub repository). For more information about other common data publication venues—and to see the thought process behind our selection—see the [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf).[^3] Generating a DOI for a digital product is part of ensuring a globally unique and persistent identifier that can be used to reference and refer back to a digital product—an important component of FAIR and Reproducible principles. !!! info "References and Background" - If you want to learn more about FAIR and Reproducible principles, explore these resources that we used when creating this guide: + If you want to learn more about FAIR and Reproducible principles, explore these resources that we used when developing this guide: - [The Turing Way](https://book.the-turing-way.org/): an open-source, community data science handbook. 
It provides a strong foundation on the guiding principles for _this_ Guide, providing accessible explanations and overviews of topics from [reproducibility](https://book.the-turing-way.org/reproducible-research/reproducible-research), to [collaboration](https://book.the-turing-way.org/collaboration/collaboration) and [communication](https://book.the-turing-way.org/communication/communication), to [project design](https://book.the-turing-way.org/project-design/project-design), to [ethical research](https://book.the-turing-way.org/ethical-research/ethical-research). @@ -30,4 +30,4 @@ The last topic in this section discusses different methods of [DOI Generation](D [^1]: While "Reproducible" is not part of the original FAIR principles as defined by the [Go-FAIR Initiative](https://www.go-fair.org/fair-principles/), we include it here to emphasize the importance of computational reproducibility alongside data stewardship. This extension reflects emerging practice in data-intensive science, where code, models, and workflows must be reusable and verifiable to support robust scientific claims. It is not part of the formal FAIR acronym, but aligns with broader community goals for open and transparent research. [^2]: Full reproducibility is difficult to achieve; this [presentation](https://drive.google.com/file/d/1BFqZ00zMuyVHaD9A8PvzRDEg7aV0kp3W/view?usp=drive_link) by Odd Erik Gundersen provides a discussion of the varying degrees of reproducibility and useful references when considering the level of reproducibility achieved by a given project. -[^3]: The [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf) was created in May 2023 when we were deciding Institute archive recommendations, so it does not include information about newer features such as [Hugging Face's dataset viewer](https://huggingface.co/docs/hub/en/datasets-viewer), which greatly simplifies previewing datasets for downstream users. 
+[^3]: The [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf) was created in May 2023 as part of developing archive recommendations for the Institute, so it does not include information about newer features such as [Hugging Face's dataset viewer](https://huggingface.co/docs/hub/en/datasets-viewer), which greatly simplifies previewing datasets for downstream users. diff --git a/docs/wiki-guide/Metadata-Checklist.md b/docs/wiki-guide/Metadata-Checklist.md index e82bfab..96a5dc7 100644 --- a/docs/wiki-guide/Metadata-Checklist.md +++ b/docs/wiki-guide/Metadata-Checklist.md @@ -5,7 +5,7 @@ When collecting or compiling new data, there are generally questions one is _try To improve both the _**Findability**_ and _**Reusability**_ of your data (ensuring [FAIR principles](Glossary-for-Imageomics.md#fair-data-principles)) for yourself and others, be sure to note down the following information. !!! note "This is not an exhaustive list." - Be sure to include any other information that may be important to your particular project or field. See, for instance, the [Code](Code-Checklist.md), [Data](Data-Checklist.md), and [Model](Model-Checklist.md) Checklists included in this section. + Be sure to include any other information that may be important to your particular project or field. For instance, see the [Code](Code-Checklist.md), [Data](Data-Checklist.md), and [Model](Model-Checklist.md) Checklists included in this section. ## Checklist for Metadata to Record diff --git a/docs/wiki-guide/Model-Checklist.md b/docs/wiki-guide/Model-Checklist.md index a6c7530..adc6f27 100644 --- a/docs/wiki-guide/Model-Checklist.md +++ b/docs/wiki-guide/Model-Checklist.md @@ -4,7 +4,7 @@ Below is a checklist encompassing all sections of a model card. Review notes and !!! 
tip "Pro tip" - Use the eye icon at the top of this page to access the source and copy the markdown for the checklist above into an issue on your GitHub [Repo](GitHub-Repo-Guide.md) or [Project](Guide-to-GitHub-Projects.md) so you can check the boxes as you add each element to your [model card](HF_ModelCard_Template_mkdocs.md). + Use the eye icon at the top of this page to access the source and copy the markdown for the checklist below into an issue on your GitHub [Repo](GitHub-Repo-Guide.md) or [Project](Guide-to-GitHub-Projects.md) so you can check the boxes as you add each element to your [model card](HF_ModelCard_Template_mkdocs.md). ## General Information