From b569c3ef18c75580654c103c769303539de24419 Mon Sep 17 00:00:00 2001
From: Ana Trisovic
Date: Tue, 23 Mar 2021 17:51:23 -0400
Subject: [PATCH 1/6] add guidelines on research code in User Guide

---
 .../source/user/dataset-management.rst | 37 +++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/doc/sphinx-guides/source/user/dataset-management.rst b/doc/sphinx-guides/source/user/dataset-management.rst
index 1d5f1344c6c..9584db8a2f5 100755
--- a/doc/sphinx-guides/source/user/dataset-management.rst
+++ b/doc/sphinx-guides/source/user/dataset-management.rst
@@ -179,6 +179,43 @@ Additional download options available for tabular data (found in the same drop-d
 - Data File Citation (currently in either RIS, EndNote XML, or BibTeX format);
 - All of the above, as a zipped bundle.
 
+Research Code
+-------------
+
+Code files - such as Stata, R, MATLAB, or Python files or scripts - have become a frequent addition to the research data deposited in Dataverse. Research code is typically developed by few researchers with the primary goal of obtaining results, while its reproducibility and reuse aspects are sometimes overlooked. Because several independent studies reported issues trying to rerun research code, please consider the following guidelines if your dataset contains code.
+
+The following are general guidelines applicable to all programming languages.
+
+- Create a README text file in the top-level directory to introduce your project. It should answer questions that reviewers or reusers would likely have, such as how to install and use your code.
+- If possible, use free and open-source file formats and software to make your research outputs more reusable and accessible.
+- Make sure your code uses relative file paths instead of absolute (or full) file paths, as they can cause an execution error.
+- Consider testing your code in a clean environment before sharing it, as it could help you identify missing files or dependencies.
+- Consider providing notes (in the README) on the expected code outputs or adding tests in the code, which would ensure that its functionality is intact.
+
+General Guidelines
+******************
+
+Capturing code dependencies will help other researchers recreate the necessary runtime environment. Without it, your code will not be able to run correctly (or at all).
+
+Many programming languages enable an automatic dependency capture. Here are a few examples:
+
+- If you are using the conda package manager, you can export your environment with the command ``conda env export > environment.yml``. For more information, see the `official documentation `_.
+- Python has multiple conventions for capturing its dependencies, but probably the best-known one is with the ``requirements.txt`` file, which is created using the command ``pip freeze > requirements.txt``. Managing environments with ``pip`` is explained in the `official documentation `_.
+- If you are using the R programming language, create a file called ``install.R``\ , and list all library dependencies that your code requires. This file should be executable in R to set up the environment. See also other strategies for capturing the environment proposed by RStudio in the `official documentation `_.
+- In case you are using multiple programming languages or different versions of the same language, it would be best to use a containerization technology such as Docker. Create a Dockerfile that builds your environment (make sure to specify dependency versions) and deposit it with the rest of the files. Have a look at the `official Docker guide `_ for more information.
+
+We recommend taking a look at `Jupyter Binder's documentation `_ for other configuration files that capture code dependencies.
+
+Code Automation
+***************
+
+Finally, automating your code can be immensely helpful to the code and research reviewers. Here are a few options on how to automate your code.
+
+- A simple way to automate your code is using a bash script or Make. `Here is a detailed guide `_ on how to use the Make build automation tool.
+- Consider using research workflow tools to automate your analysis. A popular workflow tool is called Common Workflow Language, and you can find more information about it `here `_.
+
+**Note:** Capturing code dependencies and automating your code will create new files in your directory. Make sure to include them when depositing your dataset.
+
 Astronomy (FITS)
 ----------------
 

From f0c12aefb8eafd7ae2820ea321bafb3ff9cfa77a Mon Sep 17 00:00:00 2001
From: Julian Gautier
Date: Wed, 24 Mar 2021 14:26:30 -0400
Subject: [PATCH 2/6] Minor style changes, mostly about displaying links

---
 doc/sphinx-guides/source/user/dataset-management.rst | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/doc/sphinx-guides/source/user/dataset-management.rst b/doc/sphinx-guides/source/user/dataset-management.rst
index 9584db8a2f5..3ef00631d7a 100755
--- a/doc/sphinx-guides/source/user/dataset-management.rst
+++ b/doc/sphinx-guides/source/user/dataset-management.rst
@@ -182,7 +182,7 @@ Additional download options available for tabular data (found in the same drop-d
 Research Code
 -------------
 
-Code files - such as Stata, R, MATLAB, or Python files or scripts - have become a frequent addition to the research data deposited in Dataverse. Research code is typically developed by few researchers with the primary goal of obtaining results, while its reproducibility and reuse aspects are sometimes overlooked. Because several independent studies reported issues trying to rerun research code, please consider the following guidelines if your dataset contains code.
+Code files - such as Stata, R, MATLAB, or Python files or scripts - have become a frequent addition to the research data deposited in Dataverse repositories. Research code is typically developed by few researchers with the primary goal of obtaining results, while its reproducibility and reuse aspects are sometimes overlooked. Because several independent studies reported issues trying to rerun research code, please consider the following guidelines if your dataset contains code.
 
 The following are general guidelines applicable to all programming languages.
 
@@ -202,7 +202,7 @@ Many programming languages enable an automatic dependency capture. Here are a few
 - If you are using the conda package manager, you can export your environment with the command ``conda env export > environment.yml``. For more information, see the `official documentation `_.
 - Python has multiple conventions for capturing its dependencies, but probably the best-known one is with the ``requirements.txt`` file, which is created using the command ``pip freeze > requirements.txt``. Managing environments with ``pip`` is explained in the `official documentation `_.
 - If you are using the R programming language, create a file called ``install.R``\ , and list all library dependencies that your code requires. This file should be executable in R to set up the environment. See also other strategies for capturing the environment proposed by RStudio in the `official documentation `_.
-- In case you are using multiple programming languages or different versions of the same language, it would be best to use a containerization technology such as Docker. Create a Dockerfile that builds your environment (make sure to specify dependency versions) and deposit it with the rest of the files. Have a look at the `official Docker guide `_ for more information.
+- In case you are using multiple programming languages or different versions of the same language, it would be best to use a containerization technology such as Docker. Create a Dockerfile that builds your environment, making sure to specify dependency versions, and deposit it with the rest of the files. Have a look at the `official Docker guide `_ for more information.
 
 We recommend taking a look at `Jupyter Binder's documentation `_ for other configuration files that capture code dependencies.
 
@@ -211,8 +211,8 @@ Code Automation
 
 Finally, automating your code can be immensely helpful to the code and research reviewers. Here are a few options on how to automate your code.
 
-- A simple way to automate your code is using a bash script or Make. `Here is a detailed guide `_ on how to use the Make build automation tool.
-- Consider using research workflow tools to automate your analysis. A popular workflow tool is called Common Workflow Language, and you can find more information about it `here `_.
+- A simple way to automate your code is using a bash script or Make. `The Turing Way Community has `a detailed guide `_ on how to use the Make build automation tool.
+- Consider using research workflow tools to automate your analysis. A popular workflow tool is called Common Workflow Language, and you can find more information about it `from the Common Workflow Language User Guide `_.
 
 **Note:** Capturing code dependencies and automating your code will create new files in your directory. Make sure to include them when depositing your dataset.
 

From 4cb739422629b0b84585f2e40d997ee5e0a00ece Mon Sep 17 00:00:00 2001
From: Ana Trisovic
Date: Mon, 29 Mar 2021 11:41:22 -0400
Subject: [PATCH 3/6] tiny edits

---
 doc/sphinx-guides/source/user/dataset-management.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/sphinx-guides/source/user/dataset-management.rst b/doc/sphinx-guides/source/user/dataset-management.rst
index 3ef00631d7a..94add08229c 100755
--- a/doc/sphinx-guides/source/user/dataset-management.rst
+++ b/doc/sphinx-guides/source/user/dataset-management.rst
@@ -201,7 +201,7 @@ Many programming languages enable an automatic dependency capture. Here are a few
 
 - If you are using the conda package manager, you can export your environment with the command ``conda env export > environment.yml``. For more information, see the `official documentation `_.
 - Python has multiple conventions for capturing its dependencies, but probably the best-known one is with the ``requirements.txt`` file, which is created using the command ``pip freeze > requirements.txt``. Managing environments with ``pip`` is explained in the `official documentation `_.
-- If you are using the R programming language, create a file called ``install.R``\ , and list all library dependencies that your code requires. This file should be executable in R to set up the environment. See also other strategies for capturing the environment proposed by RStudio in the `official documentation `_.
+- If you are using the R programming language, create a file called ``install.R``, and list all library dependencies that your code requires. This file should be executable in R to set up the environment. See also other strategies for capturing the environment proposed by RStudio in the `official documentation `_.
 - In case you are using multiple programming languages or different versions of the same language, it would be best to use a containerization technology such as Docker. Create a Dockerfile that builds your environment, making sure to specify dependency versions, and deposit it with the rest of the files. Have a look at the `official Docker guide `_ for more information.
 
 We recommend taking a look at `Jupyter Binder's documentation `_ for other configuration files that capture code dependencies.
@@ -211,7 +211,7 @@ Code Automation
 
 Finally, automating your code can be immensely helpful to the code and research reviewers. Here are a few options on how to automate your code.
 
-- A simple way to automate your code is using a bash script or Make. `The Turing Way Community has `a detailed guide `_ on how to use the Make build automation tool.
+- A simple way to automate your code is using a bash script or Make. The Turing Way Community has `a detailed guide `_ on how to use the Make build automation tool.
 - Consider using research workflow tools to automate your analysis. A popular workflow tool is called Common Workflow Language, and you can find more information about it `from the Common Workflow Language User Guide `_.
 
 **Note:** Capturing code dependencies and automating your code will create new files in your directory. Make sure to include them when depositing your dataset.

From 739fb2b18181504afa23e79803589b25616a630f Mon Sep 17 00:00:00 2001
From: Ana Trisovic
Date: Thu, 8 Apr 2021 12:20:23 -0400
Subject: [PATCH 4/6] add more guidelines and info on repro platforms

---
 .../source/user/dataset-management.rst | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/doc/sphinx-guides/source/user/dataset-management.rst b/doc/sphinx-guides/source/user/dataset-management.rst
index 94add08229c..99348948a04 100755
--- a/doc/sphinx-guides/source/user/dataset-management.rst
+++ b/doc/sphinx-guides/source/user/dataset-management.rst
@@ -186,25 +186,24 @@ Code files - such as Stata, R, MATLAB, or Python files or scripts - have become
 
 The following are general guidelines applicable to all programming languages.
 
-- Create a README text file in the top-level directory to introduce your project. It should answer questions that reviewers or reusers would likely have, such as how to install and use your code.
+- Create a README text file in the top-level directory to introduce your project. It should answer questions that reviewers or reusers would likely have, such as how to install and use your code. If in doubt, consider using existing templates such as `a README template for social science replication packages `_.
+- Depending on the number of files in your dataset, consider having data and code in distinct directories, each of which should have some documentation like a README.
+- Consider adding a license to your source code. You can do that by creating a LICENSE file in the dataset or by specifying the license(s) in the README or directly in the code. Find out more about code licenses at `the Open Source Initiative webpage `_.
 - If possible, use free and open-source file formats and software to make your research outputs more reusable and accessible.
-- Make sure your code uses relative file paths instead of absolute (or full) file paths, as they can cause an execution error.
-- Consider testing your code in a clean environment before sharing it, as it could help you identify missing files or dependencies.
+- Consider testing your code in a clean environment before sharing it, as it could help you identify missing files or other errors. For example, your code should use relative file paths instead of absolute (or full) file paths, as they can cause an execution error.
 - Consider providing notes (in the README) on the expected code outputs or adding tests in the code, which would ensure that its functionality is intact.
 
 General Guidelines
 ******************
 
 Capturing code dependencies will help other researchers recreate the necessary runtime environment. Without it, your code will not be able to run correctly (or at all).
-
-Many programming languages enable an automatic dependency capture. Here are a few examples:
+One option is to use platforms such as `Whole Tale `_, `Jupyter Binder `_ or `Renku `_, which facilitate research reproducibility. Have a look at `Dataverse Integrations `_ for more information.
+Another option is to use an automatic code dependency capture, which is often supported through the programming language. Here are a few examples:
 
 - If you are using the conda package manager, you can export your environment with the command ``conda env export > environment.yml``. For more information, see the `official documentation `_.
 - Python has multiple conventions for capturing its dependencies, but probably the best-known one is with the ``requirements.txt`` file, which is created using the command ``pip freeze > requirements.txt``. Managing environments with ``pip`` is explained in the `official documentation `_.
 - If you are using the R programming language, create a file called ``install.R``, and list all library dependencies that your code requires. This file should be executable in R to set up the environment. See also other strategies for capturing the environment proposed by RStudio in the `official documentation `_.
-- In case you are using multiple programming languages or different versions of the same language, it would be best to use a containerization technology such as Docker. Create a Dockerfile that builds your environment, making sure to specify dependency versions, and deposit it with the rest of the files. Have a look at the `official Docker guide `_ for more information.
-
-We recommend taking a look at `Jupyter Binder's documentation `_ for other configuration files that capture code dependencies.
+- In case you are using multiple programming languages or different versions of the same language, consider using a containerization technology such as Docker. You can create a Dockerfile that builds your environment and deposit it within your dataset (see `the official documentation `_). It is worth noting that creating a reliable Dockerfile may be tricky. If you choose this route, make sure to specify dependency versions and check out `Docker's best practices `_.
 
 Code Automation
 ***************

From 8f3cbd66e08507a220b425edf58e6ce632345179 Mon Sep 17 00:00:00 2001
From: Kevin Condon
Date: Thu, 29 Apr 2021 11:20:22 -0400
Subject: [PATCH 5/6] Update dataset-management.rst

Removing lines 196-197, 208-209 to resolve warning:
Warning, treated as error:
/home/worker/workspace/guides.dataverse.org/doc/sphinx-guides/source/user/dataset-management.rst:200: SEVERE: Title level inconsistent:
---
 doc/sphinx-guides/source/user/dataset-management.rst | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/doc/sphinx-guides/source/user/dataset-management.rst b/doc/sphinx-guides/source/user/dataset-management.rst
index 99348948a04..f4d21eecb5c 100755
--- a/doc/sphinx-guides/source/user/dataset-management.rst
+++ b/doc/sphinx-guides/source/user/dataset-management.rst
@@ -193,8 +193,7 @@ The following are general guidelines applicable to all programming languages.
 - Consider testing your code in a clean environment before sharing it, as it could help you identify missing files or other errors. For example, your code should use relative file paths instead of absolute (or full) file paths, as they can cause an execution error.
 - Consider providing notes (in the README) on the expected code outputs or adding tests in the code, which would ensure that its functionality is intact.
 
-General Guidelines
-******************
+
 
 Capturing code dependencies will help other researchers recreate the necessary runtime environment. Without it, your code will not be able to run correctly (or at all).
 One option is to use platforms such as `Whole Tale `_, `Jupyter Binder `_ or `Renku `_, which facilitate research reproducibility. Have a look at `Dataverse Integrations `_ for more information.
@@ -205,8 +204,7 @@ Another option is to use an automatic code dependency capture, which is often su
 - If you are using the R programming language, create a file called ``install.R``, and list all library dependencies that your code requires. This file should be executable in R to set up the environment. See also other strategies for capturing the environment proposed by RStudio in the `official documentation `_.
 - In case you are using multiple programming languages or different versions of the same language, consider using a containerization technology such as Docker. You can create a Dockerfile that builds your environment and deposit it within your dataset (see `the official documentation `_). It is worth noting that creating a reliable Dockerfile may be tricky. If you choose this route, make sure to specify dependency versions and check out `Docker's best practices `_.
 
-Code Automation
-***************
+
 
 Finally, automating your code can be immensely helpful to the code and research reviewers. Here are a few options on how to automate your code.
 

From ca0a01bfd3beb44ef9fc7a1267867d7a78894e43 Mon Sep 17 00:00:00 2001
From: Ana Trisovic
Date: Thu, 29 Apr 2021 12:01:26 -0400
Subject: [PATCH 6/6] tiny fix in URLs

---
 doc/sphinx-guides/source/user/dataset-management.rst | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/doc/sphinx-guides/source/user/dataset-management.rst b/doc/sphinx-guides/source/user/dataset-management.rst
index f4d21eecb5c..8b7536dd5f6 100755
--- a/doc/sphinx-guides/source/user/dataset-management.rst
+++ b/doc/sphinx-guides/source/user/dataset-management.rst
@@ -199,10 +199,10 @@ Capturing code dependencies will help other researchers recreate the necessary r
 One option is to use platforms such as `Whole Tale `_, `Jupyter Binder `_ or `Renku `_, which facilitate research reproducibility. Have a look at `Dataverse Integrations `_ for more information.
 Another option is to use an automatic code dependency capture, which is often supported through the programming language. Here are a few examples:
 
-- If you are using the conda package manager, you can export your environment with the command ``conda env export > environment.yml``. For more information, see the `official documentation `_.
-- Python has multiple conventions for capturing its dependencies, but probably the best-known one is with the ``requirements.txt`` file, which is created using the command ``pip freeze > requirements.txt``. Managing environments with ``pip`` is explained in the `official documentation `_.
-- If you are using the R programming language, create a file called ``install.R``, and list all library dependencies that your code requires. This file should be executable in R to set up the environment. See also other strategies for capturing the environment proposed by RStudio in the `official documentation `_.
-- In case you are using multiple programming languages or different versions of the same language, consider using a containerization technology such as Docker. You can create a Dockerfile that builds your environment and deposit it within your dataset (see `the official documentation `_). It is worth noting that creating a reliable Dockerfile may be tricky. If you choose this route, make sure to specify dependency versions and check out `Docker's best practices `_.
+- If you are using the conda package manager, you can export your environment with the command ``conda env export > environment.yml``. For more information, see the `official documentation `__.
+- Python has multiple conventions for capturing its dependencies, but probably the best-known one is with the ``requirements.txt`` file, which is created using the command ``pip freeze > requirements.txt``. Managing environments with ``pip`` is explained in the `official documentation `__.
+- If you are using the R programming language, create a file called ``install.R``, and list all library dependencies that your code requires. This file should be executable in R to set up the environment. See also other strategies for capturing the environment proposed by RStudio in the `official documentation `__.
+- In case you are using multiple programming languages or different versions of the same language, consider using a containerization technology such as Docker. You can create a Dockerfile that builds your environment and deposit it within your dataset (see `the official documentation `__). It is worth noting that creating a reliable Dockerfile may be tricky. If you choose this route, make sure to specify dependency versions and check out `Docker's best practices `_.
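
Reviewer note: the guideline in this series about preferring relative over absolute file paths could be checked before depositing code. The following is only a rough sketch, not part of the patches: the helper name ``find_absolute_paths``, the pattern ``ABSOLUTE_PATH`` and the list of directory prefixes are all hypothetical, and the regular expression is a heuristic that will miss some absolute paths.

```python
import re

# Hypothetical heuristic: a quoted string that starts with a common
# Unix home/system prefix or a Windows drive letter is treated as an
# absolute path. The prefix list is an assumption, not an exhaustive rule.
ABSOLUTE_PATH = re.compile(
    r"""["'](/(?:home|Users|mnt|tmp)/[^"']*|[A-Za-z]:[\\/][^"']*)["']"""
)


def find_absolute_paths(source: str) -> list[tuple[int, str]]:
    """Return (line number, path) pairs for quoted absolute paths in source."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in ABSOLUTE_PATH.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits


if __name__ == "__main__":
    example = 'a = open("/home/ana/data.csv")\nb = open("data/input.csv")\n'
    for lineno, path in find_absolute_paths(example):
        print(f"line {lineno}: absolute path {path}")
```

Running such a check over each script in a dataset, and replacing any reported path with one relative to the top-level directory, is one way to act on the guideline before testing in a clean environment.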