32 changes: 15 additions & 17 deletions .circleci/config.yml
@@ -8,7 +8,7 @@ jobs:
docker:
# specify the version you desire here
# use `-browsers` prefix for selenium tests, e.g. `3.6.1-browsers`
- image: cimg/python:3.8.0
- image: continuumio/miniconda3

# Specify service dependencies here if necessary
# CircleCI maintains a library of pre-built images
@@ -19,39 +19,37 @@ jobs:

steps:
- checkout

- run:
name: Set up Anaconda
name: Set up Conda
command: |
wget -q http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh;
chmod +x ~/miniconda.sh;
~/miniconda.sh -b -p ~/miniconda;
export PATH=~/miniconda/bin:$PATH
echo "export PATH=~/miniconda/bin:$PATH" >> $BASH_ENV;
conda update --yes --quiet conda;
conda init bash
sed -ne '/>>> conda initialize/,/<<< conda initialize/p' ~/.bashrc >> $BASH_ENV

conda update --yes --quiet conda;
export CONDA_EXE=/opt/conda/bin/conda
sed -ne '/>>> conda initialize/,/<<< conda initialize/p' ~/.bashrc >> $BASH_ENV

- run:
name: Build cookiecutter environment and test-env project
command: |
conda create -n cookiecutter --yes python=3.8
conda create -n cookiecutter --yes python=3.8 make
conda activate cookiecutter
pip install cookiecutter
pip install ruamel.yaml
mkdir /home/circleci/.cookiecutter_replay
cp circleci-cookiecutter-easydata.json /home/circleci/.cookiecutter_replay/cookiecutter-easydata.json
mkdir -p /root/repo/.cookiecutter_replay
cp circleci-cookiecutter-easydata.json /root/repo/.cookiecutter_replay/cookiecutter-easydata.json
pwd
which make
cookiecutter --config-file .cookiecutter-easydata-test-circleci.yml . -f --no-input
conda deactivate


- run:
name: Create test-env environment and contrive to always use it
command: |
conda activate cookiecutter
cd test-env
export CONDA_EXE=/home/circleci/miniconda/bin/conda
export CONDA_EXE=/opt/conda/bin/conda
make create_environment
conda activate test-env
conda install -c anaconda make
touch environment.yml
make update_environment
echo "conda activate test-env" >> $BASH_ENV;
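The new `Set up Conda` step leans on two CircleCI behaviors: every `run` step starts in a fresh shell, and `$BASH_ENV` is sourced at the start of each step. The `sed` range address copies just the conda-managed block from `~/.bashrc` into `$BASH_ENV`, so `conda activate` keeps working in later steps. A standalone sketch of that extraction (the sample `.bashrc` contents below are fabricated for illustration):

```
# Fabricated stand-in for ~/.bashrc containing a conda-managed block
cat > /tmp/demo_bashrc <<'EOF'
alias ll='ls -l'
# >>> conda initialize >>>
export PATH="/opt/conda/bin:$PATH"
# <<< conda initialize <<<
echo "unrelated line"
EOF

# Print only the lines between the conda initialize markers (inclusive);
# in CI these same lines are appended to "$BASH_ENV" instead of printed.
sed -ne '/>>> conda initialize/,/<<< conda initialize/p' /tmp/demo_bashrc
```

The `/start/,/end/p` address range is plain POSIX sed, so this works in any of the images above.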
2 changes: 1 addition & 1 deletion docs/00-xyz-sample-notebook.ipynb
@@ -150,7 +150,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(ds.DESCR)"
"print(ds.README)"
]
},
{
14 changes: 7 additions & 7 deletions docs/Add-csv-template.ipynb
@@ -83,7 +83,7 @@
"* `csv_path`: The desired path to your .csv file (in this case `epidemiology.csv`) relative to paths['raw_data_path']\n",
"* `download_message`: The message to display to indicate to the user how to manually download your .csv file.\n",
"* `license_str`: Information on the license for the dataset\n",
"* `descr_str`: Information on the dataset itself"
"* `readme_str`: Information on the dataset itself"
]
},
{
@@ -123,7 +123,7 @@
"metadata": {},
"outputs": [],
"source": [
"descr_str = \"\"\"\n",
"readme_str = \"\"\"\n",
"The epidemiology table from Google's [COVID-19 Open-Data dataset](https://github.com/GoogleCloudPlatform/covid-19-open-data). \n",
"\n",
"The full dataset contains datasets of daily time-series data related to COVID-19 for over 20,000 distinct locations around the world. The data is at the spatial resolution of states/provinces for most regions and at county/municipality resolution for many countries such as Argentina, Brazil, Chile, Colombia, Czech Republic, Mexico, Netherlands, Peru, United Kingdom, and USA. All regions are assigned a unique location key, which resolves discrepancies between ISO / NUTS / FIPS codes, etc. The different aggregation levels are:\n",
@@ -170,7 +170,7 @@
" csv_path=csv_path,\n",
" download_message=download_message,\n",
" license_str=license_str,\n",
" descr_str=descr_str,\n",
" readme_str=readme_str,\n",
" overwrite_catalog=True)"
]
},
@@ -206,9 +206,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"By default, the workflow helper function also created a `covid-19-epidemiology_raw` dataset that has an empty `ds.data`, but keeps a record of the location of the final `epidemiology.csv` file in `ds.EXTRA`.\n",
"By default, the workflow helper function also created a `covid-19-epidemiology_raw` dataset that has an empty `ds.data`, but keeps a record of the location of the final `epidemiology.csv` file in `ds.FILESET`.\n",
"\n",
"The `.EXTRA` functionality is covered in other documentation."
"The `.FILESET` functionality is covered in other documentation."
]
},
{
@@ -236,7 +236,7 @@
"metadata": {},
"outputs": [],
"source": [
"ds_raw.EXTRA"
"ds_raw.FILESET"
]
},
{
@@ -246,7 +246,7 @@
"outputs": [],
"source": [
"# fq path to epidemiology.csv file\n",
"ds_raw.extra_file('epidemiology.csv')"
"ds_raw.fileset_file('epidemiology.csv')"
]
},
{
10 changes: 5 additions & 5 deletions docs/Add-derived-dataset.ipynb
@@ -85,7 +85,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(ds.DESCR)"
"print(ds.README)"
]
},
{
@@ -219,7 +219,7 @@
" source_dataset_name\n",
" dataset_name\n",
" data_function\n",
" added_descr_txt\n",
" added_readme_txt\n",
"\n",
"We'll want our `data_function` to be defined in the project module (in this case `src`) for reproducibility reasons (which we've already done with `subselect_by_key` above)."
]
@@ -250,7 +250,7 @@
"metadata": {},
"outputs": [],
"source": [
"added_descr_txt = f\"\"\"The dataset {dataset_name} is the subselection \\\n",
"added_readme_txt = f\"\"\"The dataset {dataset_name} is the subselection \\\n",
"to the {key} dataset.\"\"\""
]
},
@@ -281,7 +281,7 @@
" source_dataset_name=source_dataset_name,\n",
" dataset_name=dataset_name,\n",
" data_function=data_function,\n",
" added_descr_txt=added_descr_txt,\n",
" added_readme_txt=added_readme_txt,\n",
" overwrite_catalog=True)"
]
},
@@ -318,7 +318,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(ds.DESCR)"
"print(ds.README)"
]
},
{
12 changes: 6 additions & 6 deletions docs/New-Dataset-Template.ipynb
@@ -167,7 +167,7 @@
"metadata": {},
"source": [
"### Create a process function\n",
"By default, we recommend that you use the `process_extra_files` functionality and then use a transformer function to create a derived dataset, but you can optionally create your own."
"By default, we recommend that you use the `process_fileset_files` functionality and then use a transformer function to create a derived dataset, but you can optionally create your own."
]
},
{
@@ -176,11 +176,11 @@
"metadata": {},
"outputs": [],
"source": [
"from src.data.extra import process_extra_files\n",
"process_function = process_extra_files\n",
"from src.data.fileset import process_fileset_files\n",
"process_function = process_fileset_files\n",
"process_function_kwargs = {'file_glob':'*.csv',\n",
" 'do_copy': True,\n",
" 'extra_dir': ds_name+'.extra',\n",
" 'fileset_dir': ds_name+'.fileset',\n",
" 'extract_dir': ds_name}"
]
},
@@ -355,7 +355,7 @@
"metadata": {},
"outputs": [],
"source": [
"ds.EXTRA"
"ds.FILESET"
]
},
{
@@ -364,7 +364,7 @@
"metadata": {},
"outputs": [],
"source": [
"ds.extra_file('epidemiology.csv')"
"ds.fileset_file('epidemiology.csv')"
]
},
{
4 changes: 2 additions & 2 deletions docs/New-Edge-Template.ipynb
@@ -88,7 +88,7 @@
"metadata": {},
"outputs": [],
"source": [
"source_ds.EXTRA"
"source_ds.FILESET"
]
},
{
@@ -178,7 +178,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(ds.DESCR)"
"print(ds.README)"
]
},
{
3 changes: 3 additions & 0 deletions docs/test_docs.py
@@ -9,6 +9,8 @@
import requests

from src import paths
from src.log import logger


CCDS_ROOT = Path(__file__).parents[1].resolve()
DOCS_DIR = CCDS_ROOT / "docs"
@@ -35,6 +37,7 @@ def test_notebook_csv(self):
csv_url = "https://storage.googleapis.com/covid19-open-data/v2/epidemiology.csv"
csv_dest = paths['raw_data_path'] / "epidemiology.csv"
if not csv_dest.exists():
logger.debug("Downloading epidemiology.csv")
csv_file = requests.get(csv_url)
with open(csv_dest, 'wb') as f:
f.write(csv_file.content)
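The guard added to `test_docs.py` above — fetch `epidemiology.csv` only when the destination file is missing, logging when a download actually happens — is a standard idempotent-fetch pattern. A minimal shell sketch of the same guard, with a local file standing in for the remote CSV (all paths and contents here are fabricated):

```
# Fabricated source standing in for the remote epidemiology.csv
mkdir -p /tmp/easydata_demo
printf 'date,cases\n2020-01-01,1\n' > /tmp/easydata_demo/remote.csv

dest=/tmp/easydata_demo/epidemiology.csv
# Fetch only when the cached copy is absent, mirroring
# the `if not csv_dest.exists()` check in test_docs.py
if [ ! -f "$dest" ]; then
    echo "Downloading epidemiology.csv"   # stands in for logger.debug(...)
    cp /tmp/easydata_demo/remote.csv "$dest"
fi
head -n 1 "$dest"
```

Rerunning the block skips the copy entirely, which is exactly what keeps the notebook tests fast across repeated CI runs.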
11 changes: 2 additions & 9 deletions {{ cookiecutter.repo_name }}/.circleci/config.yml
@@ -8,7 +8,8 @@ jobs:
docker:
# specify the version you desire here
# use `-browsers` prefix for selenium tests, e.g. `3.6.1-browsers`
- image: circleci/python:3.7.0
- image: continuumio/miniconda3


# Specify service dependencies here if necessary
# CircleCI maintains a library of pre-built images
@@ -20,14 +21,6 @@ jobs:
steps:
- checkout

- run:
name: Set up Anaconda
command: |
wget -q http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh;
chmod +x ~/miniconda.sh;
~/miniconda.sh -b -p ~/miniconda;
echo "export PATH=~/miniconda/bin:$PATH" >> $BASH_ENV;

- run:
name: Create environment and contrive to always use it
command: |
@@ -4,13 +4,19 @@ The `{{ cookiecutter.repo_name }}` repo is set up with template code to make man

If you haven't yet, configure your conda environment.

**WARNING**: If you have conda-forge listed as a channel in your `.condarc` (or any other channels other than defaults), you may experience great difficulty generating reproducible conda environments.

We recommend you remove conda-forge (and all other non-default channels) from your `.condarc` file and [set your channel priority to 'strict'](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-channels.html). You can still use conda-forge (or any other conda channel), just specify it explicitly in your `environment.yml` by prefixing your package name with `channel-name::`; e.g.
```
- wheel # install from the default (anaconda) channel
- pytorch::pytorch # install this from the `pytorch` channel
- conda-forge::tokenizers # install this from conda-forge
```

## Configuring your python environment
Easydata uses conda to manage python packages installed by both conda **and pip**.

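Because conda manages the pip-installed packages too, both kinds of dependency live in the same `environment.yml`. A hypothetical minimal file showing the pip subsection (the environment name and package list are illustrative, not part of the template):

```
name: example-env
channels:
  - defaults
dependencies:
  - python=3.8
  - pip
  - wheel            # conda-managed, from the defaults channel
  - pip:
      - cookiecutter  # pip-managed, recorded in the same file
      - ruamel.yaml
```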
### Adjust your `.condarc`
**WARNING FOR EXISTING CONDA USERS**: If you have `conda-forge` listed as a channel in your `.condarc` (or any other channels other than `default`), **remove them**. These channels should be specified in `environment.yml` instead.

We also recommend [setting your channel priority to 'strict'](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-channels.html) to reduce package incompatibility problems. This will be the default in conda 5.0, but in order to assure reproducibility, we need to use this behavior now.

```
conda config --set channel_priority strict
@@ -26,18 +32,30 @@ conda config --prepend channels defaults
conda config --prepend envs_dirs ~/.conda/envs # Store environments in local dir for JupyterHub
```

### Fix the CONDA_EXE path
* Make note of the path to your conda binary:
#### Locating the `conda` binary
Ensure the Makefile can find your conda binary, either by setting the `CONDA_EXE` environment variable, or by modifying `Makefile.include` directly.

First, check if `CONDA_EXE` is already set
```
$ which conda
>>> export | grep CONDA_EXE
CONDA_EXE=/Users/your_username/miniconda3/bin/conda
```

If `CONDA_EXE` is not set, you will need to set it manually in `Makefile.include`; i.e.

* Make note of the path to your conda binary. It should be in the `bin` subdirectory of your Anaconda (or miniconda) installation directory:
```
>>> which conda # this will only work if conda is in your PATH, otherwise, verify manually
~/miniconda3/bin/conda
```
* ensure your `CONDA_EXE` environment variable is set correctly in `Makefile.include`
* ensure your `CONDA_EXE` environment variable is set to this value; i.e.
```
export CONDA_EXE=~/miniconda3/bin/conda
>>> export CONDA_EXE=~/miniconda3/bin/conda
```
or edit `Makefile.include` directly.

### Create the conda environment
* Create and switch to the virtual environment:
Create and switch to the virtual environment:
```
cd {{ cookiecutter.repo_name }}
make create_environment