iterative · shcheklein · Jun 19, 2020 · May 10, 2020 · Jun 11, 2020 · Jun 11, 2020
diff --git a/example-get-started/code/README.md b/example-get-started/code/README.md
@@ -1,27 +1,22 @@
 # DVC Get Started
 
-This is an auto-generated repository for use in https://dvc.org/doc/get-started.
-Please report any issues in its source project,
-[example-repos-dev](https://github.com/iterative/example-repos-dev).
-
-![](https://dvc.org/static/img/example-flow-2x.png)
+This is an auto-generated repository for use in DVC
+[Get Started](https://dvc.org/doc/get-started). It is a step-by-step quick
+introduction to basic DVC concepts.
 
-_Get Started_ is a step-by-step introduction into basic DVC concepts. It doesn't
-go into details much, but provides links and expandable sections to learn more.
+![](https://dvc.org/img/example-flow-2x.png)
 
-> Note that this project
-[imports](https://dvc.org/doc/commands-reference/import) a dataset from
-https://github.com/iterative/dataset-registry.
+The project is a natural language processing (NLP) binary classification
+problem: predicting tags for a given StackOverflow question. For example, we
+want a classifier that recognizes a post about the Python language and tags it
+`python`.
 
-The idea of the project is a simplified version of the
-[Tutorial](https://dvc.org/doc/tutorial). It explores the natural language
-processing (NLP) problem of predicting tags for a given StackOverflow question.
-For example, we want one classifier which can predict a post that is about the
-Python language by tagging it `python`.
+🐛 Please report any issues found in this project to
+[example-repos-dev](https://github.com/iterative/example-repos-dev).
 
 ## Installation
 
-Start by cloning the project:
+Python 3.6+ is required to run code from this repo.
 
 ```console
 $ git clone https://github.com/iterative/example-get-started
@@ -60,14 +55,10 @@ Run [`dvc repro`](https://man.dvc.org/repro) to reproduce the
 [pipeline](https://dvc.org/doc/commands-reference/pipeline):
 
 ```console
-$ dvc repro evaluate.dvc
+$ dvc repro
+Data and pipelines are up to date.
 ```
 
-> `dvc repro` requires a target [stage file](https://man.dvc.org/run)
-> ([DVC-file](https://dvc.org/doc/user-guide/dvc-file-format)) to reconstruct
-> and regenerate a pipeline. In this case we use `evaluate.dvc`, the last stage
-> in this project's pipeline.
-
 If you'd like to test commands like [`dvc push`](https://man.dvc.org/push),
 that require write access to the remote storage, the easiest way would be to set
 up a "local remote" on your file system:
@@ -93,30 +84,29 @@ are run in the DVC [get started](https://dvc.org/doc/get-started) guide. Feel
 free to checkout one of them and play with the DVC commands having the
 playground ready.
 
-- `0-empty`: Empty Git repository initialized.
-- `1-initialize`: DVC has been initialized. `.dvc/` with the cache directory
+- `0-git-init`: Empty Git repository initialized.
+- `1-dvc-init`: DVC has been initialized. `.dvc/` with the cache directory
   created.
-- `2-remote`: Remote HTTP storage initialized. It's a shared read only storage
-  that contains all data artifacts produced during next steps.
-- `3-add-file`: Raw data file `data.xml` downloaded and put under DVC control
-  with [`dvc add`](https://man.dvc.org/add). First DVC-file (`.dvc` file
-  extension) created.
-- `4-source`: Source code downloaded and put under Git control.
-- `5-preparation`: First stage file (DVC-file) created using
+- `2-track-data`: Raw data file `data.xml` downloaded and tracked with DVC using
+  [`dvc add`](https://man.dvc.org/add). First `.dvc` file created.
+- `3-config-remote`: Remote HTTP storage initialized. It's a shared read-only
+  storage that contains all data artifacts produced during the next steps.
+- `4-import-data`: Use `dvc import` to get the same `data.xml` from the DVC data
+  registry.
+- `5-source-code`: Source code downloaded and put into Git.
+- `6-prep-stage`: Create `dvc.yaml` and the first pipeline stage with
   [`dvc run`](https://man.dvc.org/run). It transforms XML data into TSV.
-- `6-featurization`: Feature extraction stage created. It takes data in TSV
-  format and produces two `.pkl` files that contain serialized feature matrices.
-- `7-train`: Model training stage created. It produces `model.pkl` file – the
-  actual result that can then get deployed to an app that implements NLP
-  classification.
-- `8-evaluate`: Evaluation stage. Runs the model on a test dataset to produce
+- `8-ml-pipeline`: Feature extraction and training stages created. The first
+  takes TSV data and produces two `.pkl` files with serialized feature matrices.
+  The second runs a random forest classifier and creates the `model.pkl` file.
+- `9-evaluate`: Evaluation stage. Runs the model on a test dataset to produce
   its performance AUC value. The result is dumped into a DVC metric file so that
   we can compare it with other experiments later.
-- `9-bigrams-model`: Bigrams experiment, code has been modified to extract more
+- `10-bigrams-model`: Bigrams experiment; code has been modified to extract more
   features. We run [`dvc repro`](https://man.dvc.org/repro) for the first time
   to illustrate how DVC can reuse cached files and detect changes along the
   computational graph, regenerating the model with the updated data.
-- `10-bigrams-experiment`: Reproduce the evaluation stage with the bigrams based
+- `11-bigrams-experiment`: Reproduce the evaluation stage with the bigrams-based
   model.
 
 There are two additional tags:
@@ -131,32 +121,33 @@ These tags can be used to illustrate `-a` or `-T` options across different
 
 ## Project structure
 
-The data files, DVC-files, and results change as stages are created one by one.
+The data files, DVC files, and results change as stages are created one by one.
 After cloning and using [`dvc pull`](https://man.dvc.org/pull) to download data
-under DVC control, the workspace should look like this:
+tracked by DVC, the workspace should look like this:
 
 ```console
 $ tree
 .
-├── auc.metric            # <-- DVC metric compares baseline and bigrams
+├── README.md
 ├── data                  # <-- Directory with raw and intermediate data
+│   ├── data.xml          # <-- Initial XML StackOverflow dataset (raw data)
+│   ├── data.xml.dvc      # <-- .dvc file - a placeholder/pointer to raw data
 │   ├── features          # <-- Extracted feature matrices
 │   │   ├── test.pkl
 │   │   └── train.pkl
 │   └── prepared          # <-- Processed dataset (split and TSV formatted)
 │       ├── test.tsv
 │       └── train.tsv
-│   ├── data.xml          # <-- Initial XML StackOverflow dataset (raw data)
-│   ├── data.xml.dvc
-├── evaluate.dvc          # <-- DVC-files in the project root describe pipeline
-├── featurize.dvc
-├── model.pkl
-├── prepare.dvc
-├── src                   # <-- Source code to run the pipeline stages
-│   ├── evaluate.py
-│   ├── featurization.py
-│   ├── prepare.py
-│   └── train.py
-│   └── requirements.txt  # <-- Python dependencies needed in the project
-└── train.dvc
+├── dvc.lock
+├── dvc.yaml              # <-- DVC pipeline file
+├── model.pkl             # <-- Trained model file
+├── params.yaml           # <-- Parameters file
+├── prc.json              # <-- Precision-recall curve data points
+├── scores.json           # <-- Binary classifier final metrics (e.g. AUC)
+└── src                   # <-- Source code to run the pipeline stages
+    ├── evaluate.py
+    ├── featurization.py
+    ├── prepare.py
+    ├── requirements.txt  # <-- Python dependencies needed in the project
+    └── train.py
 ```
diff --git a/example-get-started/code/params.yaml b/example-get-started/code/params.yaml
@@ -0,0 +1,11 @@
+prepare:
+  split: 0.20
+  seed: 20170428
+
+featurize:
+  max_features: 500
+  ngrams: 1
+
+train:
+  seed: 20170428
+  n_estimators: 50
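A note on `featurize.ngrams`: `featurization.py` passes it to `CountVectorizer` as `ngram_range=(1, ngrams)`, so `1` keeps single words only and `2` adds bigrams. The sketch below mimics that range in plain Python to show what the setting changes. `word_ngrams` is a hypothetical helper written for illustration; it is not part of this repo or of scikit-learn, and it skips tokenization and stop-word filtering.

```python
def word_ngrams(tokens, max_n):
    """Return all word n-grams for n = 1..max_n (inclusive).

    Illustrative stand-in for the effect of ngram_range=(1, max_n);
    the real vectorizer also tokenizes and drops stop words.
    """
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


# Example tokens are made up for illustration.
tokens = ["python", "list", "comprehension"]

unigrams_only = word_ngrams(tokens, 1)   # corresponds to ngrams: 1
with_bigrams = word_ngrams(tokens, 2)    # corresponds to ngrams: 2

print(unigrams_only)
print(with_bigrams)
```

With `ngrams: 2` the candidate vocabulary grows, which is why `max_features` matters more once bigrams are enabled.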
diff --git a/example-get-started/code/src/evaluate.py b/example-get-started/code/src/evaluate.py
@@ -1,23 +1,20 @@
 import sys
 import os
+import pickle
 import json
 
 from sklearn.metrics import precision_recall_curve
 import sklearn.metrics as metrics
 
-try:
-    import cPickle as pickle
-except ImportError:
-    import pickle
-
-if len(sys.argv) != 4:
+if len(sys.argv) != 5:
     sys.stderr.write('Arguments error. Usage:\n')
-    sys.stderr.write('\tpython evaluate.py model features output\n')
+    sys.stderr.write('\tpython evaluate.py model features scores plots\n')
     sys.exit(1)
 
 model_file = sys.argv[1]
 matrix_file = os.path.join(sys.argv[2], 'test.pkl')
-metrics_file = sys.argv[3]
+scores_file = sys.argv[3]
+plots_file = sys.argv[4]
 
 with open(model_file, 'rb') as fd:
     model = pickle.load(fd)
@@ -35,5 +32,13 @@
 
 auc = metrics.auc(recall, precision)
 
-with open(metrics_file, 'w') as fd:
-    json.dump({"AUC": auc}, fd)
+with open(scores_file, 'w') as fd:
+    json.dump({'auc': auc}, fd)
+
+with open(plots_file, 'w') as fd:
+    json.dump({'prc': [{
+            'precision': p,
+            'recall': r,
+            'threshold': t
+        } for p, r, t in zip(precision, recall, thresholds)
+    ]}, fd)
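A note on the shape of the plots file: `precision_recall_curve` returns `precision` and `recall` arrays that are one element longer than `thresholds`, so the `zip` in the hunk above stops at the shortest input and the final point, which has no threshold, is left out. The sketch below reproduces that behavior with plain lists; the numbers are made up for illustration and are not taken from the project.

```python
import json

# Hypothetical values standing in for the arrays returned by
# sklearn.metrics.precision_recall_curve: precision and recall each
# have one more element than thresholds.
precision = [0.5, 0.67, 1.0, 1.0]
recall = [1.0, 1.0, 0.5, 0.0]
thresholds = [0.2, 0.4, 0.8]

# zip() stops at the shortest input, so the last (precision, recall)
# pair, which has no matching threshold, is not written.
points = [
    {"precision": p, "recall": r, "threshold": t}
    for p, r, t in zip(precision, recall, thresholds)
]

payload = json.dumps({"prc": points})
print(payload)
```

Each entry in `prc` is one point of the curve, a flat list of records that is straightforward to plot or compare between experiments.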
diff --git a/example-get-started/code/src/featurization.py b/example-get-started/code/src/featurization.py
@@ -1,82 +1,72 @@
 import os
 import sys
-import errno
 import pandas as pd
 import numpy as np
+import pickle
 import scipy.sparse as sparse
+import yaml
 
 from sklearn.feature_extraction.text import CountVectorizer
 from sklearn.feature_extraction.text import TfidfTransformer
 
-try:
-    import cPickle as pickle
-except ImportError:
-    import pickle
+params = yaml.safe_load(open('params.yaml'))['featurize']
 
 np.set_printoptions(suppress=True)
 
 if len(sys.argv) != 3 and len(sys.argv) != 5:
-    sys.stderr.write("Arguments error. Usage:\n")
+    sys.stderr.write('Arguments error. Usage:\n')
     sys.stderr.write(
-        "\tpython featurization.py data-dir-path features-dir-path\n")
+        '\tpython featurization.py data-dir-path features-dir-path\n'
+    )
     sys.exit(1)
 
-train_input = os.path.join(sys.argv[1], "train.tsv")
-test_input = os.path.join(sys.argv[1], "test.tsv")
-train_output = os.path.join(sys.argv[2], "train.pkl")
-test_output = os.path.join(sys.argv[2], "test.pkl")
-
-try:
-    reload(sys)
-    sys.setdefaultencoding("utf-8")
-except NameError:
-    pass
-
+train_input = os.path.join(sys.argv[1], 'train.tsv')
+test_input = os.path.join(sys.argv[1], 'test.tsv')
+train_output = os.path.join(sys.argv[2], 'train.pkl')
+test_output = os.path.join(sys.argv[2], 'test.pkl')
 
-def mkdir_p(path):
-    try:
-        os.makedirs(path)
-    except OSError as exc:  # Python >2.5
-        if exc.errno == errno.EEXIST and os.path.isdir(path):
-            pass
-        else:
-            raise
+max_features = params['max_features']
+ngrams = params['ngrams']
 
 
 def get_df(data):
     df = pd.read_csv(
         data,
-        encoding="utf-8",
+        encoding='utf-8',
         header=None,
-        delimiter="\t",
-        names=["id", "label", "text"],
+        delimiter='\t',
+        names=['id', 'label', 'text']
     )
-    sys.stderr.write(
-        "The input data frame {} size is {}\n".format(data, df.shape))
+    sys.stderr.write(f'The input data frame {data} size is {df.shape}\n')
     return df
 
 
 def save_matrix(df, matrix, output):
     id_matrix = sparse.csr_matrix(df.id.astype(np.int64)).T
     label_matrix = sparse.csr_matrix(df.label.astype(np.int64)).T
 
-    result = sparse.hstack([id_matrix, label_matrix, matrix], format="csr")
+    result = sparse.hstack([id_matrix, label_matrix, matrix], format='csr')
 
-    msg = "The output matrix {} size is {} and data type is {}\n"
+    msg = 'The output matrix {} size is {} and data type is {}\n'
     sys.stderr.write(msg.format(output, result.shape, result.dtype))
 
-    with open(output, "wb") as fd:
+    with open(output, 'wb') as fd:
         pickle.dump(result, fd, pickle.HIGHEST_PROTOCOL)
     pass
 
 
-mkdir_p(sys.argv[2])
+os.makedirs(sys.argv[2], exist_ok=True)
 
 # Generate train feature matrix
 df_train = get_df(train_input)
-train_words = np.array(df_train.text.str.lower().values.astype("U"))
+train_words = np.array(df_train.text.str.lower().values.astype('U'))
+
+bag_of_words = CountVectorizer(
+    stop_words='english',
+    max_features=max_features,
+    ngram_range=(1, ngrams)
+)
 
-bag_of_words = CountVectorizer(stop_words="english", max_features=5000)
 bag_of_words.fit(train_words)
 train_words_binary_matrix = bag_of_words.transform(train_words)
 tfidf = TfidfTransformer(smooth_idf=False)
@@ -87,7 +77,7 @@ def save_matrix(df, matrix, output):
 
 # Generate test feature matrix
 df_test = get_df(test_input)
-test_words = np.array(df_test.text.str.lower().values.astype("U"))
+test_words = np.array(df_test.text.str.lower().values.astype('U'))
 test_words_binary_matrix = bag_of_words.transform(test_words)
 test_words_tfidf_matrix = tfidf.transform(test_words_binary_matrix)