101 changes: 46 additions & 55 deletions example-get-started/code/README.md
@@ -1,27 +1,22 @@
# DVC Get Started

This is an auto-generated repository for use in https://dvc.org/doc/get-started.
Please report any issues in its source project,
[example-repos-dev](https://github.com/iterative/example-repos-dev).

![](https://dvc.org/static/img/example-flow-2x.png)
This is an auto-generated repository for use in DVC
[Get Started](https://dvc.org/doc/get-started). It is a quick, step-by-step
introduction to basic DVC concepts.

_Get Started_ is a step-by-step introduction into basic DVC concepts. It doesn't
go into details much, but provides links and expandable sections to learn more.
![](https://dvc.org/img/example-flow-2x.png)

> Note that this project
[imports](https://dvc.org/doc/commands-reference/import) a dataset from
https://github.com/iterative/dataset-registry.
The project is a natural language processing (NLP) binary classifier problem of
predicting tags for a given StackOverflow question. For example, we want one
classifier which can predict a post that is about the Python language by tagging
it `python`.

The idea of the project is a simplified version of the
[Tutorial](https://dvc.org/doc/tutorial). It explores the natural language
processing (NLP) problem of predicting tags for a given StackOverflow question.
For example, we want one classifier which can predict a post that is about the
Python language by tagging it `python`.
🐛 Please report any issues found in this project here -
[example-repos-dev](https://github.com/iterative/example-repos-dev).

## Installation

Start by cloning the project:
Python 3.6+ is required to run code from this repo.

```console
$ git clone https://github.com/iterative/example-get-started
@@ -60,14 +55,10 @@ Run [`dvc repro`](https://man.dvc.org/repro) to reproduce the
[pipeline](https://dvc.org/doc/commands-reference/pipeline):

```console
$ dvc repro evaluate.dvc
$ dvc repro
Data and pipelines are up to date.
```

> `dvc repro` requires a target [stage file](https://man.dvc.org/run)
> ([DVC-file](https://dvc.org/doc/user-guide/dvc-file-format)) to reconstruct
> and regenerate a pipeline. In this case we use `evaluate.dvc`, the last stage
> in this project's pipeline.

If you'd like to test commands like [`dvc push`](https://man.dvc.org/push),
that require write access to the remote storage, the easiest way would be to set
up a "local remote" on your file system:
@@ -93,30 +84,29 @@ are run in the DVC [get started](https://dvc.org/doc/get-started) guide. Feel
free to check out one of them and play with the DVC commands in a ready-made
playground.

- `0-empty`: Empty Git repository initialized.
- `1-initialize`: DVC has been initialized. `.dvc/` with the cache directory
- `0-git-init`: Empty Git repository initialized.
- `1-dvc-init`: DVC has been initialized. `.dvc/` with the cache directory
created.
- `2-remote`: Remote HTTP storage initialized. It's a shared read only storage
that contains all data artifacts produced during next steps.
- `3-add-file`: Raw data file `data.xml` downloaded and put under DVC control
with [`dvc add`](https://man.dvc.org/add). First DVC-file (`.dvc` file
extension) created.
- `4-source`: Source code downloaded and put under Git control.
- `5-preparation`: First stage file (DVC-file) created using
- `2-track-data`: Raw data file `data.xml` downloaded and tracked with DVC using
[`dvc add`](https://man.dvc.org/add). First `.dvc` file created.
- `3-config-remote`: Remote HTTP storage initialized. It's a shared read-only
storage that contains all data artifacts produced during the next steps.
- `4-import-data`: Use `dvc import` to get the same `data.xml` from the DVC data
registry.
- `5-source-code`: Source code downloaded and put into Git.
- `6-prep-stage`: Create `dvc.yaml` and the first pipeline stage with
[`dvc run`](https://man.dvc.org/run). It transforms XML data into TSV.
- `6-featurization`: Feature extraction stage created. It takes data in TSV
format and produces two `.pkl` files that contain serialized feature matrices.
- `7-train`: Model training stage created. It produces `model.pkl` file – the
actual result that can then get deployed to an app that implements NLP
classification.
- `8-evaluate`: Evaluation stage. Runs the model on a test dataset to produce
- `8-ml-pipeline`: Feature extraction and training stages created. Feature
extraction takes data in TSV format and produces two `.pkl` files that contain
serialized feature matrices. Training runs a random forest classifier and
creates the `model.pkl` file.
- `9-evaluate`: Evaluation stage. Runs the model on a test dataset to produce
its performance AUC value. The result is dumped into a DVC metric file so that
we can compare it with other experiments later.
- `9-bigrams-model`: Bigrams experiment, code has been modified to extract more
- `10-bigrams-model`: Bigrams experiment, code has been modified to extract more
features. We run [`dvc repro`](https://man.dvc.org/repro) for the first time
to illustrate how DVC can reuse cached files and detect changes along the
computational graph, regenerating the model with the updated data.
- `10-bigrams-experiment`: Reproduce the evaluation stage with the bigrams based
- `11-bigrams-experiment`: Reproduce the evaluation stage with the bigrams based
model.

There are two additional tags:
@@ -131,32 +121,33 @@ These tags can be used to illustrate `-a` or `-T` options across different

## Project structure

The data files, DVC-files, and results change as stages are created one by one.
The data files, DVC files, and results change as stages are created one by one.
After cloning and using [`dvc pull`](https://man.dvc.org/pull) to download data
under DVC control, the workspace should look like this:
tracked by DVC, the workspace should look like this:

```console
$ tree
.
├── auc.metric # <-- DVC metric compares baseline and bigrams
├── README.md
├── data # <-- Directory with raw and intermediate data
│   ├── data.xml # <-- Initial XML StackOverflow dataset (raw data)
│   ├── data.xml.dvc # <-- .dvc file - a placeholder/pointer to raw data
│   ├── features # <-- Extracted feature matrices
│   │   ├── test.pkl
│   │   └── train.pkl
│   └── prepared # <-- Processed dataset (split and TSV formatted)
│       ├── test.tsv
│       ├── train.tsv
│   ├── data.xml # <-- Initial XML StackOverflow dataset (raw data)
│   ├── data.xml.dvc
├── evaluate.dvc # <-- DVC-files in the project root describe pipeline
├── featurize.dvc
├── model.pkl
├── prepare.dvc
├── src # <-- Source code to run the pipeline stages
│   ├── evaluate.py
│   ├── featurization.py
│   ├── prepare.py
│   └── train.py
│   └── requirements.txt # <-- Python dependencies needed in the project
└── train.dvc
├── dvc.lock
├── dvc.yaml # <-- DVC pipeline file
├── model.pkl # <-- Trained model file
├── params.yaml # <-- Parameters file
├── prc.json # <-- Precision-recall curve data points
├── scores.json # <-- Binary classifier final metrics (e.g. AUC)
└── src # <-- Source code to run the pipeline stages
    ├── evaluate.py
    ├── featurization.py
    ├── prepare.py
    ├── requirements.txt # <-- Python dependencies needed in the project
    └── train.py
```
11 changes: 11 additions & 0 deletions example-get-started/code/params.yaml
@@ -0,0 +1,11 @@
prepare:
  split: 0.20
  seed: 20170428

featurize:
  max_features: 500
  ngrams: 1

train:
  seed: 20170428
  n_estimators: 50
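
As a rough illustration of how a stage consumes this file: `featurization.py` in this change loads it with `yaml.safe_load(open('params.yaml'))` and keeps only its own `featurize` section. The sketch below skips the YAML step and starts from the nested mapping that a YAML loader would return for the file above; the `PARAMS` constant and the `stage_params` helper are hypothetical names used only for this example.

```python
# Hypothetical stand-in for the parsed params.yaml (assumption: this is the
# mapping a YAML loader returns for the file above).
PARAMS = {
    'prepare': {'split': 0.20, 'seed': 20170428},
    'featurize': {'max_features': 500, 'ngrams': 1},
    'train': {'seed': 20170428, 'n_estimators': 50},
}


def stage_params(params, stage):
    """Return the parameter section for one stage, failing loudly if absent."""
    if stage not in params:
        raise KeyError(f'no section {stage!r} in params')
    return params[stage]


featurize = stage_params(PARAMS, 'featurize')

# Same tuple shape that featurization.py passes as ngram_range.
ngram_range = (1, featurize['ngrams'])
print(featurize['max_features'], ngram_range)
```

Keeping one section per stage means each script only depends on the keys under its own name, so changing `train.n_estimators` does not touch what `featurize` reads.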
25 changes: 15 additions & 10 deletions example-get-started/code/src/evaluate.py
@@ -1,23 +1,20 @@
import sys
import os
import pickle
import json

from sklearn.metrics import precision_recall_curve
import sklearn.metrics as metrics

try:
    import cPickle as pickle
except ImportError:
    import pickle

if len(sys.argv) != 4:
if len(sys.argv) != 5:
    sys.stderr.write('Arguments error. Usage:\n')
    sys.stderr.write('\tpython evaluate.py model features output\n')
    sys.stderr.write('\tpython evaluate.py model features scores plots\n')
    sys.exit(1)

model_file = sys.argv[1]
matrix_file = os.path.join(sys.argv[2], 'test.pkl')
metrics_file = sys.argv[3]
scores_file = sys.argv[3]
plots_file = sys.argv[4]

with open(model_file, 'rb') as fd:
    model = pickle.load(fd)
@@ -35,5 +32,13 @@

auc = metrics.auc(recall, precision)

with open(metrics_file, 'w') as fd:
    json.dump({"AUC": auc}, fd)
with open(scores_file, 'w') as fd:
    json.dump({'auc': auc}, fd)

with open(plots_file, 'w') as fd:
    json.dump({'prc': [{
            'precision': p,
            'recall': r,
            'threshold': t
        } for p, r, t in zip(precision, recall, thresholds)
    ]}, fd)
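
One detail worth keeping in mind about the plots file: scikit-learn's `precision_recall_curve` returns `precision` and `recall` arrays with one more element than `thresholds`, and `zip` stops at the shortest input, so the final point (which has no threshold) is not written. The sketch below uses made-up numbers in plain lists, not real model output, to show the resulting shape of the `prc` payload.

```python
import json

# Made-up values shaped like precision_recall_curve output (assumption for
# illustration only): precision and recall have one more entry than thresholds.
precision = [0.5, 0.67, 1.0, 1.0]
recall = [1.0, 1.0, 0.5, 0.0]
thresholds = [0.1, 0.4, 0.8]

# Same comprehension shape as in evaluate.py; zip truncates to len(thresholds).
points = [
    {'precision': p, 'recall': r, 'threshold': t}
    for p, r, t in zip(precision, recall, thresholds)
]

payload = json.dumps({'prc': points})
print(len(points))
print(payload)
```

Each entry in the list is one point of the precision-recall curve, which is the row-per-point layout that the new `prc.json` file uses.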
66 changes: 28 additions & 38 deletions example-get-started/code/src/featurization.py
@@ -1,82 +1,72 @@
import os
import sys
import errno
import pandas as pd
import numpy as np
import pickle
import scipy.sparse as sparse
import yaml

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

try:
    import cPickle as pickle
except ImportError:
    import pickle
params = yaml.safe_load(open('params.yaml'))['featurize']

np.set_printoptions(suppress=True)

if len(sys.argv) != 3 and len(sys.argv) != 5:
    sys.stderr.write("Arguments error. Usage:\n")
    sys.stderr.write('Arguments error. Usage:\n')
    sys.stderr.write(
        "\tpython featurization.py data-dir-path features-dir-path\n")
        '\tpython featurization.py data-dir-path features-dir-path\n'
    )
    sys.exit(1)

train_input = os.path.join(sys.argv[1], "train.tsv")
test_input = os.path.join(sys.argv[1], "test.tsv")
train_output = os.path.join(sys.argv[2], "train.pkl")
test_output = os.path.join(sys.argv[2], "test.pkl")

try:
    reload(sys)
    sys.setdefaultencoding("utf-8")
except NameError:
    pass

train_input = os.path.join(sys.argv[1], 'train.tsv')
test_input = os.path.join(sys.argv[1], 'test.tsv')
train_output = os.path.join(sys.argv[2], 'train.pkl')
test_output = os.path.join(sys.argv[2], 'test.pkl')

def mkdir_p(path):
    try:
        os.makedirs(path)
    except OSError as exc:  # Python >2.5
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else:
            raise
max_features = params['max_features']
ngrams = params['ngrams']


def get_df(data):
    df = pd.read_csv(
        data,
        encoding="utf-8",
        encoding='utf-8',
        header=None,
        delimiter="\t",
        names=["id", "label", "text"],
        delimiter='\t',
        names=['id', 'label', 'text']
    )
    sys.stderr.write(
        "The input data frame {} size is {}\n".format(data, df.shape))
    sys.stderr.write(f'The input data frame {data} size is {df.shape}\n')
    return df


def save_matrix(df, matrix, output):
    id_matrix = sparse.csr_matrix(df.id.astype(np.int64)).T
    label_matrix = sparse.csr_matrix(df.label.astype(np.int64)).T

    result = sparse.hstack([id_matrix, label_matrix, matrix], format="csr")
    result = sparse.hstack([id_matrix, label_matrix, matrix], format='csr')

    msg = "The output matrix {} size is {} and data type is {}\n"
    msg = 'The output matrix {} size is {} and data type is {}\n'
    sys.stderr.write(msg.format(output, result.shape, result.dtype))

    with open(output, "wb") as fd:
    with open(output, 'wb') as fd:
        pickle.dump(result, fd, pickle.HIGHEST_PROTOCOL)
    pass


mkdir_p(sys.argv[2])
os.makedirs(sys.argv[2], exist_ok=True)

# Generate train feature matrix
df_train = get_df(train_input)
train_words = np.array(df_train.text.str.lower().values.astype("U"))
train_words = np.array(df_train.text.str.lower().values.astype('U'))

bag_of_words = CountVectorizer(
    stop_words='english',
    max_features=max_features,
    ngram_range=(1, ngrams)
)

bag_of_words = CountVectorizer(stop_words="english", max_features=5000)
bag_of_words.fit(train_words)
train_words_binary_matrix = bag_of_words.transform(train_words)
tfidf = TfidfTransformer(smooth_idf=False)
@@ -87,7 +77,7 @@ def save_matrix(df, matrix, output):

# Generate test feature matrix
df_test = get_df(test_input)
test_words = np.array(df_test.text.str.lower().values.astype("U"))
test_words = np.array(df_test.text.str.lower().values.astype('U'))
test_words_binary_matrix = bag_of_words.transform(test_words)
test_words_tfidf_matrix = tfidf.transform(test_words_binary_matrix)
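
To make the effect of the new `ngrams` parameter concrete: it is passed as `ngram_range=(1, ngrams)`, so `ngrams: 1` keeps single words only, while `ngrams: 2` also adds adjacent word pairs, which is what the bigrams experiment in the README relies on. The sketch below is a rough pure-Python approximation; `word_ngrams` is a hypothetical helper, and its whitespace tokenization is a simplification, since `CountVectorizer` applies its own token pattern and stop-word filtering.

```python
def word_ngrams(text, max_n):
    """Return all word n-grams of length 1..max_n, shortest lengths first."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams


print(word_ngrams('Parse XML in Python', 1))  # unigrams only
print(word_ngrams('Parse XML in Python', 2))  # unigrams plus bigrams
```

Raising `ngrams` grows the candidate vocabulary, while `max_features` still caps how many of the most frequent terms end up in the feature matrix.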
