We apply a model that uses BERT as a backbone to two similar problems:
- Google QUEST Q&A Labeling: assign 30 scores (each between 0 and 1) to a question-answer pair.
- Tweet sentiment analysis: assign a sentiment (positive, negative, or neutral) to a tweet. This task differs from the original task of the competition.
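The two tasks differ mainly in how the model's final logits are interpreted: 30 independent sigmoid scores for QUEST versus a softmax over three classes for sentiment. A minimal pure-Python sketch of that difference (illustrative only; the variable names and logit values are hypothetical, not the repository's code):

```python
import math

def sigmoid(x):
    """Squash a single logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

# QUEST: 30 independent scores, each in [0, 1] (zero logits -> 0.5 each).
quest_logits = [0.0] * 30
quest_scores = [sigmoid(v) for v in quest_logits]

# Tweet sentiment: a single class out of three (hypothetical logits).
sentiment_logits = [2.0, 0.5, -1.0]
probs = softmax(sentiment_logits)
label = ["positive", "negative", "neutral"][probs.index(max(probs))]
```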
- Download the data from here, modify the variables `DATA_DIR`, `RESULTS_DIR` from `.env` and load it:

  ```shell
  source .env
  ```

- Create a python virtual environment and install the dependencies:

  ```shell
  conda create -n nlu python=3.6 -y
  conda activate nlu
  pip install -r requirements.txt
  python setup.py install
  ```
## Tweet sentiment analysis
- Train the model (remove `size_tr_val` to use the complete dataset; `size_val` refers to the size of the validation dataset):

  ```shell
  python exec/train_tweet_sentiment.py \
      --data_path "${DATA_DIR}/train.csv" \
      --model_dir "${RESULTS_DIR}/models" \
      --log_dir "${RESULTS_DIR}/logs" \
      --size_val 2700 \
      --batch_size 50 \
      --num_epochs 10 \
      --print_freq 200 \
      --seed 10
  ```

- Results from the training can be visualised with tensorboard:

  ```shell
  tensorboard --logdir=${RESULTS_DIR}/logs
  ```

  or within a Jupyter notebook:

  ```
  %reload_ext tensorboard
  %tensorboard --logdir <logs directory>
  ```

  The logs from this training session are available in the `logs` directory (`tensorboard --logdir=logs`).

## Google QUEST Q&A Labeling (WIP)
TODO: the metric logger has to be improved in order to better assess how well the model performs. At the moment we only record the binary cross-entropy for each of the 30 scores that have to be assigned to a question-answer pair.
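The quantity recorded above can be sketched in plain Python (the function and variable names here are illustrative, not the repository's actual logger; the predictions and targets are made-up values):

```python
import math

def binary_cross_entropy(pred, target, eps=1e-7):
    """BCE for a single predicted probability against a 0/1-valued target."""
    p = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

# One BCE value per score; in the real task there would be 30 of them.
preds = [0.9, 0.1, 0.5]
targets = [1.0, 0.0, 1.0]
per_score = [binary_cross_entropy(p, t) for p, t in zip(preds, targets)]
mean_bce = sum(per_score) / len(per_score)
```

A confident correct prediction (0.9 vs target 1) contributes a small loss, while an uncertain one (0.5 vs target 1) contributes roughly `-log(0.5) ≈ 0.693`.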
- Train the model (remove `size_tr_val` to use the complete dataset; `size_val` refers to the size of the validation dataset):

  ```shell
  python exec/train_google_qa.py \
      --data_path ${DATA_DIR}/train.csv \
      --model_dir ${RESULTS_DIR}/models \
      --log_dir ${RESULTS_DIR}/logs \
      --size_tr_val 100 \
      --size_val 40 \
      --batch_size 6 \
      --num_epochs 2 \
      --print_freq 10 \
      --seed 10
  ```
- Make a prediction (only for the first 100 elements from the test set):

  ```shell
  python exec/predict_google_qa.py \
      --data_path ${DATA_DIR}/test.csv \
      --result_dir ${RESULTS_DIR}/results \
      --model_dir ${RESULTS_DIR}/models \
      --load_epoch 1 \
      --batch_size 2 \
      --n_el 100
  ```