Merged
65 commits
db3ec23
merge
Apr 3, 2017
bc9154d
added method to add secondary output layer
Apr 4, 2017
1afbf9f
implemented pre-training on subreddits
Apr 4, 2017
95b779e
create methods for handling pre-training data
Apr 4, 2017
4afbf98
non-working commit to allow for debugg help.
Apr 4, 2017
193811d
changed which variable was used to check if pre-training
Apr 5, 2017
e5be638
only add sec_output when pre-train and add an extra layer after
Apr 5, 2017
d94d187
changed config to not include unnecissary parameters
Apr 5, 2017
fa9859c
cleaned main.py by adding it to builder
Apr 5, 2017
1de1133
sec_output uses softmax since a title only have one subreddit
Apr 5, 2017
5c2db6c
refactored add output method to one method
Apr 5, 2017
2374ed4
Adds the option to choose between GRU and LSTM units
hsson Apr 5, 2017
f46c8ce
Adds a missing variable name change
hsson Apr 5, 2017
634d79b
Changes default dataset in config template
hsson Apr 5, 2017
d73a5c4
merge
Apr 3, 2017
468d07d
added method to add secondary output layer
Apr 4, 2017
2e957a3
implemented pre-training on subreddits
Apr 4, 2017
796a81d
create methods for handling pre-training data
Apr 4, 2017
81c149d
non-working commit to allow for debugg help.
Apr 4, 2017
f5c49ed
changed which variable was used to check if pre-training
Apr 5, 2017
8c5cb78
only add sec_output when pre-train and add an extra layer after
Apr 5, 2017
7383a0b
cleaned main.py by adding it to builder
Apr 5, 2017
157987d
sec_output uses softmax since a title only have one subreddit
Apr 5, 2017
90086ab
refactored add output method to one method
Apr 5, 2017
2561556
Merge branch 'feature/pre-train-subreddit' of github.com:kandidat-hig…
Apr 5, 2017
65cf80c
Refactors away redundant function in model builder
hsson Apr 5, 2017
fbe203f
Removes un-used constant
hsson Apr 5, 2017
ac40573
Removes unused epoch counting for pre-training
hsson Apr 5, 2017
8158619
Merge pull request #15 from kandidat-highlights/feature/pre-train-sub…
hsson Apr 5, 2017
5f2d82a
Removes redundant printing of cross entropy error
hsson Apr 5, 2017
1606e4f
Removes bug with incorrect matrix assignment
hsson Apr 5, 2017
7e70291
Uses a uniformly random embedding matrix as default if no pre-trained…
hsson Apr 5, 2017
2eb5af9
Merge pull request #18 from kandidat-highlights/bugfix/embedding-matrix
Mxiim Apr 5, 2017
961b30c
changed subreddit input to be a one hot vector
Apr 6, 2017
7a2e256
removed bias from l2_reg
Apr 6, 2017
7619051
main handels exceptions
Apr 6, 2017
fc8a0e2
chande to actual variable
Apr 6, 2017
73b5a23
Merge pull request #21 from kandidat-highlights/refactoring/subreddit…
Mxiim Apr 10, 2017
ee1a015
Fixes bug when we would log twice for pretraining and actual training
Mxiim Apr 11, 2017
3358a0e
Fixes bug when we couldnt use concat-input AND pretraining for a model
Mxiim Apr 12, 2017
515489e
only make one call to label_vector
Apr 12, 2017
b856b66
Adds docker support
jonatanalmen Apr 13, 2017
8319891
Automatically run all configs if none is specified
hsson Apr 13, 2017
c1726bb
still need to fill in all ranges
Apr 6, 2017
fcd3bb8
all but the datasets
Apr 9, 2017
112a5d1
added dataset
Apr 10, 2017
f600402
Removes wrong files and adds two more learning rates
Mxiim Apr 10, 2017
a8f23e6
fixed bugg
Apr 10, 2017
97997d2
Changes user count to take UNK user into account
hsson Apr 10, 2017
762e8f0
Removes dataset that doesn't exist
hsson Apr 10, 2017
d01a683
Increases interval between hidden neuron sizes
hsson Apr 10, 2017
c2bc3ef
Increases prediction limit intervals
hsson Apr 10, 2017
4bed82e
Tries more common batch sizes
hsson Apr 10, 2017
8883e78
USe consistent step between lstm neurons
hsson Apr 10, 2017
16acae6
Uses dropout probabilities from original dropout paper
hsson Apr 10, 2017
51bde47
Uses new hyperparameters from dev
hsson Apr 10, 2017
72fdb3b
Adds more possible l2 factors
hsson Apr 10, 2017
9e68557
Adds 0 for hidden layer, adds 400 for lstm_neurons and removes 100 fr…
Mxiim Apr 12, 2017
50c14aa
Fixes indentation
Mxiim Apr 12, 2017
45d6370
network name is guaranteed unique
Apr 13, 2017
d479982
Adds a script that automatically downloads the data needed
hsson Apr 13, 2017
97f0ffb
Refactors all double precision floating points to single precision to…
hsson Apr 13, 2017
2e9a3b4
Merge pull request #25 from kandidat-highlights/refactor/floating-pre…
Mxiim Apr 13, 2017
233b090
Resets graph even if it failed
Mxiim Apr 16, 2017
0a35092
Changed the header to include pre_train_subreddit
Apr 17, 2017
5 changes: 5 additions & 0 deletions Dockerfile
@@ -0,0 +1,5 @@
from gcr.io/tensorflow/tensorflow:latest-gpu-py3
WORKDIR /app
COPY ./project /app
RUN pip install -r requirements.txt

5 changes: 5 additions & 0 deletions README.md
@@ -26,3 +26,8 @@ For more details, take a look at the [dataset repository](https://github.com/kan
## Configuration
To edit configs, take a look at the `config.yaml` file. Please prefer making new configs instead of editing old (for academic purposes). If implementing a new model, make sure to add support for it in the `main.py` file so its configs can be automatically parsed.

## Build/Run with Docker

Build with ```docker build -t YOURTAG .```

Run with ```nvidia-docker run [-v YOURLOGDIR:/app/logs] -t --rm YOURTAG python -u ./YOURENTRYPOINT.py```
11 changes: 7 additions & 4 deletions config.template.yaml
@@ -8,9 +8,9 @@ network:
vocabulary_size: 10000
user_count: 6
max_title_length: 30
validation_data: 'validation_data_top_n_single.csv'
training_data: 'training_data_top_n_single.csv'
testing_data: 'testing_data_top_n_single.csv'
validation_data: 'validation_data_top_5_subreddit_allvotes.csv'
training_data: 'training_data_top_5_subreddit_allvotes.csv'
testing_data: 'testing_data_top_5_subreddit_allvotes.csv'
# Embedding matrix configs:
embedding_size: 150 # Make sure to match pretrained matrix dimensions
trainable_matrix: true
@@ -20,10 +20,13 @@ network:
learning_rate: 0.5
training_epochs: 5
batch_size: 25
lstm_neurons: 200
rnn_neurons: 200
rnn_unit: 'lstm' # Can be 'gru' or 'lstm', default: 'lstm'
hidden_layers: 0
hidden_neurons: 300
subreddit_input_neurons: 10 # Probably not the best default value
use_concat_input: false
pre_train_subreddit: false
# Regularisation configs:
use_l2_loss: false
l2_factor: 0.01
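The config template above introduces the `rnn_unit` option for switching between GRU and LSTM units. The following is a minimal sketch of how such an option might be dispatched on; the `make_rnn_cell` factory and its tuple return values are illustrative stand-ins, not the repository's actual TensorFlow cell construction.

```python
# Hypothetical factory mirroring the rnn_unit config option: pick a cell
# type by name, defaulting to 'lstm'. The returned tuples stand in for
# real RNN cell objects (e.g. TensorFlow's GRUCell / LSTMCell).
def make_rnn_cell(unit, neurons):
    """Return a (cell_type, size) pair for the requested RNN unit."""
    unit = (unit or 'lstm').lower()
    if unit == 'gru':
        return ('GRUCell', neurons)
    if unit == 'lstm':
        return ('LSTMCell', neurons)
    raise ValueError("rnn_unit must be 'gru' or 'lstm', got %r" % unit)
```

Validating the option eagerly like this surfaces a typo in `config.yaml` at startup rather than deep inside graph construction.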
5 changes: 4 additions & 1 deletion definitions.py
@@ -40,9 +40,11 @@
LEARN_RATE = 'learning_rate'
EMBEDD_SIZE = 'embedding_size'
MAX_TITLE_LENGTH = 'max_title_length'
LSTM_NEURONS = 'lstm_neurons'
RNN_NEURONS = 'rnn_neurons'
RNN_UNIT = "rnn_unit"
HIDDEN_NEURONS = 'hidden_neurons'
HIDDEN_LAYERS = 'hidden_layers'
SUB_INPUT_NEURONS = 'subreddit_input_neurons'
USE_CONCAT_INPUT = 'use_concat_input'
BATCH_SIZE = 'batch_size'
TRAINING_EPOCHS = 'training_epochs'
@@ -55,6 +57,7 @@
TRAINABLE_MATRIX = 'trainable_matrix'
PRE_TRAINED_MATRIX = 'pre_trained_matrix'
USE_PRETRAINED = 'use_pretrained'
USE_PRETRAINED_NET = 'pre_train_subreddit'
VALIDATION_DATA = 'validation_data'
TRAINING_DATA = 'training_data'
TESTING_DATA = 'testing_data'
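The constants in `definitions.py` serve as keys into the parsed YAML config. A minimal sketch of that pattern, with a dict literal mirroring the template (the `get_param` helper and its default argument are illustrative, not part of the repository):

```python
# Key constants as defined in definitions.py (subset).
RNN_NEURONS = 'rnn_neurons'
RNN_UNIT = 'rnn_unit'
USE_PRETRAINED_NET = 'pre_train_subreddit'

# A dict standing in for one parsed config from config.yaml.
parsed_config = {
    RNN_NEURONS: 200,
    RNN_UNIT: 'lstm',
    USE_PRETRAINED_NET: False,
}

def get_param(config, key, default=None):
    """Fetch a hyperparameter, falling back to a default when absent."""
    return config.get(key, default)

unit = get_param(parsed_config, RNN_UNIT, 'lstm')
assert unit in ('lstm', 'gru')
```

Centralising key names this way is what makes renames like `lstm_neurons` → `rnn_neurons` in this PR a one-line change on the consumer side.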
28 changes: 28 additions & 0 deletions download_data.sh
@@ -0,0 +1,28 @@
#!/bin/bash

mkdir -p resources/datasets
cd resources/datasets

# Download embedding matrices
wget https://github.com/kandidat-highlights/data/raw/master/Glove/vectors100d.tar.gz
tar -xzf vectors100d.tar.gz
rm vectors100d.tar.gz

wget https://github.com/kandidat-highlights/data/raw/master/Glove/vectors150d.tar.gz
tar -xzf vectors150d.tar.gz
rm vectors150d.tar.gz

# Download datasets
wget https://github.com/kandidat-highlights/data/raw/master/allVotes/data_top50_users_subreddit_title_all_votes.tar.gz
tar -xzf data_top50_users_subreddit_title_all_votes.tar.gz
rm data_top50_users_subreddit_title_all_votes.tar.gz

wget https://github.com/kandidat-highlights/data/raw/master/allVotes/data_top5_users_subreddit_title_all_votes.tar.gz
tar -xzf data_top5_users_subreddit_title_all_votes.tar.gz
rm data_top5_users_subreddit_title_all_votes.tar.gz

wget https://github.com/kandidat-highlights/data/raw/master/top50/data_top50_users_subreddit_title.tar.gz
tar -xzf data_top50_users_subreddit_title.tar.gz
rm data_top50_users_subreddit_title.tar.gz

cd ../../
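The shell script above repeats a wget / tar -xzf / rm sequence for each archive. A Python sketch of the same download-and-extract pattern, using only the standard library; the function names and the choice to delete the archive afterwards mirror the script but are otherwise illustrative:

```python
# Sketch of download_data.sh's per-archive steps in Python:
# fetch a .tar.gz, unpack it, then delete the archive.
import os
import tarfile
import urllib.request

def fetch(url, dest):
    """Download url to dest (the wget step)."""
    urllib.request.urlretrieve(url, dest)

def extract_and_remove(archive, target_dir="."):
    """Unpack a .tar.gz (the tar -xzf step) and remove the archive (the rm step)."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target_dir)
    os.remove(archive)
```

A loop over the archive URLs would then replace the repeated blocks in the script, making it easier to add a new dataset.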
28 changes: 19 additions & 9 deletions main.py
@@ -21,27 +21,37 @@
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
# ==============================================================================
import sys
import argparse
import tensorflow as tf
from definitions import *
from model.util.networkconfig import yamlconfig as networkconfig
from model.model_builder import ModelBuilder

def main():
""" A main method that creates the model and starts training it """
# Parse arguments
parser = argparse.ArgumentParser(add_help=True)
parser.add_argument('configs', metavar='C', type=int, nargs='+',
parser.add_argument('configs', metavar='C', type=int, nargs='*',
help='Config number to use (can be multiple)')
args = parser.parse_args()
for conf in args.configs if args.configs else range(len(networkconfig)):
try:
print("Starting config ", conf)
config_file = networkconfig[conf]
with tf.Session() as sess:
builder = ModelBuilder(config_file, sess)

for conf in args.configs:
config_file = networkconfig[conf]
with tf.Session() as sess:
builder = ModelBuilder(config_file, sess)
network_model = builder.build()
network_model.train()
network_model.close_writers()
tf.reset_default_graph()
network_model = builder.build()
if config_file[USE_PRETRAINED_NET]:
network_model.train(USE_PRETRAINED_NET)
network_model.train()
network_model.close_writers()
tf.reset_default_graph()
except Exception as e:
print("Config ", networkconfig[conf]["name"], "failed to complete", file=sys.stderr)
print(e, file=sys.stderr)
tf.reset_default_graph()

if __name__ == "__main__":
main()
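The new `main.py` loop does two things: it falls back to running every config when none is given on the command line, and it catches per-config failures so one bad run cannot abort the rest. A stripped-down sketch of that control flow, with `run_config` as an illustrative stand-in for building and training a model:

```python
# Sketch of main.py's config loop: run the selected config indices, or all
# of them when none are selected, and keep going when one fails.
import sys

def run_config(config):
    """Stand-in for ModelBuilder + train; raises to simulate a failed run."""
    if config.get("broken"):
        raise ValueError("training failed")
    return "trained " + config["name"]

def run_all(configs, selected=None):
    """Run each requested config, collecting results and skipping failures."""
    results = {}
    for i in selected if selected else range(len(configs)):
        try:
            results[i] = run_config(configs[i])
        except Exception as e:
            # Mirror main.py: report the failure and move on to the next config.
            print("Config", configs[i]["name"], "failed:", e, file=sys.stderr)
    return results
```

Note the real loop also calls `tf.reset_default_graph()` in both the success and failure paths, so a crashed config leaves a clean graph for the next one.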
90 changes: 53 additions & 37 deletions model/model.py
@@ -46,14 +46,15 @@ def __init__(self, config, session):
self.latest_layer = None
self.output_weights = None
self.output_bias = None
self.l2_term = tf.constant(0, dtype=tf.float64)
self.l2_term = tf.constant(0, dtype=tf.float32)

self.vocabulary_size = config[VOC_SIZE]
self.user_count = config[USER_COUNT]
self.learning_rate = config[LEARN_RATE]
self.embedding_size = config[EMBEDD_SIZE]
self.max_title_length = config[MAX_TITLE_LENGTH]
self.lstm_neurons = config[LSTM_NEURONS]
self.rnn_neurons = config[RNN_NEURONS]
self.rnn_unit = config[RNN_UNIT]
self.batch_size = config[BATCH_SIZE]
self.training_epochs = config[TRAINING_EPOCHS]
self.use_l2_loss = config[USE_L2_LOSS]
@@ -62,18 +63,23 @@
self.dropout_prob = config[DROPOUT_PROB] # Only used for train op
self.hidden_layers = config[HIDDEN_LAYERS]
self.hidden_neurons = config[HIDDEN_NEURONS]
self.subreddit_input_neurons = config[SUB_INPUT_NEURONS]
self.is_trainable_matrix = config[TRAINABLE_MATRIX]
self.use_pretrained = config[USE_PRETRAINED]
self.use_constant_limit = config[USE_CONSTANT_LIMIT]
self.constant_prediction_limit = config[CONSTANT_PREDICTION_LIMIT]
self.use_concat_input = config[USE_CONCAT_INPUT]
self.use_pretrained_net = config[USE_PRETRAINED_NET]
self.subreddit_count = 0

# Will be set in build_graph
self.input = None
self.subreddit_input = None
self.target = None
self.sec_target = None
self.sigmoid = None
self.train_op = None
self.pre_train_op = None
self.error = None
self.init_op = None
self.saver = None
@@ -105,6 +111,7 @@

with tf.device("/cpu:0"):
self.data = data.Data(config)
self.subreddit_count = self.data.subreddit_count
if self.use_pretrained:
self.vocabulary_size = len(self.data.embedding_matrix)

@@ -172,6 +179,7 @@ def validate(self):
epoch, get_val_summary_tensor(val_prec), get_val_summary_tensor(train_prec), \
get_val_summary_tensor(val_recall), get_val_summary_tensor(train_recall)

# Currently not used. Saving for now. Might come in handy later
def validate_batch(self):
""" Validates a batch of data and returns cross entropy error """
with tf.device("/cpu:0"):
@@ -182,43 +190,40 @@ def validate_batch(self):
self.subreddit_input: batch_sub,
self.target: batch_label})

# TODO: this function does far too much;
# split up printing, computation, and training
def train(self):
def train(self, use_pretrained_net=False):
""" Trains the model on the dataset """
print("Starting training...")
if use_pretrained_net:
print("Pre-training on subreddits...")
else:
print("Starting training...")

if self.use_pretrained:
self._session.run(self.embedding_init,
feed_dict={self.embedding_placeholder:
self.data.embedding_matrix})
self.train_writer = \
tf.summary.FileWriter(self.logging_dir + '/' + TENSOR_DIR_TRAIN,
self._session.graph)
self.valid_writer = \
tf.summary.FileWriter(self.logging_dir + '/' + TENSOR_DIR_VALID)

old_epoch = 0

if self.epoch.eval(self._session) == 0:
if self.epoch.eval(self._session) == 0 and not use_pretrained_net:
self.validate()

# Train for a specified amount of epochs
for i in self.data.for_n_train_epochs(self.training_epochs,
self.batch_size):
for i in self.data.for_n_train_epochs(self.training_epochs, self.batch_size):
# Debug print out
epoch = self.data.completed_training_epochs
training_error = self.train_batch()
validation_error = self.validate_batch()

# Don't validate so often
if i % (self.data.train_size // self.batch_size // 10) == 0 and i:
done = self.data.percent_of_epoch
print("Validation error: {:f} | Training error: {:f} | Done: {:.0%}"
.format(validation_error, training_error, done))
if not use_pretrained_net:
self.train_batch()

# Don't print so often
if i % (self.data.train_size // self.batch_size // 10) == 0 and i:
done = self.data.percent_of_epoch
print("Epoch completion: {:.0%}".format(done))
else:
self.train_batch(True)

# Do a full evaluation once an epoch is complete
if epoch != old_epoch:
if epoch != old_epoch and not use_pretrained_net:
self._session.run(self.epoch.assign_add(1))
print("Epoch complete...old ", old_epoch)
self.save_checkpoint()
@@ -227,25 +232,36 @@ def train(self):

# Save model when done training
self.save_checkpoint()
log_samefile(config=self.config, f1_score_valid=self.f1_score_valid, f1_score_train=self.f1_score_train,
epoch_top=self.epoch_top, prec_valid=self.prec_valid, prec_train=self.prec_train,
recall_valid=self.recall_valid, recall_train=self.recall_train)
if not use_pretrained_net:
log_samefile(config=self.config, f1_score_valid=self.f1_score_valid, f1_score_train=self.f1_score_train,
epoch_top=self.epoch_top, prec_valid=self.prec_valid, prec_train=self.prec_train,
recall_valid=self.recall_valid, recall_train=self.recall_train)

def train_batch(self):
def train_batch(self, pre_train_net=False):
""" Trains for one batch and returns cross entropy error """
with tf.device("/cpu:0"):
batch_input, batch_sub, batch_label = \
self.data.next_train_batch()

self._session.run(self.train_op,
{self.input: batch_input,
self.subreddit_input: batch_sub,
self.target: batch_label})
if not pre_train_net:
batch_input, batch_sub, batch_label = \
self.data.next_train_batch()
else:
batch_input, batch_sub, batch_label = \
self.data.next_pre_train_batch()

if pre_train_net and self.use_concat_input:
self._session.run(self.pre_train_op,
{self.input: batch_input,
self.subreddit_input: batch_sub,
self.sec_target: batch_label})
elif pre_train_net:
self._session.run(self.pre_train_op,
{self.input: batch_input,
self.sec_target: batch_label})
else:
self._session.run(self.train_op,
{self.input: batch_input,
self.subreddit_input: batch_sub,
self.target: batch_label})

return self._session.run(self.error,
feed_dict={self.input: batch_input,
self.subreddit_input: batch_sub,
self.target: batch_label})
def close_writers(self):
""" Close tensorboard writers """
self.train_writer.close()
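The reworked `train_batch` above is the heart of the PR: the same entry point either runs a pre-training step against the secondary subreddit target (with or without the subreddit fed as a concatenated input) or a normal step against the main target. A minimal sketch of that dispatch, where the `*_step` functions are illustrative stand-ins for the `pre_train_op` / `train_op` session runs in `model.py`:

```python
# Sketch of train_batch's three feed paths: pre-training with concat input,
# plain pre-training, and normal training.
def train_batch(batch, pre_train=False, use_concat_input=False):
    """Dispatch one batch to the pre-training or main training step."""
    titles, subreddits, labels = batch
    if pre_train and use_concat_input:
        # Pre-training while also feeding the subreddit as an input feature.
        return pre_train_step(titles, subreddits, labels)
    elif pre_train:
        # Plain pre-training: only titles in, subreddit as the target.
        return pre_train_step(titles, None, labels)
    return train_step(titles, subreddits, labels)

def pre_train_step(titles, subreddits, sec_target):
    return ("pre_train", subreddits is not None)

def train_step(titles, subreddits, target):
    return ("train", True)
```

Keeping all three paths behind one function signature is what lets `train()` stay a single loop regardless of whether `pre_train_subreddit` is enabled.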