PinPinIre/Final-Year-Project

Final Year Project

My final year project for the Computer Science course in Trinity College Dublin.

The project assesses the performance of three machine learning algorithms for topic modelling and text clustering.

The algorithms being investigated are:

  • LDA (Latent Dirichlet Allocation)
  • KNN (K-Nearest Neighbours)
  • Word2Vec
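As a rough illustration of the K-Nearest Neighbours idea applied to text, the sketch below ranks documents by cosine similarity between bag-of-words vectors. It is not taken from this repository: the function names, the toy corpus, and the choice of cosine similarity over raw term counts are all assumptions made for the example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-case the text, split on whitespace, and count each term."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbours(query, corpus, k=2):
    """Return the names of the k documents most similar to the query text."""
    query_vec = bag_of_words(query)
    scored = [
        (cosine_similarity(query_vec, bag_of_words(text)), name)
        for name, text in corpus.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical toy corpus, used only to exercise the functions above.
TOY_CORPUS = {
    "doc_physics": "quantum field theory and particle physics",
    "doc_ml": "neural networks for machine learning",
    "doc_stats": "bayesian statistics and machine learning models",
}

if __name__ == "__main__":
    print(nearest_neighbours("machine learning with neural networks", TOY_CORPUS))
```

A real run over the arXiv corpus would likely need proper tokenisation, stop-word removal, and term weighting; this sketch only shows the nearest-neighbour ranking step.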

To preprocess the PDF files, run the "src/scripts/process_pdfs.sh" script on the corpus to convert them to plain text. The "src/scripts/sort_corpus.sh" script can then be used to sort the files into directories based on their arXiv topics and generate a log file of their distribution.
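The sorting step can be pictured with a small Python sketch. It does not reproduce "sort_corpus.sh"; it assumes, purely for illustration, that each file name begins with an old-style arXiv identifier (archive name followed by seven digits, such as "hep-th9901001.txt"), and the helper names are hypothetical.

```python
import re
from collections import Counter

# Assumption: file names start with an old-style arXiv identifier, i.e. an
# archive name such as "hep-th" or "cs" followed by seven digits (YYMMNNN).
OLD_STYLE_ID = re.compile(r"^([a-z]+(?:-[a-z]+)*)(\d{7})")


def topic_of(filename):
    """Return the archive (topic) prefix of a file name, or "unknown"."""
    match = OLD_STYLE_ID.match(filename)
    return match.group(1) if match else "unknown"


def topic_distribution(filenames):
    """Count how many files fall under each topic."""
    return Counter(topic_of(name) for name in filenames)


def format_log(distribution):
    """Render the distribution as tab-separated lines, sorted by topic."""
    return "\n".join(
        f"{topic}\t{count}" for topic, count in sorted(distribution.items())
    )


if __name__ == "__main__":
    # Hypothetical file names, used only to demonstrate the helpers.
    names = ["hep-th9901001.txt", "hep-th9901002.txt", "cs0301001.txt", "notes.txt"]
    print(format_log(topic_distribution(names)))
```

Moving each file into a directory named after its topic would follow the same grouping; the sketch stops at the counts so that it stays free of filesystem side effects.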

Run the "src/run_algorithm.py" Python script to generate the models, then run the "src/run_similarity.py" Python script to query them.

About

Public repository for my final year project for Integrated Computer Science in Trinity College Dublin
