The project investigates the use of Machine Learning (ML) techniques, and more specifically Natural Language Processing (NLP) approaches, as a first step toward a better and more accurate knowledge of the content of solar-observation repositories, going beyond existing search techniques.
The work is mapped to the GEO Recommendation on Data Sharing Principles and addresses the FAIR (Findable, Accessible, Interoperable, Reusable) principles that GEO promotes: https://www.go-fair.org/fair-principles/
The Findable and Accessible principles are covered by the metadata and by a DOI pointing to the GitHub repository. The Reusable principle is exemplified in the project: we built an example on biogenetic data, showing that the same algorithm can run in any context.
To use these notebooks, you will need to set up the notebook environment using a terminal and run:
$ conda env create -f geoss_env.yml
If the creation of the virtual environment fails, all the libraries that need to be installed can be found in the imports file in the project.
Apart from the Jupyter Notebook sources, it is also necessary to download some external NLTK data packages in Python 3:
$ python3
>>> import nltk
>>> nltk.download('stopwords')
>>> nltk.download('punkt')
>>> nltk.download('averaged_perceptron_tagger')
Then, in your notebook, you must start the kernel within this environment. The kernel selector is located at the top right: by default you should have Python [env:default]; when the correct environment is selected, you should have Python [env:solarobs].
You can change the kernel in the menu Kernel > Change Kernel ...
If the environment is not shown, try to reload the JupyterLab web page.
The project consists of three general approaches that turned out to be less efficient, and a fourth, more specialized one that provided better insights.
Each implemented solution can be found in a separate source file, and they all run over the geocatalog webserver. Each file can be run cell by cell.
- Supervised classification based on keywords
A first approach to classification is to extract the most relevant keywords in the corpus and use them as labels for supervised classification. The first step is to build a keyword list that can be used for text classification. To do this, we take the list of keywords from the metadata records and select the ones that are most similar to the search keyword (above a predefined threshold). For each metadata record, we look at the keywords attached to it and, if they are also contained in the list of best keywords, the preprocessed metadata record is written to the training file, together with its corresponding best keywords as labels. The same text is also written to the test file, so the classifier can find the best category for each metadata record. If a metadata record has no keyword in the list, it is written to the test file, to be labeled by a machine learning algorithm. To determine the best label for each metadata record, we used the fastText classifier trained on the labeled file; a minimal sketch of this step follows the list below.
- Main file that should be run: Keyword_Clustering.ipynb
- External libraries: Requires a fastText installation, with its path set in the imports file
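
A minimal sketch of the fastText step, assuming a training file in fastText format (each line is `__label__<keyword>` followed by the preprocessed text). The file name, hyperparameters, and unlabeled-record list below are illustrative, not the project's actual values:

```python
# Minimal sketch: train a fastText supervised classifier on the labeled
# metadata records and use it to label the records without keywords.
import fasttext

# Hypothetical training file; each line looks like:
# __label__irradiance preprocessed text of a metadata record
model = fasttext.train_supervised(input="train.txt", epoch=25, lr=0.5)

# Label the records that had no keyword from the best-keywords list.
unlabeled = ["preprocessed text of a metadata record without keywords"]
for text in unlabeled:
    labels, probs = model.predict(text, k=1)
    print(labels[0], probs[0])
```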
- K-Means classification
K-means clustering is one of the basic, generic unsupervised algorithms for classifying elements in a vector space. Its objective is to group similar data points together into clusters by following patterns in the data. The first step is to define a number k, which represents the number of clusters that will be formed; the center of each cluster is called a centroid. The algorithm builds a corpus from the entries in the database and, using the word2vec algorithm, creates connections between words in order to compute similarities. Then, it identifies k centroids and allocates each data point to the nearest cluster. The algorithm starts with a first group of random centroids, then performs repeated calculations to optimize their positions. It stops either when the pre-set number of iterations has been reached or when the positions of the centers no longer change. In the case of natural language processing, word2vec embeddings are used to determine the position of each word in the multidimensional space. A minimal sketch of this pipeline follows the list below.
- Main file that should be run: K-Means Classification.ipynb
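
A minimal sketch of this pipeline, assuming gensim and scikit-learn; the toy corpus and the value of k are placeholders, whereas the notebook builds the corpus from the catalogue entries:

```python
# Minimal sketch: word2vec embeddings followed by K-means clustering.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

corpus = [["solar", "irradiance", "dataset"],
          ["ocean", "temperature", "buoy"],
          ["solar", "flare", "observation"]]

# Train word2vec so that similar words get nearby vectors.
w2v = Word2Vec(sentences=corpus, vector_size=100, min_count=1, seed=1)

# Represent each document by the mean of its word vectors.
doc_vectors = np.array([np.mean([w2v.wv[w] for w in doc], axis=0)
                        for doc in corpus])

# Allocate each document to the nearest of k centroids.
k = 2
km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(doc_vectors)
print(km.labels_)
```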
- Latent Semantic Analysis clustering
One of the most common techniques for automatically obtaining topics is Latent Semantic Analysis (LSA). It builds relationships between a set of documents and the terms they contain by producing a set of concepts related to both the documents and the terms.
As input, it receives a corpus of documents and returns a number of topics from it, together with the most prominent words of each. It uses the bag-of-words (BoW) model, which results in a term-document matrix (occurrences of terms in documents). On top of this, it builds a term frequency-inverse document frequency (tf-idf) representation, where each position in the vector corresponds to a different word and each document is represented by a weight per word: words that appear often in a document but rarely across the corpus receive the highest weights. LSA improves the process by also considering synonymy between words. It learns latent topics by performing a matrix decomposition on the document-term matrix using Singular Value Decomposition (SVD), a matrix factorization method that represents a matrix as the product of three matrices. A minimal sketch follows the list below.
- Main file that should be run: LDA_clustering.ipynb
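
A minimal sketch of LSA with scikit-learn: tf-idf on the term-document matrix followed by truncated SVD. The toy corpus and the number of topics are placeholders:

```python
# Minimal sketch: tf-idf + truncated SVD to recover latent topics.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = ["solar irradiance time series",
          "sunspot number observations",
          "ocean salinity measurements"]

# Build the tf-idf term-document matrix.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)

# Decompose the matrix into latent topics with SVD.
n_topics = 2
svd = TruncatedSVD(n_components=n_topics, random_state=1).fit(X)

# Print the most prominent words of each topic.
terms = vectorizer.get_feature_names_out()
for i, component in enumerate(svd.components_):
    top = component.argsort()[::-1][:3]
    print(f"topic {i}:", [terms[j] for j in top])
```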
- Supervised classifier - Thematic Words
In the case of short texts, such as metadata records, the best approach is to build a hierarchy of pre-defined words related to the topic and assign each text to those categories.
The approach in this case is the following (a minimal sketch follows the list below):
- run an unsupervised classifier for short texts to obtain several topics
- build a similarity matrix between each set of expert labels and the obtained topics
- add the similarities between each text and its topic to the matrix
- for each topic, arrange the results in descending order, based on the similarity
- Main file that should be run: Supervised Classifier-Thematic Words.ipynb
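
A minimal sketch of the similarity-matrix step; the word2vec model, the expert labels, and the topics are all illustrative placeholders, and the `embed` helper is hypothetical:

```python
# Minimal sketch: cosine-similarity matrix between expert label sets
# and the topics produced by the unsupervised step.
import numpy as np
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity

sentences = [["solar", "flare", "irradiance", "sun"],
             ["ocean", "salinity", "buoy", "marine"]]
w2v = Word2Vec(sentences=sentences, vector_size=50, min_count=1, seed=1)

def embed(words):
    # Hypothetical helper: mean word vector of a set of words.
    vecs = [w2v.wv[w] for w in words if w in w2v.wv]
    return np.mean(vecs, axis=0)

expert_labels = [["sun", "solar"], ["ocean", "marine"]]
topics = [["solar", "flare", "irradiance"], ["salinity", "buoy", "ocean"]]

# Rows: expert label sets; columns: topics from the unsupervised step.
sim_matrix = cosine_similarity(
    np.array([embed(ws) for ws in expert_labels]),
    np.array([embed(ws) for ws in topics]))
print(sim_matrix)
```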
More information can be found in the documentation from the doc folder and in the source files.