This project is an implementation of assignments for the university course 'Information Search Methods'. The course covers algorithms and methods for searching large volumes of data.
Task:
Write a program that builds a dictionary of terms from a given collection of text files.
- Input text files may be in any format.
- Each text file is at least 150 KB in size.
- There are at least 10 text files.
- The dictionary of terms is saved to disk.
- Estimate the size of the collection, the total number of words in the collection, and the size of the dictionary.
- Justify the chosen data structures.
- Estimate the complexity of the algorithm.
- Try several formats for saving the dictionary (serialization, saving to a text file, etc.) and compare the results.
Solution:
To speed up processing, I used the producer-consumer parallel programming pattern together with a map-reduce approach.
- 1 producer reads the documents and puts the data into a queue.
- N token-consumers read the data and split it into tokens. Tokenizing is done with the help of the nltk library. The token-consumers also act as mappers: they put the processed tokens, together with their positions in the text, into a queue for later reduction. Tokenizing algorithm:
- Split the data into words
- Remove accented characters
- Expand contractions
- Remove stopwords
- Remove special characters
- Apply stemming
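The steps above can be sketched as follows. This is a simplified stand-in using only the standard library; the project itself relies on nltk, and the stopword list, contraction table, and suffix-stripping stemmer below are illustrative placeholders, not the real components:

```python
import re
import unicodedata

# Illustrative placeholders; the real pipeline uses nltk's stopword
# corpus and a proper (e.g. Porter) stemmer.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and"}
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}

def strip_accents(text):
    # Decompose accented characters and drop the combining marks.
    return "".join(c for c in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(c))

def naive_stem(word):
    # Crude suffix stripping; a real pipeline would use a Porter stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    text = text.lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    text = strip_accents(text)
    words = re.findall(r"[a-z]+", text)   # split into words, drop special chars
    return [naive_stem(w) for w in words if w not in STOPWORDS]
```

For example, `tokenize("The cafés don't close; dancing continues!")` lower-cases, expands the contraction, strips the accent from "cafés", drops "the" as a stopword, and stems the remaining words.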
- M reducers group token positions by token and build a dictionary for every document.
- A writer process saves each dictionary to disk as a file whose name corresponds to the id of the source document. File ids are stored in a separate document.
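The pipeline above can be sketched in miniature. For brevity this assumed sketch uses threads and a single consumer stage rather than the project's separate mapper and reducer processes, and it records (token, position) pairs directly:

```python
import threading
import queue

SENTINEL = None  # signals consumers that the producer is done

def producer(docs, q):
    # 1 producer: read documents and put them into the queue.
    for doc_id, text in docs.items():
        q.put((doc_id, text))
    q.put(SENTINEL)

def consumer(q, results, lock):
    # Token-consumers / mappers: tokenize and record (token, position) pairs.
    while True:
        item = q.get()
        if item is SENTINEL:
            q.put(SENTINEL)   # re-post so the other consumers terminate too
            break
        doc_id, text = item
        tokens = [(tok, pos) for pos, tok in enumerate(text.split())]
        with lock:
            results[doc_id] = tokens

docs = {1: "to be or not to be", 2: "be brief"}
q = queue.Queue()
results, lock = {}, threading.Lock()
workers = [threading.Thread(target=consumer, args=(q, results, lock))
           for _ in range(3)]
for w in workers:
    w.start()
producer(docs, q)
for w in workers:
    w.join()
```

Re-posting the sentinel lets one end-of-stream marker shut down all N consumers.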
Supported languages: English
Supported extensions: txt, docx
Test results: the required volume of documents is processed in ~8-12 seconds (Intel Core i7, 8 cores).
Task:
For the given collection of documents (10 documents of at least 150 KB each), build:
- a term-document incidence matrix
- an inverted index
Justify the selected data storage structures in terms of their efficiency as data volumes grow.
Compare the sizes of the resulting structures.
Implement Boolean search over both structures.
Operators: AND, OR, NOT. The query format is at your own discretion.
Solution:
See the boolean_search folder for detailed information.
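For illustration, here is a minimal stand-alone sketch of an inverted index with the three Boolean operators; this is an assumption-level example, not the code in the boolean_search folder:

```python
from collections import defaultdict

def build_inverted_index(docs):
    # term -> set of document ids containing the term
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, a, b):
    # documents containing both terms
    return index[a] & index[b]

def boolean_or(index, a, b):
    # documents containing either term
    return index[a] | index[b]

def boolean_not(index, a, all_ids):
    # documents NOT containing the term
    return all_ids - index[a]

docs = {1: "new home sales", 2: "home sales rise", 3: "rise in prices"}
index = build_inverted_index(docs)
all_ids = set(docs)
```

Set operations keep the example short; a production index would store sorted posting lists and intersect them with a merge walk, which scales better as postings grow.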
Task:
Construct a two-word (biword) index and a coordinate (positional) inverted index for a collection of documents.
Implement phrase search and distance-based (proximity) search for each of them.
Solution:
See the two-word index and coordinate inverted index implementations, together with the Boolean search implementation.
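The coordinate-index side can be sketched as follows: phrase search becomes an intersection of position lists. This is an illustrative assumption, not the repository code:

```python
from collections import defaultdict

def build_positional_index(docs):
    # term -> {doc_id: [positions]}
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    terms = phrase.lower().split()
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = set()
    for doc_id in candidates:
        # The phrase matches if term i occurs at position p + i
        # for some starting position p of the first term.
        for p in index[terms[0]][doc_id]:
            if all(p + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits

docs = {1: "new home sales top forecasts", 2: "home sales rise"}
index = build_positional_index(docs)
```

Distance-based search follows the same shape, replacing the exact offset check `p + i` with a window `|p_a - p_b| <= k`.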
Task:
Construct a suffix tree of the vocabulary terms, a permuterm (permutation) index for the dictionary, and a 3-gram index for the dictionary. Implement support for wildcard queries.
Solution:
The algorithms require a B-tree structure, which has been implemented along with search methods over it. Currently only the suffix tree is operational, and it supports only a single wildcard in a query.
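Of the three structures, the 3-gram index is the easiest to sketch. In this minimal illustration (not the repository's B-tree-backed implementation), terms are padded with '$' boundary markers, indexed by their 3-grams, and a single-wildcard query intersects the posting sets of the grams from its fixed parts, with a post-filter to remove false positives:

```python
from collections import defaultdict

def kgrams(s, k=3):
    # All k-grams of a string; callers add '$' boundary markers themselves.
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def build_kgram_index(vocabulary, k=3):
    index = defaultdict(set)
    for term in vocabulary:
        for g in kgrams("$" + term + "$", k):
            index[g].add(term)
    return index

def wildcard_search(index, vocabulary, pattern, k=3):
    # Supports a single '*', e.g. "re*ve": take k-grams of the fixed
    # parts, intersect their posting sets, then post-filter survivors.
    prefix, suffix = pattern.split("*")
    grams = kgrams("$" + prefix, k) | kgrams(suffix + "$", k)
    candidates = set(vocabulary)
    for g in grams:
        candidates &= index[g]
    return {t for t in candidates
            if t.startswith(prefix) and t.endswith(suffix)}

vocabulary = {"retrieve", "relive", "remove", "red"}
index = build_kgram_index(vocabulary)
```

The post-filter is necessary because gram intersection alone can admit terms that contain the grams in the wrong order.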
Task:
Choose one of the methods for building an index:
- BSBI
- SPIMI
- MapReduce
- Dynamic construction
Using the chosen method, build an index of a large collection.
Solution: BSBI has been implemented.
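The chosen method can be illustrated in miniature. This is a sketch of blocked sort-based indexing under simplifying assumptions: in-memory lists stand in for the on-disk sorted runs, and terms are used directly instead of term ids:

```python
import heapq
from itertools import groupby

def invert_block(block):
    # BSBI per-block step: collect (term, doc_id) pairs and sort them.
    pairs = []
    for doc_id, text in block:
        for term in set(text.lower().split()):
            pairs.append((term, doc_id))
    return sorted(pairs)

def bsbi(docs, block_size=2):
    items = sorted(docs.items())
    # Each sorted run would normally be written to disk when the block
    # no longer fits in memory; here the runs are kept in memory.
    runs = [invert_block(items[i:i + block_size])
            for i in range(0, len(items), block_size)]
    # Final step: n-way merge of the sorted runs into one inverted index.
    index = {}
    for term, group in groupby(heapq.merge(*runs), key=lambda p: p[0]):
        index[term] = sorted({doc_id for _, doc_id in group})
    return index
```

`heapq.merge` is the in-memory analogue of BSBI's multiway merge over run files: it consumes the sorted runs lazily, so the full pair list never has to be re-sorted.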
Task:
Choose one of these methods and compress the dictionary and the inverted index of the collection:
- Dictionary: dictionary-as-a-string
- Dictionary: block compression
- Inverted file: gamma codes
Solution:
Block compression for the dictionary and gamma codes for the inverted index have been implemented.
Gamma codes are used to encode document ids to save space.
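A minimal sketch of Elias gamma coding, using strings of '0'/'1' for readability (the real implementation presumably packs bits):

```python
def gamma_encode(n):
    # Elias gamma: unary code for the offset length, then the offset.
    # Defined for positive integers only.
    assert n >= 1
    offset = bin(n)[3:]                 # binary of n without the leading 1
    return "1" * len(offset) + "0" + offset

def gamma_decode(bits):
    # Decode a concatenation of gamma codes back into integers.
    numbers, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":           # read the unary length prefix
            length += 1
            i += 1
        i += 1                          # skip the terminating '0'
        numbers.append(int("1" + bits[i:i + length], 2))
        i += length
    return numbers
```

Gamma codes are prefix-free, so concatenated codes decode unambiguously; small numbers get short codes, which is why postings are usually gamma-coded as gaps between ids rather than as raw ids.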
Task: Implement search by query over a collection of documents using zone-based ranking.
Task: Implement clustering of the document collection.