FsuLauncherComp/text-compress

TextCompress

This package provides a Python class named TextCompressor for compressing text using various natural language processing techniques, including tokenization and stopword removal, lemmatization, paraphrasing, and summarization.

Motivation

While working with LLMs for embeddings, I found that many of the raw documents and pages I was passing into the context of my prompts contained a lot of unnecessary text that didn't contribute to the overall meaning of the document or page. This increased the number of tokens used in the context, which in turn increased the size of the embeddings. I wanted to find a way to reduce the size of the embeddings without losing the meaning of the document or page. This package is a very basic first attempt.

Installation

To install the text_compression package:

Clone the repository:

git clone https://github.com/FsuLauncherComp/text-compress.git

Install the text_compression package from the root of the cloned repository:

pip install .

After installation, you will need to download the following NLTK data:

  • punkt
  • stopwords
  • wordnet

As well as the following spaCy model:

  • en_core_web_sm

This can be done by calling run() from post_install.py:

from text_compress.post_install import run

run()
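If you prefer to fetch these resources by hand, the following is a minimal sketch of what such a download step could look like. It is not the contents of post_install.py; the function name download_resources is hypothetical, and the sketch assumes NLTK and spaCy are already installed in the active environment.

```python
import subprocess
import sys

# Resource names taken from the lists above.
NLTK_PACKAGES = ["punkt", "stopwords", "wordnet"]
SPACY_MODEL = "en_core_web_sm"


def download_resources():
    """Download the NLTK data and the spaCy model (hypothetical helper)."""
    # Imported lazily so this module can be loaded without NLTK installed.
    import nltk

    for name in NLTK_PACKAGES:
        nltk.download(name)

    # Same as running `python -m spacy download en_core_web_sm` in a shell.
    subprocess.run(
        [sys.executable, "-m", "spacy", "download", SPACY_MODEL],
        check=True,
    )
```

Calling download_resources() once after installation should be enough; both NLTK and spaCy keep the downloaded resources for later runs.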

Usage

Import the TextCompressor class:

from text_compression import TextCompressor

Create an instance of the TextCompressor class with the input text:

text = "This is an example of text compression using a callable class. Users can choose specific techniques to implement or use all techniques."

compressor = TextCompressor(text)

Call the compress() method with a list of techniques to apply:

compressed_text = compressor.compress(techniques=['remove_stopwords_and_tokenize', 'generate_paraphrase', 'generate_summary'])
print(compressed_text)

If no techniques are specified, all available techniques are applied by default:

compressed_text = compressor.compress()
print(compressed_text)
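To make the "named techniques, or all of them by default" behavior concrete, here is a minimal sketch of one way such a dispatch could work. It is an illustration only, not the package's implementation: apply_techniques, TECHNIQUES, and the two toy techniques are hypothetical stand-ins that use plain string operations instead of NLP libraries.

```python
from typing import Callable, Dict, List, Optional


def collapse_whitespace(text: str) -> str:
    """Toy technique: squeeze runs of whitespace into single spaces."""
    return " ".join(text.split())


def lowercase(text: str) -> str:
    """Toy technique: lowercase the text."""
    return text.lower()


# Registry mapping technique names to functions (insertion order is kept).
TECHNIQUES: Dict[str, Callable[[str], str]] = {
    "collapse_whitespace": collapse_whitespace,
    "lowercase": lowercase,
}


def apply_techniques(text: str, techniques: Optional[List[str]] = None) -> str:
    """Apply the named techniques in order; apply all of them if none are given."""
    names = list(TECHNIQUES) if techniques is None else techniques
    for name in names:
        if name not in TECHNIQUES:
            raise ValueError(f"Unknown technique: {name!r}")
        text = TECHNIQUES[name](text)
    return text
```

With this sketch, apply_techniques("  Hello   World ") runs both toy techniques and returns "hello world", while apply_techniques("  Hello   World ", ["lowercase"]) only lowercases the input.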

Available techniques:

  • remove_stopwords_and_tokenize: Tokenize the text and remove stopwords.

  • apply_lemmatization: Lemmatize the text.

  • generate_paraphrase: Paraphrase the text using TextBlob.

  • generate_summary: Summarize the text using the LexRank algorithm from the sumy library.
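As a rough illustration of the first technique, the sketch below tokenizes text and drops stopwords using only the standard library. It is not how the package does it: a tiny hand-written stopword set and a regular expression stand in for the NLTK stopword list and tokenizer, and the names STOPWORDS and remove_stopwords are hypothetical.

```python
import re

# Small illustrative stopword set; NLTK's English list is far larger.
STOPWORDS = {"a", "an", "the", "is", "of", "to", "or", "can", "this", "all", "using"}


def remove_stopwords(text: str) -> str:
    """Split text into word tokens and drop those found in STOPWORDS."""
    # Keep runs of letters, digits, and apostrophes; punctuation is discarded.
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    kept = [token for token in tokens if token.lower() not in STOPWORDS]
    return " ".join(kept)


print(remove_stopwords("This is an example of text compression"))
# prints: example text compression
```

Even this crude version shows the trade-off: the output is shorter and keeps the content words, but it is no longer a grammatical sentence.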
