A simple search engine for ITHome news.
Set up a conda environment and install the required dependencies:
conda create --name bce python=3.10 -y
conda activate bce
git clone https://github.com/vyj97/TinyVectorDB/
cd TinyVectorDB
python -m pip install -r requirements.txt
- Log in to your Hugging Face account.
- Go to https://huggingface.co/settings/tokens and generate a new token (token type: Read).
- Run the command below on your machine and paste the generated token:
huggingface-cli login
- Go to https://huggingface.co/maidalun1020/bce-embedding-base_v1/, then select "Agree and access repository".
Case 1: Provide the query sentence as a command-line argument
python3 search.py --query-sentence [MY QUERY SENTENCE] --limit [LIMIT NUM OF DOCS RETURNED] --num_articles [NUM OF DOCS PARSED FROM ITHOME]
# Example: python3 search.py --query-sentence 資安問題 --limit 3 --num_articles 30
Case 2: Provide the query sentence via keyboard input after starting the program (multiple queries can be entered)
python3 search.py --limit [LIMIT NUM OF DOCS RETURNED] --num_articles [NUM OF DOCS PARSED FROM ITHOME]
# Example: python3 search.py --limit 3 --num_articles 30
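The current approach scrapes all the news on the current page, then moves on to the next page, and repeats until the required number of articles has been collected. Instead, a list of URLs for the different pages could be prepared in advance. A library such as asyncio could then be used to fetch news from multiple pages concurrently and return the results all at once. This way, there is no need to wait for the current page to finish before starting on the next one.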
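The idea above could be sketched roughly as follows, using only the standard library. This is an illustrative sketch, not the repository's code: `PAGE_URL_TEMPLATE`, `build_page_urls`, `fetch_page`, and `fetch_all` are hypothetical names, and the URL pattern (including page numbering from 0) is an assumption about how ITHome paginates its news listing.

```python
import asyncio
import urllib.request
from typing import Awaitable, Callable, List

# Hypothetical URL pattern; the real ITHome pagination scheme may differ.
PAGE_URL_TEMPLATE = "https://www.ithome.com.tw/news?page={page}"


def build_page_urls(num_pages: int) -> List[str]:
    """Prepare the list of page URLs up front."""
    return [PAGE_URL_TEMPLATE.format(page=i) for i in range(num_pages)]


def _download(url: str) -> str:
    """Blocking download using only the standard library."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


async def fetch_page(url: str) -> str:
    """Run the blocking download in a worker thread so requests can overlap."""
    return await asyncio.to_thread(_download, url)


async def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]] = fetch_page,
    max_concurrency: int = 5,
) -> List[str]:
    """Fetch every URL concurrently and return the results in input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> str:
        # The semaphore caps how many requests are in flight at once.
        async with semaphore:
            return await fetch(url)

    return await asyncio.gather(*(bounded(u) for u in urls))
```

Calling `asyncio.run(fetch_all(build_page_urls(3)))` would return the HTML of three pages in the same order as the URL list. Parsing articles out of each page is left to the existing scraper. The `max_concurrency` limit is a design choice to avoid flooding the site with simultaneous requests.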
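The document embeddings are currently 768-dimensional. An algorithm such as PCA could be used to project them to a lower-dimensional representation before building the KD-Tree, which would reduce the amount of computation per query and improve search efficiency.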
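A minimal sketch of this idea, assuming only NumPy, might look like the following. It is not the repository's implementation: `fit_pca`, `project`, `build_kdtree`, and `nearest` are hypothetical helper names, PCA is computed here via SVD, and the KD-Tree is a simple median-split tree that returns a single nearest neighbour by Euclidean distance.

```python
import numpy as np


def fit_pca(embeddings: np.ndarray, n_components: int):
    """Return (mean, components) so that new vectors can be projected later."""
    mean = embeddings.mean(axis=0)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(vectors: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project vectors onto the retained principal components."""
    return (vectors - mean) @ components.T


def build_kdtree(points: np.ndarray, indices=None, depth: int = 0):
    """Recursively build a KD-Tree as nested dicts, splitting at the median."""
    if indices is None:
        indices = np.arange(len(points))
    if len(indices) == 0:
        return None
    axis = depth % points.shape[1]
    order = indices[np.argsort(points[indices, axis])]
    mid = len(order) // 2
    return {
        "index": int(order[mid]),
        "axis": axis,
        "left": build_kdtree(points, order[:mid], depth + 1),
        "right": build_kdtree(points, order[mid + 1:], depth + 1),
    }


def nearest(node, points: np.ndarray, query: np.ndarray, best=None):
    """Return (squared_distance, index) of the stored point closest to query."""
    if node is None:
        return best
    point = points[node["index"]]
    dist = float(np.sum((point - query) ** 2))
    if best is None or dist < best[0]:
        best = (dist, node["index"])
    diff = float(query[node["axis"]] - point[node["axis"]])
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, points, query, best)
    # Visit the far side only if the splitting plane is closer than the best match so far.
    if diff ** 2 < best[0]:
        best = nearest(far, points, query, best)
    return best
```

In use, the document embeddings would be reduced once with `fit_pca` and `project`, the tree built on the reduced vectors, and each query embedding projected with the same `mean` and `components` before calling `nearest`. Because the search runs in the reduced space, the result is an approximation of the nearest neighbour in the original 768-dimensional space, so the number of retained components trades accuracy against speed.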